Author: Hui Ma, Yiming Yang, Rimte Rocher
Date: 2022-03-09
Notebook Source: plotting_tutorial.ipynb
import pegasus as pg
For this plotting tutorial, we provide an analysis result of gene-count matrix dataset on Human Bone Marrow with 8 donors. You can get the data from https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip, or use gsutil to download via its Google bucket URL (gs://terra-featured-workspace/Cumulus/MantonBM_result.zarr.zip):
After downloading, load the file using Pegasus read_input
function:
data = pg.read_input("MantonBM_result.zarr.zip")
data
2022-03-09 00:07:00,912 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded. 2022-03-09 00:07:00,913 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.85s.
MultimodalData object with 1 UnimodalData: 'GRCh38-rna' It currently binds to UnimodalData object GRCh38-rna UnimodalData object with n_obs x n_vars = 35465 x 25653 Genome: GRCh38; Modality: rna It contains 2 matrices: 'X', 'raw.X' It currently binds to matrix 'X' as X obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno' var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank' obsm: 'X_diffmap', 'X_fle'(basis), 'X_pca', 'X_pca_harmony', 'X_phi', 'X_tsne'(basis), 'X_umap'(basis), 'diffmap_knn_distances'(knn), 'diffmap_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'pca_harmony_knn_indices'(knn) varm: 'means', 'partial_sum', 'de_res' obsp: 'W_diffmap', 'W_pca_harmony' uns: 'genome', 'louvain_resolution', 'modality', 'norm_count', 'pca_features', 'stdzn_max_value', 'PCs', 'diffmap_evals', 'ncells', 'stdzn_mean', 'stdzn_std', '_attr2type', 'df_qcplot', 'pca'
In the following sections, we'll cover Pegasus plotting functions using this dataset. Moreover, for gene plots, the canonical gene markers below will be used:
marker_genes = ['CD38', 'JCHAIN', 'FCGR3A', 'HLA-DPA1', 'CD14', 'CD79A', 'MS4A1', 'CD34', 'TRAC', 'CD3D', 'CD8A',
'CD8B', 'GYPA', 'NKG7', 'CD4', 'SELL', 'CCR7']
The first step in preprocessing is to perform the quality control analysis, and remove cells and genes of low quality.
pg.qcviolin shows the effect of quality control more intuitively by presenting the violin plot of cell distribution before and after filtration.
plot_type='gene' shows the number of expressed cells before and after filtration.
pg.qcviolin(data, plot_type='gene', dpi=100)
Quality control stats on number of percentage of mitochondrial genes:
pg.qcviolin(data, plot_type='mito', dpi=100)
The number of UMIs before and after filtration is also an important aspect of quality control.
pg.qcviolin(data, plot_type='count', dpi=100)
Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.
Pegasus provides hvfplot
function to generate a scatterplot of genes upon HVG selection. This plot only works for Pegasus-flavor HVGs (i.e. flavor='pegasus'
in Pegasus highly_variable_features
function).
After selecting 2000 HVGs using the Pegasus selection method, the plot below is generated. Each point stands for one gene. Blue points are selected to be HVGs, which account for the majority of variation of the dataset. By default, it prints labels of 20 top HVGs. You can change this number in top_n
parameter.
pg.hvfplot(data, dpi=200)
Composition plot is a bar plot showing the cell compositions (under different conditions) in each cluster. Below is to show the composition of different samples in each Louvain cluster:
fig = pg.compo_plot(data, 'louvain_labels', 'Channel', style = 'frequency')
Composition plot is useful to fast assess library quality and batch effects.
Scatter plot requires at least 2 parameters
For this demonstration, we select annotation and channel as data attributes, and tsne as basis.
pg.scatter(data, attrs=['anno','Channel'], basis='tsne')
You can show both cluster-specific cell types (anno
) and samples (Channel
) in one function.
Note here, the legend of left plot is not fully dislplayed. We need to adjust the horizon distance between two plots by changing "wspace"
"wspace" is the ratio of horizon distance to width of plots. For example ,the default value of wspace is 0.4, which means the gap is 0.4 times the width of plot.
pg.scatter(data, attrs=['anno','Channel'], basis='tsne', wspace=1.2)
There are also other optional parameters. For example, the commanly used ones
1.0
) – Alpha value for blending, from 0.0 (transparent) to 1.0 (opaque). If this is a list, the length must match attrs, which means we set a separate alpha value for each attribute.300.0
) – The resolution of the figure in dots-per-inch.For the plots below, Left one has alpha=1.0, while right one has alpha=0.1
pg.scatter(data, attrs=['Channel','Channel'], basis='tsne',alpha=[1.0,0.1], wspace=0.6)
Plots with lower dpi are smaller and have lower resolution.
pg.scatter(data, attrs=['Channel'], basis='tsne',dpi=50)
UMAP Plot. Pegasus also provides 2 different kinds of UMAP plots: umap
, or net_umap
.
The embedding methods can be changed by changing basis
pg.scatter(data,attrs=['anno', 'Channel'],basis='umap', wspace=1.2)
Cell Developmental Trajectory. Pegasus provides Force-directed Layout (FLE) based plots to show pseudo-trajectories of cell development. To show it, you need to first run diffmap
, then FLE-like visualization, fle
or net_fle
, and finally plot using scatter
with the corresponding basis:
pg.scatter(data,attrs='anno', basis='fle')
Besides plotting cell attributes in data.obs
, you can also specify gene names in attrs
parameter of scatter
function to generate Feature Plots, which is useful for inspecting specific genes in details.
Below are feature plots of 4 genes, together with UMAP plot of cell type annotations:
pg.scatter(data, attrs=['TRAC', 'CD79A', 'CD14', 'CD34', 'anno'], basis='umap', ncols=2)
As the plots show, TRAC is highly expressed in T-cell clusters, CD79A in B-cell and Plasma-cell clusters, CD14 in CD14+ Monocytes, and CD34 in the Stem-cell cluster.
By using scatter_groups
function, user can generate a set of scatter plots against a categorical cell attribute, for example Channel
, one plot regarding a category.
Below is an example
pg.scatter_groups(data, attr='louvain_labels', groupby='Channel', basis='tsne',
nrows = 3, ncols = 3, legend_fontsize=10)
There are 8 categories in Channel
attribute, since the dataset contains 8 samples. As a result, 9 FIt-SNE plots are generated: first one shows cells from all samples, followed by 8 plots, one of which only shows cells from one sample.
You can see that all the 8 sample-specific plots have similar cell distribution regarding clusters. This is due to the batch correction on the original data.
The violin plot gives us an idea of the distribution of gene expression values across clusters. The violin plot below shows the expression of the marker gene set we use in this tutorial against Louvain clusters:
pg.violin(data, attrs=marker_genes, groupby='anno', dpi=80)
Notice that all the genes' violin plots are stacked together. And you can check that their expression levels across clusters are pretty consistent with our knowledge.
Pegasus violin plot can also be split regarding a categorical cell attributes with 2 categories. For example, below is a plot showing expression of gene XIST and RPS4Y1 with cells split into male and female donors:
pg.violin(data, attrs=['XIST', 'RPS4Y1'], groupby='anno', hue='gender', dpi=100)
It's easy to see that in the plot above, XIST is highly expressed in female samples across all clusters, while RPS4Y1 is highly expressed in male samples.
Besides Violin Plot, Pegasus also provides Heat Map.
It is also useful to add a dendrogram to the graph to bring together similar clusters or genes, based on the expression of selected genes.
By default, heatmap
shows a Heat Map based on average expression of selected genes within each cluster, and cluster similar genes altogether. You can set attrs_cluster=True
to merge similar clusters:
pg.heatmap(data, attrs=marker_genes, attrs_cluster=True, groupby='anno', dpi=80)
switch_axes=True can switch x-axis (genes) and y-axis (clusters):
pg.heatmap(data, attrs=marker_genes, groupby='anno', groupby_cluster=True, switch_axes=True, dpi=80)
In addition, you can also show Heat Map based on expression of individual cell by setting on_average=False. Notice that this can take some time for a large dataset.
pg.heatmap(data, attrs=marker_genes, groupby='anno', on_average=False,
attrs_cluster=True, groupby_cluster=False, groupby_dendrogram=False, dpi=80)
/Users/yangy197/miniconda3/envs/py37/lib/python3.7/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg)
By comparing with Violin plot, you can see the Heat maps are consistent with it.
Dot plot is an alternative to Heat map. Besides the mean expression of a gene within a cluster (dot color), it also conveys the fraction of cells within the cluster expressing that gene (dot size):
pg.dotplot(data, genes=marker_genes, groupby='anno')
Similarly as heatmap
, switch_axes=True switches the two axes:
pg.dotplot(data, genes=marker_genes, groupby='anno',switch_axes=True)
Pegasus provides dendrograms based on hierarchical clustering. There are 2 ways for clustering: one is regarding the expression of a selected set of genes:
pg.dendrogram(data, genes=marker_genes, groupby='anno', dpi=100)
Another way is to use a cell embedding for clustering. As this dataset is corrected using Harmony algorithm, with the corrected PCA embedding stored at data.obsm['X_pca_harmony']
, we can simply set rep
parameter in dendrogram
to 'pca_harmony'
to use this embedding for clustering:
pg.dendrogram(data, rep='pca_harmony', groupby='anno', dpi=100)
Now we have a different dendrogram.
Differential Expression (DE) Analysis is to discover cluster-specific marker genes. For each cluster, it compares cells within the cluster with all the others, then finds genes significantly highly expressed (up-regulated) and lowly expressed (down-regulated) for the cluster.
After running de_analysis
function in Pegasus, we can use Volcano plot to see the DE result. Here we use Louvain cluster 1 for illustration. You can see that this cluster is annotated as "Naive T cell":
data.obs.loc[data.obs['louvain_labels']=='1', 'anno'].astype('str').value_counts()
Naive T cell 6290 Name: anno, dtype: int64
Below is its Volcano plot:
pg.volcano(data, cluster_id = '1', dpi=200)
The plot uses the default thresholds: Log fold change at $\pm 1.0$, and Q-value at $0.05$. Each point stands for a gene. Red ones are significant marker genes: those at right-hand side are up-regulated genes for Cluster 1, while those at left-hand side are down-regulated genes.
In specific, TRAC gene, a marker gene of T cells, is the second to the rightmost up-regulated gene for Cluster 1. This is consistent with its cell type annotation.
The clusters used in Volcano plot depend on how you run de_analysis
function. For this dataset, we performed DE analysis on Louvain clusters, as a result, we have to use their labels in Volcano plot.
By default, volcano
uses t test results. User can set de_test
parameter to either 'fisher'
or 'mwu'
to use other test result, as long as those tests have been performed in de_analysis
ahead.
See Pegasus_Tutorial for tutorial on running downstream analysis using Pegasus.
See Plotting_Documentation for details on parameters of Pegasus plotting functions.