Ingesting multimodal (CITE-Seq + VDJ) datasets from cellranger outputs

Datasets and directories

Panpipes can read files from different cellranger outputs. Here we will showcase the example of two multmodal (CITE-Seq + VDJ) datasets from the 10x website while ingesting cellranger multi outputs.

Download dataset 1 ( 10k human PBMCs) from the 10x website

Download dataset 2 (CMV + 5K human Tcells ) from the 10x website

For this tutorial we downloaded the dataset 1 and dataset 2 files from the website and created folders within the ‘outs’ folder to simulate the directory structure of cellranger multi outputs as required by panpipes, as mentioned in our documentation on sample submission

# Here we show the directory structure of the outs folder for the 5k human CMV + Tcells dataset
# only with the essential files needed by panpipes ingest
tree -L 4 ./outs
outs
|-- multi
|   |-- count
|   |   |-- feature_reference.csv
|   |   |-- raw_feature_bc_matrix
|   |   |   |-- barcodes.tsv.gz
|   |   |   |-- features.tsv.gz
|   |   |   `-- matrix.mtx.gz
|   |   |-- raw_feature_bc_matrix.h5
|   `-- vdj_t
`-- per_sample_outs
    `-- human_cmv
        |-- count
        |   |-- feature_reference.csv
        |   |-- sample_filtered_barcodes.csv
        |   |-- sample_filtered_feature_bc_matrix
        |   |-- sample_filtered_feature_bc_matrix.h5
        |-- metrics_summary.csv
        `-- vdj_t
            |-- filtered_contig_annotations.csv

Submission and yaml file

We created a sample submission file which will instruct panpipes on how to find each modality’s path. Download this submission file here.

Besides the first column, “sample_id”, the order in which the columns are provided is not fixed, but the column names are fixed! Failing to specify the column names will result in omission of the modality from the analysis and early stopping of the pipeline.

We find useful to generate the submission file with softwares like Numbers or Excel and save the output as a txt file to ensure that the file is properly formatted. For more examples please check our documentation on sample submission files.

As we are ingesting CITE-Seq data, we created a protein metadata table, with information on if the antibody was an isotype or a hashing antibody. It is important to note that the first column is equivalent to the first column of cellranger’s features.tsv.gz. Download the protein metadata table for this tutorial here.

Run panpipes ingest

Now, activate the environment in which you have installed panpipes, create a 1_ingest folder and configure the ingest workflow within that folder

mkdir 1_ingest
cd 1_ingest
panpipes ingest config

panpipes ingest config This command will generate a config file, pipeline.yml. Modify the config file, or just use sample submission file we provided. You can find the preconfigured pipeline.yml file here.

As you see, since we have raw cellranger multimodal inputs, we can turn on a few tasks that require raw data, for example:

parsing and plotting 10x metrics for all the modalities

plot_10X_metrics: True

estimation of background signal for proteins and rna

assess_background: True

normalization of the protein data using dsb

normalisation_methods: clr,dsb

Please remember to apply the necessary changes in this file to ensure it will run on your computer, and specify:

the environment in which you’re running panpipes, if applicable.
the path to the custom_genes_file that we use to run scanpy.score.genes (we provide an example file in the panpipes’ resources folder)

Now, run the full panpipes ingestion workflow with:

panpipes ingest make full

The pipeline will write to standard output and to a pipeline.log file about the steps it’s running. When it’s finished, you will see a pipeline finished message:

2023-11-27 10:07:10,498 INFO main task - Completed Task = 'pipeline_ingest.full' 
2023-11-27 10:07:10,499 INFO main experiment - job finished in 978 seconds at Mon Nov 27 10:07:10 2023 -- 27.26 12.18  0.14  0.52 -- 85d537bd-19b2-41cd-8134-5d615c1342e5

Panpipes ingest outputs

Let’s inspect the outputs we have generated with panpipes ingest in the ‘1_ingest’ directory”

tree -L 2 ./1_ingest
.
|-- 10x_metrics.csv
|-- _downsampled_cell_metadata.tsv
|-- figures
|   |-- atac
|   |-- background
|   |-- barplot_cellcounts_thresholds_filter.png
|   |-- prot
|   |-- rep
|   |-- rna
|   |-- rna_v_prot
|   `-- tenx_metrics
|-- logs
|   |-- assess_background.log
|   |-- concat_bg_mudatas.log
|   |-- concat_filtered_mudatas.log
|   |-- human_cmv_bg_downsampled.log
|   |-- human_pbmc_bg_downsampled.log
|   |-- load_bg_mudatas_human_cmv.log
|   |-- load_bg_mudatas_human_pbmc.log
|   |-- load_mudatas_human_cmv.log
|   |-- load_mudatas_human_pbmc.log
|   |-- plot_qc.log
|   |-- run_dsb_clr.log
|   |-- run_scanpy_qc_prot.log
|   |-- run_scanpy_qc_rep.log
|   |-- run_scanpy_qc_rna.log
|   |-- run_scrublet_human_cmv.log
|   |-- run_scrublet_human_pbmc.log
|   `-- tenx_metrics_multi_aggregate.log
|-- mm_bg.h5mu
|-- mm_cell_metadata.tsv
|-- mm_prot_qc_metrics_per_sample_id.csv
|-- mm_threshold_filter.tsv
|-- mm_threshold_filter_explained.tsv
|-- mm_unfilt.h5mu
|-- pipeline.log
|-- pipeline.yml
|-- scrublet
|   |-- human_cmv_doubletScore_histogram.png
|   |-- human_cmv_scrublet_scores.txt
|   |-- human_pbmc_doubletScore_histogram.png
|   `-- human_pbmc_scrublet_scores.txt
`-- tmp
    |-- human_cmv.h5mu
    |-- human_cmv_raw.h5mu
    |-- human_pbmc.h5mu
    `-- human_pbmc_raw.h5mu

The final MuData object with computed QC metrics is mm_unfilt.h5mu. A MuData object without QC metrics for each supplied sample in the sample submisison_file is also available and stored in the tmp folder (you can delete this folder to save space). The metadata of the final Mudata object is additionally extracted and saved as a tsv file, mm_cell_metadata.tsv. Lastly, the per sample ADT metrics for each antibody are extracted and also saved as a tsv file, mm_prot_qc_metrics_per_sample_id.csv.

Moreover a MuData object containing information for the background or the raw cellranger output is also created and is mm_bg.h5mu. This is either all the barcodes in the raw cellranger or a downsample object if the downsample param has been set to True in the pipeline.yml (in this case, we have set it to True).

The ingest pipleine also aggregates and outputs all the cellranger summary metrics for all the samples as a tsv file, 10x_metrics.csv.

Additionaly, plots for all modalities for all the summary metrics are individually plotted across all samples and can be found under the figures/tenx_metrics directory. This is a nice way to compare and contrast all the samples in a single panpipes run for the relevant metrics.

10x metrics plots

With the plots in the figures/tenx_metrics, you can evaluate the result of the cellranger multi run for your samples. For example, one can check the sequencing quality of the samples by evaluating:

the scatter plots of the number of cells vs the median umis in the log10 scale for the RNA modality.
the sequencing saturation per vs Number of reads per sample for the RNA modality.

We also plots different metrics across all samples for all the inputted modalities. It might be useful to examine the number of cells across all the samples for rna + prot,BCR and TCR modalities.

one can also look at other relevant 10x_metrics such as the Mean or median reads/UMI counts/genes per cell for the different modalities.

Background plots

The ingest pipeline also plots the cumulative barcode rank plots for all samples (also known as “kneeplot”) , so that it is easier to contrast and compare them and evaluate the number of barcodes called as cells vs not. These plots can be found in figures/background

Moreover, we also evaluate the downsampled background for the rna and prot modality separately and compared to each for all samples. For example a scatter plot visualising the nUMI counts in the empty barcodes (background) vs the cell containing barcodes (foreground) in the protein vs rna.

Additionally, we also plot the most highly expressed genes and proteins in the background as boxplots and heatmaps. This helps you evaluate potential sources of ambient contamination in the samples, like heamoglobin contamination or high mitochondrial gene expression or high background expression of particular antibodies.

RNA QC plots

panpipes ingest workflow computes qc metrics for each modality given as input. For a multimodal sample it may be useful to check RNA metrics under the figures/rna such as number of genes detected in cells vs the total UMI counts , percentage of mitochondrial reads per cell vs library size and doublet scores across the different samples.

Protein QC and normalisation plots

For the PROT metrics it might be useful to check the total counts and mean counts of per ADT across samples. The figures can be found under figures/prot.

As for the prot modality clr and dsb normalisation can also be performed in the panpipes ingest step, we can also visualise the normalised ridge plots for 1) clr & 2) dsb normalisation for each of the ADTs for each sample in the sample_submission file.

The dsb normalisation corrects for background ADT noise and cell-to-cell technical variations and can help to identify a cleaner expression protein signal in comparison to the clr normalisation. Here we show the clr and dsb ridge plots for the human_cmv dataset.

Repertoire QC plots

For the QC of of VDJ data, we utilise the scirpy python package function scirpy.tl.chain_qc to perform QC based on the receptor-chain pairing configuration and categorise the cells into their receptor types and subtypes. This classification of the cells can be checked in the plots under figures/rep.

We can also visualise the proportion of cells belonging to expanded clonotypes for the TCR and BCR data across samples. For example in the human_cmv sample, we see there is a high fraction of cells with expanded TCR clones, but this is not the case of the BCR clones in the human_pbmc sample.

Multiple Filter threshold plots

To aid with the filtering of the data, we also produce outputs that simulate common filtering scenarios, the height of the bar shows the percentage of cells retained if the threshold is applied.

drawing

Users can also inspect the unfiltered mudata generated file by reading it in a python session:

import muon as mu
mu.read("mm_unfilt.h5mu")

We have demonstrated how to run the ingest workflow on multimodal (CITE-Seq + VDJ) data with multiple samples. Filtering of cells and genes is not applied in the ingest workflow but in the preprocess. Inspecting these output should help the user to choose appropriate filters for their data!

Users can follow the preprocessing tutorial for an example on how to filter and further process the data.