Ingesting multimodal (CITE-Seq + VDJ) datasets from cellranger outputs

Datasets and directories

Panpipes can read files from different cellranger outputs. Here we will showcase the example of two multmodal (CITE-Seq + VDJ) datasets from the 10x website while ingesting cellranger multi outputs.

Download dataset 1 ( 10k human PBMCs) from the 10x website

Download dataset 2 (CMV + 5K human Tcells ) from the 10x website

For this tutorial we downloaded the dataset 1 and dataset 2 files from the website and created folders within the ‘outs’ folder to simulate the directory structure of cellranger multi outputs as required by panpipes, as mentioned in our documentation on sample submission

# Here we show the directory structure of the outs folder for the 5k human CMV + Tcells dataset
# only with the essential files needed by panpipes ingest
tree -L 4 ./outs
outs
|-- multi
|   |-- count
|   |   |-- feature_reference.csv
|   |   |-- raw_feature_bc_matrix
|   |   |   |-- barcodes.tsv.gz
|   |   |   |-- features.tsv.gz
|   |   |   `-- matrix.mtx.gz
|   |   |-- raw_feature_bc_matrix.h5
|   `-- vdj_t
`-- per_sample_outs
    `-- human_cmv
        |-- count
        |   |-- feature_reference.csv
        |   |-- sample_filtered_barcodes.csv
        |   |-- sample_filtered_feature_bc_matrix
        |   |-- sample_filtered_feature_bc_matrix.h5
        |-- metrics_summary.csv
        `-- vdj_t
            |-- filtered_contig_annotations.csv

Submission and yaml file

We created a sample submission file which will instruct panpipes on how to find each modality’s path. Download this submission file here.

Besides the first column, “sample_id”, the order in which the columns are provided is not fixed, but the column names are fixed! Failing to specify the column names will result in omission of the modality from the analysis and early stopping of the pipeline.

We find useful to generate the submission file with softwares like Numbers or Excel and save the output as a txt file to ensure that the file is properly formatted. For more examples please check our documentation on sample submission files.

As we are ingesting CITE-Seq data, we created a protein metadata table, with information on if the antibody was an isotype or a hashing antibody. It is important to note that the first column is equivalent to the first column of cellranger’s features.tsv.gz. Download the protein metadata table for this tutorial here.

Run panpipes ingest

Now, activate the environment in which you have installed panpipes, create a 1_ingest folder and configure the ingest workflow within that folder

mkdir 1_ingest
cd 1_ingest
panpipes ingest config

panpipes ingest config This command will generate a config file, pipeline.yml. Modify the config file, or just use sample submission file we provided. You can find the preconfigured pipeline.yml file here.

As you see, since we have raw cellranger multimodal inputs, we can turn on a few tasks that require raw data, for example:

  • parsing and plotting 10x metrics for all the modalities

plot_10X_metrics: True
  • estimation of background signal for proteins and rna

assess_background: True
  • normalization of the protein data using dsb

normalisation_methods: clr,dsb

Please remember to apply the necessary changes in this file to ensure it will run on your computer, and specify:

  • the environment in which you’re running panpipes, if applicable.

  • the path to the custom_genes_file that we use to run scanpy.score.genes (we provide an example file in the panpipes’ resources folder)

Now, run the full panpipes ingestion workflow with:

panpipes ingest make full

The pipeline will write to standard output and to a pipeline.log file about the steps it’s running. When it’s finished, you will see a pipeline finished message:

2023-11-27 10:07:10,498 INFO main task - Completed Task = 'pipeline_ingest.full' 
2023-11-27 10:07:10,499 INFO main experiment - job finished in 978 seconds at Mon Nov 27 10:07:10 2023 -- 27.26 12.18  0.14  0.52 -- 85d537bd-19b2-41cd-8134-5d615c1342e5

Panpipes ingest outputs

Let’s inspect the outputs we have generated with panpipes ingest in the ‘1_ingest’ directory”

tree -L 2 ./1_ingest
.
|-- 10x_metrics.csv
|-- _downsampled_cell_metadata.tsv
|-- figures
|   |-- atac
|   |-- background
|   |-- barplot_cellcounts_thresholds_filter.png
|   |-- prot
|   |-- rep
|   |-- rna
|   |-- rna_v_prot
|   `-- tenx_metrics
|-- logs
|   |-- assess_background.log
|   |-- concat_bg_mudatas.log
|   |-- concat_filtered_mudatas.log
|   |-- human_cmv_bg_downsampled.log
|   |-- human_pbmc_bg_downsampled.log
|   |-- load_bg_mudatas_human_cmv.log
|   |-- load_bg_mudatas_human_pbmc.log
|   |-- load_mudatas_human_cmv.log
|   |-- load_mudatas_human_pbmc.log
|   |-- plot_qc.log
|   |-- run_dsb_clr.log
|   |-- run_scanpy_qc_prot.log
|   |-- run_scanpy_qc_rep.log
|   |-- run_scanpy_qc_rna.log
|   |-- run_scrublet_human_cmv.log
|   |-- run_scrublet_human_pbmc.log
|   `-- tenx_metrics_multi_aggregate.log
|-- mm_bg.h5mu
|-- mm_cell_metadata.tsv
|-- mm_prot_qc_metrics_per_sample_id.csv
|-- mm_threshold_filter.tsv
|-- mm_threshold_filter_explained.tsv
|-- mm_unfilt.h5mu
|-- pipeline.log
|-- pipeline.yml
|-- scrublet
|   |-- human_cmv_doubletScore_histogram.png
|   |-- human_cmv_scrublet_scores.txt
|   |-- human_pbmc_doubletScore_histogram.png
|   `-- human_pbmc_scrublet_scores.txt
`-- tmp
    |-- human_cmv.h5mu
    |-- human_cmv_raw.h5mu
    |-- human_pbmc.h5mu
    `-- human_pbmc_raw.h5mu

The final MuData object with computed QC metrics is mm_unfilt.h5mu. A MuData object without QC metrics for each supplied sample in the sample submisison_file is also available and stored in the tmp folder (you can delete this folder to save space). The metadata of the final Mudata object is additionally extracted and saved as a tsv file, mm_cell_metadata.tsv. Lastly, the per sample ADT metrics for each antibody are extracted and also saved as a tsv file, mm_prot_qc_metrics_per_sample_id.csv.

Moreover a MuData object containing information for the background or the raw cellranger output is also created and is mm_bg.h5mu. This is either all the barcodes in the raw cellranger or a downsample object if the downsample param has been set to True in the pipeline.yml (in this case, we have set it to True).

The ingest pipleine also aggregates and outputs all the cellranger summary metrics for all the samples as a tsv file, 10x_metrics.csv.

Additionaly, plots for all modalities for all the summary metrics are individually plotted across all samples and can be found under the figures/tenx_metrics directory. This is a nice way to compare and contrast all the samples in a single panpipes run for the relevant metrics.

10x metrics plots

With the plots in the figures/tenx_metrics, you can evaluate the result of the cellranger multi run for your samples. For example, one can check the sequencing quality of the samples by evaluating:

  1. the scatter plots of the number of cells vs the median umis in the log10 scale for the RNA modality.

  2. the sequencing saturation per vs Number of reads per sample for the RNA modality.

img1 img2

We also plots different metrics across all samples for all the inputted modalities. It might be useful to examine the number of cells across all the samples for rna + prot,BCR and TCR modalities.

img3 img4 img5

one can also look at other relevant 10x_metrics such as the Mean or median reads/UMI counts/genes per cell for the different modalities.

img6 img7 img8

Background plots

The ingest pipeline also plots the cumulative barcode rank plots for all samples (also known as “kneeplot”) , so that it is easier to contrast and compare them and evaluate the number of barcodes called as cells vs not. These plots can be found in figures/background

img9

Moreover, we also evaluate the downsampled background for the rna and prot modality separately and compared to each for all samples. For example a scatter plot visualising the nUMI counts in the empty barcodes (background) vs the cell containing barcodes (foreground) in the protein vs rna.

img10

Additionally, we also plot the most highly expressed genes and proteins in the background as boxplots and heatmaps. This helps you evaluate potential sources of ambient contamination in the samples, like heamoglobin contamination or high mitochondrial gene expression or high background expression of particular antibodies.

img11 img12

RNA QC plots

panpipes ingest workflow computes qc metrics for each modality given as input. For a multimodal sample it may be useful to check RNA metrics under the figures/rna such as number of genes detected in cells vs the total UMI counts , percentage of mitochondrial reads per cell vs library size and doublet scores across the different samples.

img13 img14 img15 img16

Protein QC and normalisation plots

For the PROT metrics it might be useful to check the total counts and mean counts of per ADT across samples. The figures can be found under figures/prot.

img17 img18

As for the prot modality clr and dsb normalisation can also be performed in the panpipes ingest step, we can also visualise the normalised ridge plots for 1) clr & 2) dsb normalisation for each of the ADTs for each sample in the sample_submission file.

The dsb normalisation corrects for background ADT noise and cell-to-cell technical variations and can help to identify a cleaner expression protein signal in comparison to the clr normalisation. Here we show the clr and dsb ridge plots for the human_cmv dataset.

img19 img20

Repertoire QC plots

For the QC of of VDJ data, we utilise the scirpy python package function scirpy.tl.chain_qc to perform QC based on the receptor-chain pairing configuration and categorise the cells into their receptor types and subtypes. This classification of the cells can be checked in the plots under figures/rep.

img21 img22 img23

We can also visualise the proportion of cells belonging to expanded clonotypes for the TCR and BCR data across samples. For example in the human_cmv sample, we see there is a high fraction of cells with expanded TCR clones, but this is not the case of the BCR clones in the human_pbmc sample.

img24 img25

Multiple Filter threshold plots

To aid with the filtering of the data, we also produce outputs that simulate common filtering scenarios, the height of the bar shows the percentage of cells retained if the threshold is applied.

drawing

Users can also inspect the unfiltered mudata generated file by reading it in a python session:

import muon as mu
mu.read("mm_unfilt.h5mu")

We have demonstrated how to run the ingest workflow on multimodal (CITE-Seq + VDJ) data with multiple samples. Filtering of cells and genes is not applied in the ingest workflow but in the preprocess. Inspecting these output should help the user to choose appropriate filters for their data!

Users can follow the preprocessing tutorial for an example on how to filter and further process the data.