Ingesting multiome datasets from cellranger arc outputs

Panpipes can read files from different cellranger outputs. Here we showcase a multiome dataset from the 10x website. Download the files from the 10x website before you start.

For this tutorial, we downloaded the files into a folder, 10Xmultiome_granulocytes, and generated soft links that reproduce the standard output filenames of the cellranger arc pipeline.

These filenames are the expected inputs to panpipes.

Create your ingestion folder and organize the input data.

mkdir momedocs && cd $_
mkdir data.dir

# Here we show how the input data is organized

tree data.dir
data.dir
├── atac_fragments.tsv.gz -> ../10Xmultiome_granulocytes/atac_fragments.tsv.gz
├── atac_fragments.tsv.gz.tbi -> ../10Xmultiome_granulocytes/atac_fragments.tsv.gz.tbi
├── atac_peak_annotation.tsv -> ../10Xmultiome_granulocytes/atac_peak_annotation.tsv
├── filtered_feature_bc_matrix.h5 -> ../10Xmultiome_granulocytes/filtered_feature_bc_matrix.h5
├── pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz -> ../10Xmultiome_granulocytes/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
├── pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi -> ../10Xmultiome_granulocytes/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi
├── pbmc_granulocyte_sorted_10k_atac_peak_annotation.tsv -> ../10Xmultiome_granulocytes/pbmc_granulocyte_sorted_10k_atac_peak_annotation.tsv
├── pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 -> ../10Xmultiome_granulocytes/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
├── pbmc_granulocyte_sorted_10k_per_barcode_metrics.csv -> ../10Xmultiome_granulocytes/pbmc_granulocyte_sorted_10k_per_barcode_metrics.csv
├── pbmc_granulocyte_sorted_10k_summary.csv
└── summary.csv -> pbmc_granulocyte_sorted_10k_summary.csv
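
The soft links can be created by hand with ln -s, or scripted. The following is a minimal sketch, assuming the downloaded files all carry the prefix pbmc_granulocyte_sorted_10k_ and that dropping the prefix gives the name panpipes expects. The helper name link_without_prefix is hypothetical and not part of panpipes, and the links it creates point at the prefixed files, so the layout may differ slightly from the listing above.

```python
import os
from pathlib import Path


def link_without_prefix(source_dir, target_dir, prefix):
    """Create relative symlinks in target_dir that drop `prefix` from each file name.

    Hypothetical helper for illustration only; it is not part of panpipes.
    Returns the names of the links that were created.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(source_dir.iterdir()):
        if not src.name.startswith(prefix):
            continue  # leave files without the prefix alone
        link = target_dir / src.name[len(prefix):]
        if link.exists() or link.is_symlink():
            continue  # never overwrite an existing file or link
        # Use a relative target so the links survive moving the project folder
        link.symlink_to(os.path.relpath(src, start=target_dir))
        created.append(link.name)
    return created
```

From inside momedocs, a call such as link_without_prefix("10Xmultiome_granulocytes", "data.dir", "pbmc_granulocyte_sorted_10k_") would then populate data.dir; running it twice is harmless because existing links are skipped.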

We created a sample submission file which will instruct panpipes on how to find each modality’s path. Download this submission file here.

Apart from the first column, “sample_id”, the order of the columns is not fixed, but the column names are! Failing to specify the column names correctly will result in the modality being omitted from the analysis and the pipeline stopping early. We find it useful to generate the submission file with software such as Numbers or Excel and to save the output as a tab-delimited txt file to ensure that the file is properly formatted. For more examples, please check our documentation on sample submission files.

This is the sample submission file we are using for this tutorial:

sample_id	atac_path	atac_filetype	fragments_file	per_barcode_metrics_file	peak_annotation_file	tissue	diagnosis	rna_path	rna_filetype
mome_granulocytes	data.dir/filtered_feature_bc_matrix.h5	10X_h5	data.dir/atac_fragments.tsv.gz	data.dir/pbmc_granulocyte_sorted_10k_per_barcode_metrics.csv	data.dir/atac_peak_annotation.tsv	granulocytes	healthy	data.dir/filtered_feature_bc_matrix.h5	10X_h5
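
Because a misnamed column causes the modality to be dropped, it can help to check the file before launching the pipeline. The following is a minimal sketch using only the Python standard library. The function name check_submission and the TUTORIAL_COLUMNS list are assumptions made for illustration: the list is copied from the file shown here and is not the full set of columns that panpipes accepts.

```python
import csv

# Columns used in this tutorial's submission file (copied from the example;
# this is NOT an exhaustive list of what panpipes accepts).
TUTORIAL_COLUMNS = [
    "sample_id", "rna_path", "rna_filetype", "atac_path", "atac_filetype",
    "fragments_file", "per_barcode_metrics_file", "peak_annotation_file",
]


def check_submission(handle, expected=TUTORIAL_COLUMNS):
    """Return a list of problems found in a tab-delimited submission file.

    Hypothetical helper, not part of panpipes. An empty list means no problem
    was detected by these simple checks.
    """
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or not rows[0]:
        return ["file is empty or has no header"]
    header, problems = rows[0], []
    if header[0] != "sample_id":
        problems.append("first column must be 'sample_id'")
    for name in expected:
        if name not in header:
            problems.append(f"missing column: {name}")
    # Every data row should have exactly as many fields as the header
    for n, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {n}: expected {len(header)} fields, found {len(row)}")
    return problems
```

To use it, open the submission file with open("multiomecaf.txt", newline="") and pass the handle to check_submission. A file that was accidentally saved as comma-separated is reported as having a wrong first column and missing columns.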

Make sure the submission file is in the main directory you have generated:

ls -l momedocs

10Xmultiome_granulocytes
data.dir
multiomecaf.txt

Now, activate the environment in which you have installed panpipes and configure the ingest workflow.

panpipes ingest config

This command will generate a config file, pipeline.yml. Modify the config file to read in the sample submission file provided. You can find the preconfigured pipeline.yml file here.

Please remember to apply the necessary changes in this file to ensure it will run on your computer, and specify:

the environment in which you’re running panpipes, if applicable.
the path to the custom_genes_file that we use to run scanpy.tl.score_genes (we provide an example file in the panpipes resources folder)

Now, run the full panpipes ingestion workflow with:

panpipes ingest make full

The pipeline reports the steps it is running to standard output and to a pipeline.log file. When it has finished, you will see a message like this:

# 2023-10-12 16:02:07,160 INFO Completed Task = 'pipeline_ingest.full' 
# 2023-10-12 16:02:07,267 INFO job finished in 74 seconds at Thu Oct 12 16:02:07 2023 -- 10.63  1.72 61.48  7.78 -- badc0e60-8451-4e84-b47e-3754f3f913e4
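
When the pipeline runs unattended, scanning pipeline.log for this completion line is a quick way to confirm it finished. The following is a minimal sketch: the regular expression is modelled on the log line shown above, so other panpipes versions may word it differently, and the function names completed_tasks and ingest_finished are hypothetical.

```python
import re

# Pattern modelled on the completion line in this tutorial's log (an assumption
# about the log format, not a documented panpipes interface).
COMPLETED = re.compile(r"INFO\s+Completed Task = '(?P<task>[^']+)'")


def completed_tasks(log_text):
    """Return the names of tasks reported as completed, in order of appearance."""
    return [m.group("task") for m in COMPLETED.finditer(log_text)]


def ingest_finished(log_text, target="pipeline_ingest.full"):
    """True if the log reports that the target task completed."""
    return target in completed_tasks(log_text)
```

For example, ingest_finished(open("pipeline.log").read()) returns True once the final task has been logged.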

Let’s inspect the outputs we have generated with panpipes ingest:

tree ../momedocs -L 1

├── 10Xmultiome_granulocytes
├── 10x_metrics.csv
├── data.dir
├── figures
├── logs
├── mome_cell_metadata.tsv
├── mome_threshold_filter.tsv
├── mome_threshold_filter_explained.tsv
├── mome_unfilt.h5mu
├── multiomecaf.txt
├── pipeline.log
├── pipeline.yml
├── raw_mome.tar.gz
├── scrublet
└── tmp

file	file type	info
10Xmultiome_granulocytes	directory	the folder we downloaded
10x_metrics.csv	text file	output file from parsing the cellranger summary_metrics file
data.dir	directory	the folder with input files we organized
figures	directory	folder storing plots generated throughout the ingest workflow
logs	directory	folder storing logs generated throughout the ingest workflow
mome_cell_metadata.tsv	text file	the cell metadata of the experiment (the .obs slot of the mudata object) saved as a tsv file
mome_threshold_filter.tsv	text file	summary file showing percentage of cells remaining for commonly used QC thresholds on genes and cells. This filtering is not applied in the ingest workflow.
mome_threshold_filter_explained.tsv	text file	file containing the thresholds used to produce the previous file
mome_unfilt.h5mu	h5mu	the mudata generated from the input files
multiomecaf.txt	text file	input sample submission file
pipeline.log	log	pipeline log file, stores all the info on the commands run
pipeline.yml	yaml	input yaml file
scrublet	directory	directory storing the scrublet analysis results
tmp	directory	directory storing temporary h5mu files (one for each sample in the submission file)

Let’s take a look at the figures folder to demonstrate how we have used the workflow to run some analyses on the newly generated mudata.

The panpipes ingest workflow computes QC metrics for each modality given as input. For a multiome sample, it may be useful to check RNA metrics such as the number of genes detected per cell, the percentage of mitochondrial reads per cell, and the library size.

Some useful ATAC metrics are also plotted, including the percentage of fragments in peaks, mitochondrial reads mapping to the open chromatin regions, and TSS enrichment.

To aid with filtering the data, we also produce outputs that simulate common filtering scenarios: the height of each bar shows the percentage of cells retained if the threshold is applied.
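
The quantity behind each bar is simply the share of cells that pass a threshold. The following is a minimal sketch of that calculation on made-up numbers. The function name percent_retained is hypothetical, and treating the threshold boundary as inclusive is an assumption rather than a statement about how panpipes computes it.

```python
def percent_retained(values, max_value=None, min_value=None):
    """Percentage of cells whose metric lies within [min_value, max_value].

    Hypothetical helper; bounds are inclusive, which is an assumption.
    """
    values = list(values)
    if not values:
        raise ValueError("no cells provided")
    kept = [
        v for v in values
        if (min_value is None or v >= min_value)
        and (max_value is None or v <= max_value)
    ]
    return 100.0 * len(kept) / len(values)


# Toy per-cell mitochondrial percentages (made-up numbers, not from the dataset)
pct_mt = [2.0, 4.5, 7.0, 12.0, 25.0, 3.0, 9.0, 18.0]
for threshold in (5, 10, 20):
    share = percent_retained(pct_mt, max_value=threshold)
    print(f"pct_mt <= {threshold}: {share:.1f}% of cells retained")
```

Looser thresholds retain more cells, which is the trade-off the bar plots are meant to make visible before any filter is chosen.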

Users can also inspect the unfiltered mudata file by reading it in a Python session:

import muon as mu
# Load the unfiltered MuData object written by the ingest workflow
mdata = mu.read("mome_unfilt.h5mu")
print(mdata)

We have demonstrated how to run the ingest workflow on multiome data. Filtering of cells and genes is not applied in the ingest workflow but in the preprocess workflow. Inspecting these outputs should help the user choose appropriate filters for their data!