Filtering data with panpipes

Running the pipeline

Following the ingest tutorial we now have all the single cell data combined into a single .h5mu object. In the previous tutorial we specified the sample_id: teaseq in the pipeline.yml config file, which ensured we saved the teaseq_unfilt.h5mu object as the main output of the ingest workflow. Within the .h5mu object we stored the computed qc metrics.

It’s time to filter the cells to make sure to exclude bad quality cells for downstream analysis.

In your previously created the teaseq directory, create a new folder to run the panpipes preprocess workflow.

# if you are in teaseq/ingest
# cd ..
mkdir preprocessing && cd preprocessing

In here, run panpipes preprocess config, which will generate again a pipeline.yml file for you to customize. You can review and download the preprocess pipeline.yml here.

Open the yml file to inspect the parameters choice.

If you have run the previous step, Ingesting data with panpipes you will have a *.unfilt.h5mu object that you want to apply filtering on. This file is the input that we specify in the yaml, where we also specify the modalities that are included in the object:

#-------------------------------
# General project specifications
#-------------------------------
sample_prefix: teaseq
unfiltered_obj: teaseq_unfilt.h5mu

modalities:
  rna:  True
  prot: True
  rep: False
  atac: True

NOTE: it’s important that the .h5mu object is linked (or renamed) in this directory with a sampleprefix_unfilt structure, cause the pipeline will look for this string to start from.

Either copy the file into the directory you have just created or link the file into the new directory (our preferred choice)

cd preprocessing
ln -s ../teaseq_unfilt.h5mu .
# equivalent of
ln -s ../teaseq_unfilt.h5mu teaseq_unfilt.h5mu 

If you have a custom .h5mu dataset that you didn’t generate with the panpipes ingest workflow, you can use panpipes preprocess workflow to filter the cells according to your own qc criteria.

In panpipes preprocess workflow, we use a dictionary for filtering, which allows the maximum flexibility to filter your cells and features for each modality, allowing you to provide custom names.

The filtering process in panpipes is sequential as it goes through the filtering dictionary. For each modality, starting with rna, it will first filter on obs and then vars. Each modality has a dictionary in the following format.

# MODALITY
# obs:
  # min:
  # max:
  # bool
# var:
  # min:
  # max:
  # bool

This format can be applied to any modality by editing the filtering dictionary and you are not restricted by the columns given as default.

For example, the filtering in the rna modality retains cells with at least 100 counts and less than 40% mt and 25% doublet scores

#------------------------
# RNA-specific filtering
rna:
    # obs, i.e. cell level filtering
    obs:
      min:
        n_genes_by_counts: 100 
      max:
        pct_counts_mt: 40
        pct_counts_rp: 100
        doublet_scores: 0.25
      bool:

Since we’re using the inner join of the mudata object, all the cells passing filtering criteria in all modalities will be retained.

Lastly, in the preprocess workflow we also apply dimensionality reduction (PCA for RNA and PROT, and a choice between PCA and LSI for ATAC )

While it’s common to apply Dimensionality reduction techniques on high-features-numbers assays such as ATAC and RNA, it’s not entirely obvious if it’s beneficial to apply dimensionality reduction on PROT which usually features smaller panels of Features to begin with. This experiment has 47 Features, so we set the number of PCS to 10. If we specify a bigger number of components, the pipeline will automatically reset it to the number of features-1. This is common for all the modalities, although you may never encounter this issue on a big non-targeted RNA experiment!

now run panpipes preprocess make full --local

Look at the log file as it’s generated logs/filtering.log to see what percentage of cells you are filtering based on the selection criteria we specified in the yml.

2023-11-21 18:16:11,864: INFO - Before filtering 2196
2023-11-21 18:16:11,865: INFO - rna
2023-11-21 18:16:11,865: INFO - filter rna pct_counts_mt to less than 40
2023-11-21 18:16:11,883: INFO - Remaining cells 2013
2023-11-21 18:16:11,883: INFO - Remaining cells 2013
2023-11-21 18:16:11,883: INFO - filter rna pct_counts_rp to less than 100
2023-11-21 18:16:11,889: INFO - Remaining cells 2013
2023-11-21 18:16:11,889: INFO - Remaining cells 2013
2023-11-21 18:16:11,889: INFO - filter rna doublet_scores to less than 0.25
2023-11-21 18:16:11,895: INFO - Remaining cells 1917
2023-11-21 18:16:11,895: INFO - Remaining cells 1917
2023-11-21 18:16:11,895: INFO - filter rna n_genes_by_counts to more than 100
2023-11-21 18:16:11,902: INFO - Remaining cells 1914
2023-11-21 18:16:11,902: INFO - filter rna n_cells_by_counts to more than 1
2023-11-21 18:16:11,915: INFO - Remaining cells 1914
2023-11-21 18:16:11,915: INFO - prot
2023-11-21 18:16:11,915: INFO - atac
2023-11-21 18:16:12,229: INFO - intersecting barcodes in rna,prot,atac
2023-11-21 18:16:12,329: INFO - Remaining cells 1914
2023-11-21 18:16:12,329: INFO - (1914, 129082)
2023-11-21 18:16:12,468: INFO - cell_counts
  modality sample_id  n_cells
0      rna    teaseq     1914
1     prot    teaseq     1914
2     atac    teaseq     1914

Let’s take a look at some of the plots produced now.

We can see that we have been effective at filtering some of the cells which lied away from the correlation between the RNA and PROT modalities, looking at the scatterplot of log1p_nUMI in RNA vs PROT across the 3 teaseq samples of the original experiment.

After filtering, we ran normalization, scaling and dimensionality reduction for each modality. We can look at each modality dimensionality reduction, for example the prot PCA highlights a cluster with low UMI counts:

but in the RNA modality, the low-counts cells are distributed across multiple cell groups:

and same for the ATAC:

Changing parameters

You can choose to modify the parameters and re-run a specific task, for example after removing the log file logs/preprocess_rna.log , running panpipes preprocess make rna_preprocess will specifically run the processing task on the rna and exit the pipeline.

If you want to make changes to other parameters like changing normalization methods and then HVG selection and dimensionality reduction, rename the teaseq.h5mu object from the folder and re-run the workflow. (You can also remove all outputs except for the pipeline.yml)

For example, we can change the dimensionality reduction of the ATAC modality from PCA to LSI in the configuration file, as shown in the following excerpt from the pipeline.yml file:

#------------------------------
# ATAC Dimensionality reduction
dimred: LSI  #PCA or LSI
n_comps: 50  #How many components to compute

Then, to apply this change and make sure panpipes re-processes the object with the new dimensionality reduction, we need to remove the previous log file. Note that we can simply change its name to keep the record of the previous run, i.e. : mv logs/preprocess_atac.log logs/preprocess_log1p_atac.log . The new run will create a new log file logs/preprocess_atac.log reflecting the new request.

We also rename the teaseq object with log-normalized counts and PCA for atac teaseq_atac_pca.h5mu since we will use it later.

Now, to apply these changes, we run panpipes preprocess make full --local again. After the workflow finishes, we can inspect the resulting MuData and check for changes in the ["atac"] slot. As you can see, we now have a new normalization layer 'signac_norm', a new set of Highly Variable Features and the LSI for the atac modality.

import muon as mu
mdata = mu.read("teaseq.h5mu")
atac = mdata["atac"]
atac
AnnData object with n_obs × n_vars = 1914 × 107739
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_ATAC', 'nFeature_ATAC', 'original_barcodes', 'n_fragments', 'n_duplicate', 'n_mito', 'n_unique', 'altius_count', 'altius_frac', 'gene_bodies_count', 'gene_bodies_frac', 'peaks_count', 'peaks_frac', 'tss_count', 'tss_frac', 'barcodes', 'cell_name', 'well_id', 'chip_id', 'batch_id', 'pbmc_sample_id', 'DoubletScore', 'DoubletEnrichment', 'TSSEnrichment', 'dataset', 'sample_id', 'n_genes_by_counts', 'total_counts', 'n_counts'
    var: 'features', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'dataset_colors', 'hvg', 'log1p', 'lsi' 
    obsm: 'X_lsi'
    varm: 'LSI'
    layers: 'signac_norm', 'raw_counts'

Next uni and multimodal integration

Note: We find that keeping the suggested directory structure (one main directory by project with all the individual steps in separate folders) is useful for project management. You can of course customize your directories as you prefer, and change the paths accordingly in the pipeline.yml config files!