Processing a custom single-cell sample

In this tutorial we will process a custom single-cell sample.

As an example we will be using 1 million reads from this Visium dataset.

Note

Firstly, the example data used here is a 10X Visium dataset, hence it is spatial. However, for the sake of this tutorial, we will be treating it as a single-cell sample.

Secondly, for many methods (such as Visium, 10X Chromium, Slide-seq or Seq-scope) spacemake provides pre-defined variables. If you are using one of these methods, follow our Quick start guide instead.

Step 1: install and initialize spacemake

To install spacemake follow the installation guide here.

To initialize spacemake follow the initialization guide here.

Step 2: download test data

For the sake of this tutorial we will work with a test dataset: 1 million Read1 and 1 million Read2 reads from a Visium adult mouse brain.

To download the test data:

wget -nv http://bimsbstatic.mdc-berlin.de/rajewsky/spacemake-test-data/visium/test_fastq/visium_public_lane_joined_1m_R1.fastq.gz
wget -nv http://bimsbstatic.mdc-berlin.de/rajewsky/spacemake-test-data/visium/test_fastq/visium_public_lane_joined_1m_R2.fastq.gz
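
Before moving on, it can be worth checking that both downloads are intact and contain the expected number of reads. The following is a minimal sketch, not part of spacemake: count_fastq_reads is a helper name introduced here for illustration, and it assumes the standard FASTQ layout of four lines per read.

```shell
# Count the reads in a gzipped FASTQ file (assumes 4 lines per read).
# count_fastq_reads is an illustrative helper, not a spacemake command.
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

for f in visium_public_lane_joined_1m_R1.fastq.gz \
         visium_public_lane_joined_1m_R2.fastq.gz; do
    # gzip -t checks the integrity of the compressed file without unpacking it
    if [ -f "$f" ] && gzip -t "$f"; then
        echo "$f: $(count_fastq_reads "$f") reads"
    else
        echo "$f: missing or corrupt" >&2
    fi
done
```

For the test data described above, both files should report the same number of reads.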

Note

If you already have data to process and analyze, you can skip this step.

Step 3: add a new species

Note

If you initialized spacemake with the --download-species flag, you can omit this step, as spacemake will automatically download and configure mm10 mouse genome.fa and annotation.gtf files for you.

The sample we are working with here is a mouse brain sample, so we have to add a new species:

spacemake config add_species --name mouse \
--annotation /path/to/mouse/annotation.gtf \
--genome /path/to/mouse/genome.fa

Step 4: add a new barcode_flavor

The barcode_flavor determines which nucleotides of Read1/Read2 the UMIs and cell-barcodes are extracted from.

In this particular test sample, the first 16 nucleotides of Read1 are the cell-barcode, and the following 12 nucleotides are the UMI.

Consequently, we create a new barcode_flavor like this:

spacemake config add_barcode_flavor --name test_barcode_flavor \
--cell_barcode r1[0:16] \
--umi r1[16:28]
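
To make the two ranges concrete, here is a small sketch that applies them by hand to a single sequence. It assumes that r1[0:16] and r1[16:28] are 0-based, end-exclusive ranges, which matches the description above (the first 16 nucleotides, then the next 12). The sequence is made up for illustration and does not come from the test data.

```shell
# Made-up start of a Read1 sequence: 16 nt barcode + 12 nt UMI + 4 extra nt.
read1="ACGTACGTACGTACGTTTTTGGGGCCCCAAAA"

# r1[0:16] corresponds to characters 1-16 (cut is 1-based and inclusive)
cell_barcode=$(printf '%s\n' "$read1" | cut -c1-16)

# r1[16:28] corresponds to characters 17-28
umi=$(printf '%s\n' "$read1" | cut -c17-28)

echo "cell barcode: $cell_barcode"
echo "UMI:          $umi"
```

Here the cell barcode is ACGTACGTACGTACGT and the UMI is TTTTGGGGCCCC; the trailing AAAA is not covered by either range.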

Note

There are several barcode_flavors provided by spacemake out of the box, such as visium for 10X Visium or sc_10x_v2 for 10X Chromium v2 kits. The default flavor is identical to a Drop-seq library, with 12 nucleotide cell-barcode and 8 nucleotide UMI.

More info about provided flavors here.

If you want to use one of these, there is no need to add your own flavor.

Step 5: add a new run_mode

A run_mode in spacemake defines how a sample should be processed downstream. In this tutorial, we will trim the PolyA stretches from the 3’ end of Read2, count both exonic and intronic reads, expect 5000 cells, turn off multi-mapper counting (so only uniquely mapped reads are counted), and analyze the data using UMI cutoffs of 50, 100 and 300. To set these parameters, we define a test_run_mode like this:

spacemake config add_run_mode --name test_run_mode \
--polyA_adapter_trimming True \
--count_mm_reads False \
--n_beads 5000 \
--count_intronic_reads True \
--umi_cutoff 50 100 300

Note

As with barcode_flavors, spacemake provides several run_modes out of the box. For more info check out a more detailed guide here.

Step 6: add the sample

After configuring all the steps above, we are ready to add our (test) sample:

spacemake projects add_sample --project_id test_project \
--sample_id test_sample \
--R1 visium_public_lane_joined_1m_R1.fastq.gz \
--R2 visium_public_lane_joined_1m_R2.fastq.gz \
--species mouse \
--barcode_flavor test_barcode_flavor \
--run_mode test_run_mode

Note

If you already have your own data, add your Read1 and Read2 .fastq.gz files here instead of the test files.

Step 7: run spacemake

Now we can process our samples with spacemake. Since we added only one sample, only one sample will be processed and analyzed. To start spacemake, simply write:

spacemake run --cores 16

Note

The number of cores should suit the machine on which spacemake is run. When processing more than one sample, we recommend running spacemake with at least 8 cores in order to achieve maximum parallelism.
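
If you are unsure how many cores the machine has, the count can be looked up before starting the run. This is only a sketch: it uses getconf, which is available on most Linux and macOS systems, and it prints the resulting command instead of executing it. On a shared machine you may want a smaller number than the one reported.

```shell
# Number of online CPU cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command rather than running it, so the value can be reviewed first.
echo "spacemake run --cores $cores"
```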

Step 8: results

The results of the analysis for this sample will be under projects/test_project/processed_data/test_sample/illumina/complete_data/

Under this directory, there are several files and directories which are important:

  • final.polyA_adapter_trimmed.bam: final, mapped, tagged .bam file. The CB tag contains the cell barcode, and the MI tag contains the UMI.

  • qc_sheet_test_sample_no_spatial_data.html: the QC-sheet for this sample, as a self-contained .html file.

  • dge/: a directory containing the Digital Expression Matrices (DGEs)

    • dge.all.polyA_adapter_trimmed.5000_beads.txt.gz: a compressed, text based DGE

    • dge.all.polyA_adapter_trimmed.5000_beads.h5ad: the same DGE but stored in .h5ad format (used by the anndata python package). This matrix is stored as a Compressed Sparse Column matrix (using scipy.sparse.csc_matrix).

    • dge.all.polyA_adapter_trimmed.5000_beads.summary.txt: the summary of the DGE, one line per cell.

    • dge.all.polyA_adapter_trimmed.5000_beads.obs.csv: the observation table of the matrix. Similar to the previous file, but more detailed.

  • automated_analysis/test_run_mode/umi_cutoff_50/: this directory contains the results of the automated analysis. Under the automated_analysis directory there are two further levels, one for run_mode and one for umi_cutoff, because one sample can have several run_modes and one run_mode can have several UMI cutoffs.

    • results.h5ad: the result of the automated analysis, stored in an anndata object. Same as the DGE before, but containing processed data.

    • test_sample_no_spatial_data_illumina_automated_report.html: automated analysis self-contained .html report.
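
Once the run has finished, a quick way to confirm that the main outputs listed above were produced is to test for them one by one. The sketch below is not part of spacemake: check_outputs is a helper name introduced here, and it checks only a subset of the files named in this tutorial.

```shell
# Report which of the outputs named in this tutorial exist under a
# results directory. check_outputs is an illustrative helper.
check_outputs() {
    dir=$1
    missing=0
    for f in final.polyA_adapter_trimmed.bam \
             qc_sheet_test_sample_no_spatial_data.html \
             dge/dge.all.polyA_adapter_trimmed.5000_beads.h5ad \
             automated_analysis/test_run_mode/umi_cutoff_50/results.h5ad; do
        if [ -e "$dir/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every file was found
    [ "$missing" -eq 0 ]
}

check_outputs projects/test_project/processed_data/test_sample/illumina/complete_data \
    || echo "some outputs are missing (has the run finished?)"
```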

Note

If test_project had more samples, those would automatically be placed under projects/test_project. Similarly, one spacemake directory can hold several projects in parallel, each with its own directory structure under the projects/ folder.