Configuration

Once installed and initialized, spacemake needs to be configured.

One of the most important parts of spacemake is the set of so-called ‘shared sample-variables’. These are reusable, user-definable variables that can be assigned to several samples. Briefly, they are:

species: a collection of genome, annotation and rRNA_genome. There is no default species, and each sample can have exactly one species.
barcode-flavor: specifies the structure of Read1 and Read2, namely how the cell-barcode and UMI should be extracted. If no value is provided for a sample, the default will be used.
run-mode: each sample can have several run-modes, all of which are user-definable. If no run-modes are specified, a sample will be processed using the default run-mode settings.
puck (spatial only): if a sample is spatial, it has to have a puck variable. If no puck is specified, a default puck will be used.

To add, update, delete or list a shared sample-variable, you can use the following commands:

spacemake config add-<shared-sample-variable>
spacemake config update-<shared-sample-variable>
spacemake config delete-<shared-sample-variable>
spacemake config list-<shared-sample-variable>

where <shared-sample-variable> is one of species, barcode-flavor, run-mode or puck.

Configure species

To add species, the following command can be used:

spacemake config add_species \
    --name NAME \         # name of the species to be added
    --reference REF \     # name of the reference sequence
                          # ('genome', 'rRNA', 'spike_in', ...)
                          # if omitted defaults to 'genome'
    --sequence SEQUENCE \ # path to the reference sequence file
                          # (.fa) to be added
    --genome SEQUENCE \   # DEPRECATED! Please use --sequence instead.
    --annotation ANNOTATION \
                          # path to the annotation (.gtf) file for the species
                          # to be added

The spacemake config update-species command takes the same arguments as above, while spacemake config delete-species takes only --name.

As of version 0.7 you can add multiple reference sequences per species. For that, simply execute add-species multiple times, varying --reference ... but keeping --name constant.

To list the currently available species, type:

spacemake config list-species

Configure adapter-flavors and pre-processing

Spacemake can pre-process raw reads based on adapter-flavors. An adapter-flavor describes how adapters and polyA stretches should be trimmed from the cDNA read (usually read2). The complete set of operations that can be performed is:

- trim polyA stretches
- trim adapters
- clip low-quality bases
- clip a fixed number of bases from either end of read2

Access to these operations is provided through the adapter-flavors section of config.yaml only. Here is an example of an adapter-flavor:

adapter_flavors:
   example:
      - nextseq_quality:
            cutoff: 25
      - polyA:
      - adapter:
            name: SMART
            seq: AAGCAGTGGTATCAACGCAGAGTGAATGGG
            where: left
            min_overlap: 10
            max_errors: 0.1

Below follows a list of each operation and the supported parameters and default values.

quality

Trim low-quality bases from 3’ end and/or 5’ end of read. Functionality is provided by cutadapt. Two parameters are supported: left and right, which define the quality threshold below which bases will be trimmed from the 5’ and 3’ end of read2, respectively. Default is left: 0 and right: 25.
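To make the effect of the right threshold more concrete, here is a minimal sketch of the running-sum procedure that the cutadapt documentation describes for 3’ quality trimming. It is an illustration only, not spacemake's or cutadapt's actual code, and the function name quality_trim_3prime is hypothetical.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return the cut position: bases at index >= position are trimmed.

    Sketch of a running-sum (BWA-style) 3' quality trimming.
    `qualities` is a list of Phred scores, `cutoff` the threshold.
    """
    running_sum = 0
    max_sum = 0
    cut = len(qualities)  # default: keep the whole read
    for i in reversed(range(len(qualities))):
        # Bases below the cutoff add to the sum, good bases subtract from it.
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            break  # a good stretch outweighs everything seen so far
        if running_sum > max_sum:
            max_sum = running_sum
            cut = i
    return cut


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
cut = quality_trim_3prime(qualities, cutoff=10)
kept = qualities[:cut]  # the first four bases survive
```

Note that a single base above the threshold (the 11 in this example) does not stop the trimming when it is surrounded by low-quality bases.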

nextseq_quality

Trim low-quality bases from 3’ end of read. Functionality is provided by cutadapt. The sole parameter is cutoff, which defines the quality threshold below which bases will be trimmed. Analogous to quality with right=cutoff, except that terminal G nucleotides are always treated as below cutoff quality. Default is cutoff: 25.

Note

Before version 0.9.1 there was no quality trimming of bases at all, which led to issues on some runs. Between versions 0.9.1 and 0.9.5, the default was set to nextseq_quality with cutoff: 32, which is a common default for quality trimming, but relatively strict. In version 0.9.5 the default was changed to cutoff: 25, which is in our experience a good compromise, because low quality bases may still be soft-clipped in the mapping stage. However, if you experience a drop in UMI counts between pre 0.9.1 and current versions, you can try lowering the quality cutoff further (or even set it to 0) and rerun your samples, to restore pre 0.9.1 behavior.

clip

Clip bases from either end of read2. Two parameters are supported: left and right, which define how many bases should be clipped from the 5’ and 3’ end of read2, respectively. Default is left: 0 and right: 0.

polyA

Trim polyA stretches from 3’ end of read. Functionality is provided by cutadapt. The only supported parameter is revcomp, which if set to True will trim polyT stretches instead of polyA. Default is revcomp: False.

adapter

Trim adapters from either end of read. Functionality is provided by cutadapt. Parameters are:

name: name of the adapter. Only for logging purposes.

seq: sequence of the adapter to be trimmed.

min_overlap: minimum overlap between read and adapter for a successful trimming. Default is 3.

max_errors: maximum error rate allowed for a successful trimming. Default is 0.1.

where: where to search for the adapter. Possible values are 'left' and 'right'. Default is 'right' (3 prime end of cDNA).

Note

Internally, spacemake uses the cutadapt python module to perform all trimming operations. If where == 'left' we use cutadapt.adapters.NonInternalFrontAdapter, for where == 'right' we use cutadapt.adapters.BackAdapter.

For more information about the parameters and their meaning, please refer to the cutadapt source code.

Each adapter-flavor in the config.yaml is a list of operations to be performed in the given order. If needed, you can chain multiple operations of the same type (for example to remove multiple adapters).
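The "list of operations applied in order" idea can be sketched as follows. This is a toy model under stated assumptions: the flavor is written as the Python structure that a YAML list of single-key mappings would load into, the polyA step is a naive stand-in for cutadapt, and all function names are hypothetical.

```python
def clip(seq, left=0, right=0):
    """Remove a fixed number of bases from the 5' (left) and 3' (right) end."""
    end = len(seq) - right
    return seq[left:end] if end > left else ""


def poly_a(seq):
    """Toy polyA trimming: strip a run of A from the 3' end."""
    return seq.rstrip("A")


# Hypothetical dispatch table from operation name to function.
OPERATIONS = {"clip": clip, "polyA": poly_a}


def apply_flavor(seq, flavor):
    """Apply each operation of the flavor to the read, in the given order."""
    for step in flavor:
        (name, params), = step.items()  # each step is a single-key mapping
        seq = OPERATIONS[name](seq, **(params or {}))
    return seq


# Two operations of the same type (clip) are chained around a polyA step.
flavor = [
    {"clip": {"left": 2}},
    {"polyA": None},  # an empty YAML entry such as "- polyA:" loads as None
    {"clip": {"right": 2}},
]
trimmed = apply_flavor("GGACGTACGTAAAAAA", flavor)
```

Because the order matters, swapping the last two steps would clip two A bases first and then strip the remaining polyA stretch, giving a different result.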

CRAM/BAM tags with pre-processing info

Note that spacemake keeps a record of pre-processing steps for each read in the CRAM/BAM file tags, so it is always possible to track which operations were performed:

- A3: comma-separated list of adapters detected and trimmed from the 3’ end. May also contain “polyA” if polyA trimming was performed and/or “Q” if quality trimming was performed.
- A5: comma-separated list of adapters detected and trimmed from the 5’ end.
- T3: comma-separated list of number of bases trimmed from the 3’ end (synced with A3).
- T5: comma-separated list of number of bases trimmed from the 5’ end (synced with A5).

Here is an example:

read_name 163 chr1 1000 60 50M = 1050 100 ACGT... * NM:i:0 A3:Z:Q,polyA A5:Z:SMART T3:Z:5,10 T5:Z:22

The tags indicate that from the 3’ end of the read, first 5 bases were trimmed due to quality (Q), then 10 bases of a polyA stretch. From the 5’ end, 22 bases of SMART adapter were trimmed.
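As a rough illustration of how these tags can be read back, the following sketch pairs each entry of A3/A5 with the matching count in T3/T5 for a single SAM-formatted line. It assumes the standard TAG:TYPE:VALUE layout of optional SAM fields; the function name preprocessing_record is hypothetical.

```python
def preprocessing_record(sam_line):
    """Pair trimmed features (A3/A5) with trimmed base counts (T3/T5)."""
    # The first 11 SAM columns are mandatory; optional tags follow.
    tags = {}
    for field in sam_line.split()[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value

    def pair(names_tag, counts_tag):
        names = tags.get(names_tag, "")
        counts = tags.get(counts_tag, "")
        if not names or not counts:
            return []
        return list(zip(names.split(","), (int(c) for c in counts.split(","))))

    return {"3prime": pair("A3", "T3"), "5prime": pair("A5", "T5")}


line = ("read_name 163 chr1 1000 60 50M = 1050 100 ACGT... * "
        "NM:i:0 A3:Z:Q,polyA A5:Z:SMART T3:Z:5,10 T5:Z:22")
record = preprocessing_record(line)
```

For the example line this yields the quality and polyA events with 5 and 10 bases on the 3’ side, and the SMART adapter with 22 bases on the 5’ side.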

Configure barcode-flavors

This sample-variable describes how the cell-barcode and the UMI should be extracted from Read1 and Read2. The default value for barcode_flavor will be dropseq: cell = r1[0:12] (cell-barcode comes from the first 12 nt of Read1) and UMI = r1[12:20] (UMI comes from nt 13-20 of Read1).

If a sample has no barcode_flavor provided, the default barcode_flavor will be used.
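The slice notation follows Python's zero-based, end-exclusive indexing. The sketch below shows how such a specification could be applied to a read pair. It is not spacemake's parser: the extract helper and the regular expression are hypothetical and only cover the simple forms shown in this section, including the optional [::-1] reversal.

```python
import re

# Matches e.g. "r1[0:12]" or "r1[0:12][::-1]" (illustrative subset only).
_SPEC = re.compile(r"^(r1|r2)\[(\d+):(\d+)\](\[::-1\])?$")


def extract(spec, r1, r2):
    """Return the part of r1 or r2 selected by a barcode-flavor expression."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported specification: {spec!r}")
    read = r1 if match.group(1) == "r1" else r2
    piece = read[int(match.group(2)):int(match.group(3))]
    return piece[::-1] if match.group(4) else piece


r1 = "AAAACCCCGGGGTTTTACGTNNNN"
r2 = "ACGTACGTACGT"

cell = extract("r1[0:12]", r1, r2)   # first 12 nt of Read1
umi = extract("r1[12:20]", r1, r2)   # nt 13-20 of Read1
```

With these reads, cell is "AAAACCCCGGGG" and umi is "TTTTACGT".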

Barcode correction

As of version 0.9.3, spacemake performs spatial barcode correction with edit distance 1, which boosts counts by ~5-15% for many samples. For performance reasons, this employs some heuristics:

- all N bases are replaced with A, in the reference (flowcell) catalog as well as in the samples.
- a capture-area catalog of reference barcodes is built for each sample, based on exact match counts alone.
- exact matches to the capture-area catalog are searched first and preferred. Unmatched barcodes go on to a second stage of potential error correction.
- spacemake looks up all edit distance 1 variants of an unmatched sample barcode in the capture-area catalog in a defined order.

The first match is reported and no further matches are considered. The order is as follows: (1) substitutions, (2) insertions, (3) deletions. This means that if a barcode has no exact match but multiple edit distance 1 matches, the correction will be deterministic, but is not guaranteed to be correct. In practice, however, the fraction of barcodes with multiple edit distance 1 matches is extremely low and dwarfed by other sources of experimental and technical noise.
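The lookup order can be sketched as follows. This is a simplified model and not spacemake's implementation: the function names are hypothetical, the catalog is a plain Python set, and variants produced by insertions or deletions are compared as they are, whereas a real implementation has to decide how to keep barcodes at a fixed length.

```python
from itertools import chain

BASES = "ACGT"


def substitutions(bc):
    for i, old in enumerate(bc):
        for base in BASES:
            if base != old:
                yield bc[:i] + base + bc[i + 1:]


def insertions(bc):
    for i in range(len(bc) + 1):
        for base in BASES:
            yield bc[:i] + base + bc[i:]


def deletions(bc):
    for i in range(len(bc)):
        yield bc[:i] + bc[i + 1:]


def correct(barcode, catalog):
    """Return the matching catalog barcode, or None if nothing matches."""
    barcode = barcode.replace("N", "A")  # heuristic: N is treated as A
    if barcode in catalog:               # exact matches are preferred
        return barcode
    # Fixed order: substitutions, then insertions, then deletions.
    variants = chain(substitutions(barcode), insertions(barcode), deletions(barcode))
    for candidate in variants:
        if candidate in catalog:
            return candidate             # first hit wins, deterministic
    return None


catalog = {"ACGTACGT"}
fixed = correct("ACGTACCT", catalog)  # one substitution away from the catalog entry
```

Since substitutions are tried first, a barcode that is one substitution away from one catalog entry and one deletion away from another always resolves to the former.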

Note

Barcode correction requires that --puck-barcode-files is configured for your sample. Otherwise it will not be treated as a spatial sample and no capture-area catalog can be built.

Note

If you have already run your samples with a previous version of spacemake and want to apply the new barcode correction, you can run spacemake run estimate-correction-gains to get an estimate of the increase in UMI counts to expect for each sample. In our experience, this is close to the actual increase, unless your ratio of reads to UMIs is already high, indicating saturation of the library, in which case the gains may be lower. If you want to give it a try, just update spacemake and run again. The correction should be applied automatically.

Provided barcode-flavors

Note

Future versions of spacemake will merge barcode-flavors into adapter-flavors (which arguably become pre-processing flavors at that point) by defining barcode as a pre-processing step with cell and UMI as parameters. In the current implementation, barcode-flavors are kept separate for backwards compatibility. The new implementation will give additional flexibility, for example to remove additional adapters/primers, or clip the read further, after barcode extraction. Currently, if barcode is not in the list of pre-processing steps, it is taken to be implied as the last step and its parameters are loaded from the barcode-flavor.

Spacemake provides the following barcode-flavors out of the box:

default:
    cell: "r1[0:12]"
    UMI: "r1[12:20]"
openst:
    cell: "r1[2:27]"
    UMI: "r2[0:9]"
sc_10x_v2:
    cell: "r1[0:16]"
    UMI: "r1[16:26]"
seq_scope:
    UMI: "r2[0:9]"
    cell: "r1[0:20]"
slide_seq_14bc:
    cell: "r1[0:14]"
    UMI: "r1[14:23]"
slide_seq_15bc:
    cell: "r1[0:14]"
    UMI: "r1[15:23]"
visium:
    cell: "r1[0:16]"
    UMI: "r1[16:28]"

To list the currently available barcode-flavors, type:

spacemake config list_barcode-flavors

Warning

The command line interface for adding, updating, and deleting barcode-flavors will be deprecated in future versions of spacemake. Please consider editing the config.yaml file directly to manage barcode-flavors.

Add a new barcode_flavor

spacemake config add_barcode-flavor \
   --name NAME \
      # name of the barcode flavor

   --umi UMI \
      # structure of UMI, using python's list syntax.
      # Example: to set the UMI to nt 13-20 of Read1, use --umi r1[12:20].
      # It is also possible to use the first 8nt of Read2 as UMI: --umi r2[0:8].

   --cell-barcode CELL-BARCODE
      # structure of CELL BARCODE, using python's list syntax.
      # Example: to set the cell-barcode to 1-12 nt of Read1, use --cell-barcode r1[0:12].
      # It is also possible to reverse the CELL BARCODE, for instance with r1[0:12][::-1].

Update/delete a barcode-flavor

The spacemake config update-barcode-flavor command takes the same arguments as above, while spacemake config delete-barcode-flavor takes only --name.

Configure run-modes

Run-modes are a central source of flexibility in spacemake. By setting a run-mode, a sample can be processed and analysed downstream in various ways.

Each run-mode can have the following variables:

n_beads: number of cell-barcodes expected.
umi_cutoff: a list of integers. Downstream analysis will be run using each of these UMI cutoffs; that is, cell-barcodes with fewer UMIs will be discarded.
clean_dge: whether to clean cell-barcodes from overhang primers before creating the DGE.
detect_tissue (spatial only): if True, in addition to applying the UMI cutoff, spacemake will try to detect the tissue in silico.
polyA_adapter_trimming: if True, 3’ polyA stretches and adapters will be trimmed from Read2.
count_intronic_reads: if True intronic reads will be counted when creating the DGE.
count_mm_reads: if True, multi-mappers will be counted. Only multi-mapping reads that map to exactly one CDS or UTR segment of a gene are counted this way.
mesh_data (spatial only): if True a mesh will be created when running this run-mode.
mesh_type (spatial only): spacemake currently offers two types of meshes: (1) circle, where circles with a given mesh_spot_diameter_um will be placed in a hexagonal grid, mesh_spot_distance_um apart; (2) hexagon, where equal hexagons with sides of mesh_spot_diameter_um will be placed in a full mesh grid, such that the whole area is covered.
mesh_spot_diameter_um (spatial only): the diameter of the mesh spatial-unit, in microns.
mesh_spot_distance_um (spatial only, only for circle mesh): distance between the meshed circles, in microns.
spatial_barcode_min_matches (spatial only): minimum fraction of spatial barcode matches, expressed as a value between 0 and 1, used as a threshold to exclude pucks from DGE creation and subsequent steps of the pipeline. If set to 0, no pucks are excluded.
parent_run-mode: each run-mode can have a parent, to which it will fall back. If one of the run-mode variables is missing, the value from the parent will be used. If no parent is provided, the default run-mode will be the parent.
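The fallback behaviour can be pictured as a chain of dictionaries that ends at the default run-mode. The following is a minimal sketch of that idea, not spacemake's own resolution code. The key name parent_run_mode, the function name and the visium_strict run-mode are assumptions made for illustration.

```python
def resolve_run_mode(name, run_modes):
    """Merge a run-mode with its parents; child values take precedence."""
    lineage = []
    current = name
    while current not in lineage:  # also guards against circular parents
        lineage.append(current)
        if current == "default":
            break
        # Without an explicit parent, the default run-mode is the parent.
        current = run_modes[current].get("parent_run_mode", "default")

    resolved = {}
    for mode in reversed(lineage):  # start at the root, then override
        for key, value in run_modes[mode].items():
            if key != "parent_run_mode":
                resolved[key] = value
    return resolved


run_modes = {
    "default": {"n_beads": 100000, "umi_cutoff": [100, 300, 500], "count_mm_reads": False},
    "visium": {"n_beads": 10000, "count_mm_reads": True},
    # Hypothetical user-defined run-mode that only overrides the cutoff.
    "visium_strict": {"parent_run_mode": "visium", "umi_cutoff": [1000]},
}
settings = resolve_run_mode("visium_strict", run_modes)
```

Here settings takes umi_cutoff from visium_strict itself, while n_beads and count_mm_reads come from its parent visium; any variable set in neither would come from default.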

Provided run-modes

default:
  clean_dge: false
  count_intronic_reads: true
  count_mm_reads: false
  detect_tissue: false
  mesh_data: false
  mesh_spot_diameter_um: 55
  mesh_spot_distance_um: 100
  mesh_type: circle
  n_beads: 100000
  polyA_adapter_trimming: true
  spatial_barcode_min_matches: 0
  umi_cutoff:
  - 100
  - 300
  - 500
openst:
  clean_dge: false
  count_intronic_reads: true
  count_mm_reads: true
  detect_tissue: false
  mesh_data: true
  mesh_spot_diameter_um: 7
  mesh_spot_distance_um: 7
  mesh_type: hexagon
  n_beads: 100000
  polyA_adapter_trimming: true
  spatial_barcode_min_matches: 0.1
  umi_cutoff:
  - 100
  - 250
  - 500
scRNA_seq:
  count_intronic_reads: true
  count_mm_reads: false
  detect_tissue: false
  n_beads: 10000
  umi_cutoff:
  - 500
seq_scope:
  clean_dge: false
  count_intronic_reads: false
  count_mm_reads: false
  detect_tissue: false
  mesh_data: true
  mesh_spot_diameter_um: 10
  mesh_spot_distance_um: 15
  mesh_type: hexagon
  n_beads: 1000
  umi_cutoff:
  - 100
  - 300
slide_seq:
  clean_dge: false
  detect_tissue: false
  n_beads: 100000
  umi_cutoff:
  - 50
visium:
  clean_dge: false
  count_intronic_reads: false
  count_mm_reads: true
  detect_tissue: true
  n_beads: 10000
  umi_cutoff:
  - 1000
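To give an intuition for the circle mesh type, the sketch below places circle centers on a hexagonal grid in which neighbouring centers are mesh_spot_distance_um apart. It is only a geometric illustration under the assumption of a rectangular area starting at (0, 0); the function name is hypothetical and spacemake's own meshing code may differ in origin, orientation and edge handling.

```python
import math


def hex_grid_centers(width_um, height_um, spot_distance_um):
    """Return (x, y) centers of a hexagonal grid covering the given area."""
    # In a hexagonal grid, rows are sqrt(3)/2 * distance apart and every
    # second row is shifted by half the spot distance.
    row_height = spot_distance_um * math.sqrt(3) / 2
    centers = []
    row = 0
    y = 0.0
    while y <= height_um:
        x = spot_distance_um / 2 if row % 2 else 0.0
        while x <= width_um:
            centers.append((x, y))
            x += spot_distance_um
        row += 1
        y = row * row_height
    return centers


# Values mirror the default run-mode: 100 um between spot centers.
centers = hex_grid_centers(width_um=300, height_um=300, spot_distance_um=100)
```

Each center would then collect all beads that fall within mesh_spot_diameter_um / 2 of it, which is the binning step this sketch leaves out.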

Note

If a sample has no run-mode provided, the default will be used.

Note

If a run-mode variable is not provided, the variable of the default run-mode will be used.

To list the currently available run-modes, type:

spacemake config list_run-modes

Warning

The command line interface for adding, updating, and deleting run-modes will be deprecated in future versions of spacemake. Please consider editing the config.yaml file directly to manage run-modes.

Add a new run_mode

See the variable descriptions above.

spacemake config add_run-mode \
   --name NAME \
   --parent_run_mode PARENT_RUN_MODE \
   --umi_cutoff UMI_CUTOFF [UMI_CUTOFF ...] \
   --n_beads N_BEADS \
   --clean_dge {True,true,False,false} \
   --detect_tissue {True,true,False,false} \
   --polyA_adapter_trimming {True,true,False,false} \
   --count_intronic_reads {True,true,False,false} \
   --count_mm_reads {True,true,False,false} \
   --mesh_data {True,true,False,false} \
   --mesh_type {circle,hexagon} \
   --mesh_spot_diameter_um MESH_SPOT_DIAMETER_UM \
   --mesh_spot_distance_um MESH_SPOT_DISTANCE_UM

Update/delete a run-mode

The spacemake config update-run-mode command takes the same arguments as above, while spacemake config delete-run-mode takes only --name.

Configure pucks

Each spatial sample is associated with a puck. The puck variable defines the dimensionality of the underlying spatial structure, which spacemake uses during the automated analysis and plotting, as well as the binning (meshing) of the data when selected in the run-mode.

Each puck has the following variables:

width_um: the width of the puck, in microns
spot_diameter_um: the diameter of a bead on this puck, in microns.
barcodes (optional): the path to the barcode file, containing the cell_barcode and (x,y) position for each. This is handy when several pucks have the same barcodes, such as for 10x Visium.
coordinate_system (optional): the path to the coordinate system file, containing puck IDs and the (x,y,z) position for each, in global coordinates. This coordinate system is analogous to the global coordinate system for image stitching. When specified, this ‘stitching’ is automatically performed on pucks with spatial information.
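Conceptually, stitching shifts the puck-local bead positions by the global position of their puck. The sketch below shows that translation under simplifying assumptions: the coordinate system is given as an in-memory mapping rather than read from the file, only x and y are shifted, and the puck IDs, values and function name are hypothetical.

```python
def to_global(puck_id, x, y, coordinate_system):
    """Shift a puck-local (x, y) position by the puck's global origin."""
    x_origin, y_origin, _z = coordinate_system[puck_id]
    return (x + x_origin, y + y_origin)


# Hypothetical layout: two pucks of width 1200 placed side by side.
coordinate_system = {
    "puck_1": (0.0, 0.0, 0.0),
    "puck_2": (1200.0, 0.0, 0.0),
}
position = to_global("puck_2", 10.0, 20.0, coordinate_system)
```

A bead at local position (10, 20) on the second puck thus ends up at (1210, 20) in the global frame, to the right of every bead on the first puck.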

Provided pucks

default:
  coordinate_system: ''
  spot_diameter_um: 10
  width_um: 3000
openst:
  coordinate_system: puck_data/openst_coordinate_system.csv
  spot_diameter_um: 0.6
  width_um: 1200
seq_scope:
  spot_diameter_um: 1
  width_um: 1000
slide_seq:
  spot_diameter_um: 10
  width_um: 3000
visium:
  barcodes: puck_data/visium_barcode_positions.csv
  spot_diameter_um: 55
  width_um: 6500

The visium puck comes with a barcodes variable, which points to puck_data/visium_barcode_positions.csv. Similarly, the openst puck comes with a coordinate_system variable, pointing to puck_data/openst_coordinate_system.csv.

Upon initialization, these files will automatically be placed there by spacemake.

To list the currently available pucks, type:

spacemake config list_pucks

Warning

The command line interface for adding, updating, and deleting pucks will be deprecated in future versions of spacemake. Please consider editing the config.yaml file directly to manage pucks.

Add a new puck

spacemake config add_puck \
   --name NAME \        # name of the puck
   --width_um WIDTH_UM \
   --spot_diameter_um SPOT_DIAMETER_UM \
   --barcodes BARCODES \ # path to the barcode file, optional
   --coordinate_system COORDINATE_SYSTEM # path to the coordinate system file, optional

Custom snakemake rules

As of version 0.7, it is possible to add custom snakemake rules to your spacemake workflow. Simply add the following line to the config.yaml in your spacemake root folder:

custom_rules: /path/to/my_own_custom_snakefile.smk

Within your custom code, you can import spacemake modules and have access to internal variables. If you need to make spacemake aware of new top-level targets that have to be made, you can register a callback:

register_module_output_hook(get_my_custom_targets, "my_own_custom_snakefile.smk")

The function get_my_custom_targets() will be called once all other, internal spacemake code has been executed and is expected to return a list of files that will be appended to the input: dependencies of the top-level rule. Providing rules to make these files is up to your custom rules.

The second parameter is mainly for logging purposes and makes it possible to track which module or part of the code injected which dependencies. It is good practice to use the filename.
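A callback of this kind could look like the following sketch. The sample names and output paths are made up for illustration, and the registration call is left as a comment because register_module_output_hook is only available inside the spacemake workflow.

```python
def get_my_custom_targets():
    """Return the extra files that the top-level rule should depend on."""
    # Hypothetical sample names; in practice these would come from your
    # own project configuration.
    samples = ["sample_a", "sample_b"]
    return [f"custom_output/{sample}/summary.txt" for sample in samples]


# Inside the custom snakefile, the callback is then registered with:
# register_module_output_hook(get_my_custom_targets, "my_own_custom_snakefile.smk")
```

Each returned path still needs a rule in the custom snakefile that produces it; otherwise snakemake has no way to build the requested target.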