Configuration ============= Once installed and initialized, spacemake needs to be configured. .. include:: shared/shared_sample_variables.rst Configure species ----------------- .. _configure-species: To add species, the following command can be used:: spacemake config add_species \ --name NAME \ # name of the species to be added --reference REF \ # name of the reference sequence # ('genome', 'rRNA', 'spike_in', ...) # if omitted defaults to 'genome' --sequence SEQUENCE \ # path to the reference sequence file # (.fa) to be added --genome SEQUENCE \ # DEPRECATED! Please use --sequence instead. --annotation ANNOTATION \ # path to the annotation (.gtf) file for the species # to be added The ``spacemake config update-species`` takes the same arguments as above, while ``spacemake config delete-species`` takes only ``--name``. As of version ``0.7`` you can add multiple reference sequences per species. For that, simply execute ``add-species`` multiple times, varying ``--reference ...`` but keeping ``--name`` constant. To list the currently available ``species``, type:: spacemake config list-species Configure adapter-flavors and pre-processing -------------------------------------------- .. _configure-adapter-flavor: Spacemake allows to pre-process raw reads based on adapter-flavors. An adapter-flavor describes how adapters and polyA stretches should be trimmed from the cDNA read (usually read2). The complete set of operations that can be performed are: - trim polyA stretches - trim adapters - clip low-quality bases - clip fixed number of bases from either end of read2 Access to these operations is provided through the ``adapter-flavors`` section of ``config.yaml`` only. Here is an example of an adapter-flavor: .. code-block:: yaml adapter_flavors: example: - nextseq_quality: cutoff: 25 - polyA: - adapter: name: SMART seq: AAGCAGTGGTATCAACGCAGAGTGAATGGG where: left min_overlap: 10 max_errors: 0.1 Below follows a list of each operation and the supported parameters and default values. quality ^^^^^^^ Trim low-quality bases from 3' end and/or 5' end of read. Functionality is provided by `cutadapt `_. Two parameters are supported: ``left`` and ``right``, which define the quality threshold below which bases will be trimmed from the 5' and 3' end of read2, respectively. Default is ``left: 0`` and ``right: 25``. nextseq_quality ^^^^^^^^^^^^^^^ Trim low-quality bases from 3' end of read. Functionality is provided by `cutadapt `_. The sole parameter is ``cutoff``, which defines the quality threshold below which bases will be trimmed. Analogous to quality with ``right=cutoff``, except that terminal `G` nucleotides are always treated as below cutoff quality. Default is ``cutoff: 25``. .. note:: Before version ``0.9.1`` there was no quality trimming of bases at all, which led to issues on some runs. Between versions ``0.9.1`` and ``0.9.5``, the default was set to ``nextseq_quality`` with ``cutoff: 32``, which is a common default for quality trimming, but relatively strict. In version ``0.9.5`` the default was changed to ``cutoff: 25``, which is in our experience a good compromise, because low quality bases may still be soft-clipped in the mapping stage. However, if you experience a drop in UMI counts between pre ``0.9.1`` and current versions, you can try lowering the quality cutoff further (or even set it to 0) and rerun your samples, to restore pre ``0.9.1`` behavior. clip ^^^^ Clip bases from either end of read2. Two parameters are supported: ``left`` and ``right``, which define how many bases should be clipped from the 5' and 3' end of read2, respectively. Default is ``left: 0`` and ``right: 0``. polyA ^^^^^ Trim polyA stretches from 3' end of read. Functionality is provided by `cutadapt `_. The only supported parameter is ``revcomp``, which if set to ``True`` will trim polyT stretches instead of polyA. Default is ``revcomp: False``. adapter ^^^^^^^ Trim adapters from either end of read. Functionality is provided by `cutadapt `_. Paraneters are: - ``name``: name of the adapter. Only for logging purposes. - ``seq``: sequence of the adapter to be trimmed. - ``min_overlap``: minimum overlap between read and adapter for a successful trimming. Default is ``3``. - ``max_errors``: maximum error rate allowed for a successful trimming. Default is ``0.1``. - ``where``: where to search for the adapter. Possible values are ``'left'``, and ``'right'``. Default is ``'right'`` (3 prime end of cDNA). .. note:: Internally, spacemake uses the cutadapt python module to perform all trimming operations. If ``where == 'left'`` we use ``cutadapt.adapters.NonInternalFrontAdapter``, for ``where == 'right'`` we use ``cutadapt.adapters.BackAdapter``. For more information about the parameters and their meaning, please refer to the `cutadapt source code `_. Each adapter-flavor in the ``config.yaml`` is a list of operations to be performed in the given order. If needed, you can chain multiple operations of the same type (for example to remove multiple adapters). CRAM/BAM tags with pre-processing info ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Note that `spacemake` keeps a record of pre-processing steps for each read in the CRAM/BAM file tags, so it is always possible to track which operations were performed: - A3: comma-separated list of adapters detected and trimmed from the 3'end. May also contain "polyA" if polyA trimming was performed and/or "Q" if quality trimming was performed. - A5: comma-separated list of adapters detected and trimmed from the 5'end. - T3: comma-separated list of number of bases trimmed from the 3'end (synced with A3). - T5: comma-separated list of number of bases trimmed from the 5'end (synced with A5). Here is an example: .. code-block:: text read_name 163 chr1 1000 60 50M = 1050 100 ACGT... * NM:i:0 A3:Z:Q,polyA A5:Z:SMART T3:Z:5,10 T5:22 The tags indicate that from the 3' end of the read, first 5 bases were trimmed due to quality (Q), then 10 bases of a polyA stretch. From the 5' end, 22 bases of SMART adapter were trimmed. Configure barcode-flavors -------------------------- .. _configure-barcode-flavor: This sample-variable describes how the cell-barcode and the UMI should be extracted from Read1 and Read2. The ``default`` value for barcode\_flavor will be dropseq: ``cell = r1[0:12]`` (cell-barcode comes from first 12nt of Read1) and ``UMI = r1[12:20]`` (UMI comes from the 13-20 nt of Read1). **If a sample has no barcode\_flavor provided, the default barcode\_flavor will be used** Barcode correction ^^^^^^^^^^^^^^^^^^ As of version ``0.9.3``, spacemake performs spatial barcode correction with edit distance 1, which boosts counts by ~5-15% for many samples. For performance reasons, this employs some heuristics: - all ``N`` bases are replaced with ``A``, in the reference (flowcell) catalog, as well as in the samples. - a capture-area catalog of reference barcodes is built for each samples, based on exact match counts alone. - exact matches to the capture-area catalog are searched first and preferred. Unmatched barcodes go on to a second stage of potential error correction. - spacemake looks all edit distance 1 variants of an unmatched sample barcode in the capture-area catalog in a defined order. The first match is reported and no further matches are considered. The order is as follows: (1) substitutions, (2) insertions, (3) deletions. This means that if a barcode has no exact matches, but **multiple** edit 1 matches, the correction will be deterministic, but is not guaranteed to be correct. In practice, however, the fraction of barcodes with multiple edit 1 matches is extremely low and dwarfed by other sources of experimental and technical noise. .. note:: Barcode correction requires to configure ``--puck-barcode-files`` for your sample. Otherwise it will not be treated as a spatial sample and no capture-area catalog can be built. .. note:: If you have already run your samples with a previous version of spacemake and want to apply the new barcode correction, you can run ``spacemake run estimate-correction-gains`` to get an estimate of the increase in UMI counts to expect for each sample. In our experience, this is close to the actual increase, unless your ratio of reads to UMIs is already high, indicating saturation of the library, in which case the gains may be lower. If you want to give it a try, just update spacemake and run again. The correction should be applied automatically. Provided barcode-flavors ^^^^^^^^^^^^^^^^^^^^^^^^^ .. note:: Future versions of spacemake will merge barcode-flavors into adapter-flavors (which arguably become pre-processing flavors at that point) by defining ``barcode`` as a pre-processing step with ``cell`` and ``UMI`` as parameters. In the current implementation, barcode-flavors are kept separate for backwards compatibility. The new implementation will give additional flexibity, for example to remove additional adapters/primers, or clip the read further, after barcode extraction. Currently, if ``barcode`` is not in the list of pre-processing steps, it is taken to be implied as the last step and its parameters are loaded from the barcode-flavor. Spacemake provides the following barcode-flavors out of the box: .. code-block:: yaml default: cell: "r1[0:12]" UMI: "r1[12:20]" openst: cell: "r1[2:27]" UMI: "r2[0:9]" sc_10x_v2: cell: "r1[0:16]" UMI: "r1[16:26]" seq_scope: UMI: "r2[0:9]" cell: "r1[0:20]" slide_seq_14bc: cell: "r1[0:14]" UMI: "r1[14:23]" slide_seq_15bc: cell: "r1[0:14]" UMI: "r1[15:23]" visium: cell: "r1[0:16]" UMI: "r1[16:28]" To list the currently available ``barcode-flavor``-s, type:: spacemake config list_barcode-flavors .. warning:: The command line interface for adding, updating, and deleting barcode-flavors will be deprecated in future versions of spacemake. Please consider editing the ``config.yaml`` file directly to manage barcode-flavors. Add a new barcode\_flavor ^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: spacemake config add_barcode-flavor \ --name NAME \ # name of the barcode flavor --umi UMI \ # structure of UMI, using python's list syntax. # Example: to set UMI to 13-20 NT of Read1, use --umi r1[12:20]. # It is also possible to use the first 8nt of Read2 as UMI: --umi r2[0:8]. --cell-barcode CELL-BARCODE # structure of CELL BARCODE, using python's list syntax. # Example: to set the cell-barcode to 1-12 nt of Read1, use --cell-barcode r1[0:12]. # It is also possible to reverse the CELL BARCODE, for instance with r1[0:12][::-1]. Update/delete a barcode-flavor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``spacemake config update-barcode-flavor`` takes the same arguments as above, while ``spacemake config delete-barcode-flavor`` takes only ``--name``. Configure run-modes -------------------- .. _configure-run-mode: Specifying a "run mode" is an essential flexibity that spacemake offers. Through setting a ``run-mode``, a sample can be processed and analysed downstream in various fashions. Each ``run-mode`` can have the following variables: ``n_beads`` number of cell-barcode expected ``umi_cutoff`` a list of integers. downstream the analysis will be run using these UMI cutoffs, that is cell-barcodes with less UMIs will be discarded ``clean_dge`` whether to clean cell-barcodes from overhang primers, before creating the DGE. ``detect_tissue`` (spatial only) if ``True``, apart from UMI cutoff spacemake will try to detect the tissue *in-silico*. ``polyA_adapter_trimming`` if ``True`` 3' polyA stretches and apaters will be trimmed from Read2. ``count_intronic_reads`` if ``True`` intronic reads will be counted when creating the DGE. ``count_mm_reads`` if ``True`` multi-mappers will be counted. Only those multi-mapping reads will be counted this way, which map to exactly one CDS or UTR segment of a gene. ``mesh_data`` (spatial only) if ``True`` a mesh will be created when running this ``run-mode``. ``mesh_type`` (spatial only) spacemake currently offers two types of meshes: (1) ``circle``, where circles with a given ``mesh_spot_diameter_um`` will be placed in a hexagonal grid, ``mesh_spot_distance_um`` distance apart; (2) a hexagonal grid, where equal hexagons with ``mesh_spot_diameter_um`` sides will be placed in a full mesh grid, such that the whole area is covered. ``mesh_spot_diameter_um`` (spatial only) the diameter of the mesh spatial-unit, in microns. ``mesh_spot_distance_um`` (spatial only, only for circle mesh) distance between the meshed circles, in microns. ``spatial_barcode_min_matches`` (spatial only) ratio spatial barcode matches, expressed as 0-1 interval, used as a minimum threshold to filter out pucks from DGE creation and subsequent steps of the pipeline. If set to 0, no pucks are excluded. ``parent_run-mode`` Each ``run-mode`` can have a parent, to which it will fall back. If a one of the ``run-mode`` variables is missing, the variable of the parent will be used. If parent is not provided, the ``default`` ``run-mode`` will be the parent. Provided run-modes ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml default: clean_dge: false count_intronic_reads: true count_mm_reads: false detect_tissue: false mesh_data: false mesh_spot_diameter_um: 55 mesh_spot_distance_um: 100 mesh_type: circle n_beads: 100000 polyA_adapter_trimming: true spatial_barcode_min_matches: 0 umi_cutoff: - 100 - 300 - 500 openst: clean_dge: false count_intronic_reads: true count_mm_reads: true detect_tissue: false mesh_data: true mesh_spot_diameter_um: 7 mesh_spot_distance_um: 7 mesh_type: hexagon n_beads: 100000 polyA_adapter_trimming: true spatial_barcode_min_matches: 0.1 umi_cutoff: - 100 - 250 - 500 scRNA_seq: count_intronic_reads: true count_mm_reads: false detect_tissue: false n_beads: 10000 umi_cutoff: - 500 seq_scope: clean_dge: false count_intronic_reads: false count_mm_reads: false detect_tissue: false mesh_data: true mesh_spot_diameter_um: 10 mesh_spot_distance_um: 15 mesh_type: hexagon n_beads: 1000 umi_cutoff: - 100 - 300 slide_seq: clean_dge: false detect_tissue: false n_beads: 100000 umi_cutoff: - 50 visium: clean_dge: false count_intronic_reads: false count_mm_reads: true detect_tissue: true n_beads: 10000 umi_cutoff: - 1000 .. note:: If a sample has no ``run-mode`` provided, the ``default`` will be used .. note:: If a ``run-mode`` variable is not provided, the variable of the default ``run-mode`` will be used To list the currently available ``run-mode``-s, type:: spacemake config list_run-modes .. warning:: The command line interface for adding, updating, and deleting run\_modes will be deprecated in future versions of spacemake. Please consider editing the ``config.yaml`` file directly to manage run-modes. Add a new run\_mode ^^^^^^^^^^^^^^^^^^^ See the :ref:`variable descriptions ` above. .. code-block:: spacemake config add_run-mode \ --name NAME \ --parent_run_mode PARENT_RUN_MODE \ --umi_cutoff UMI_CUTOFF [UMI_CUTOFF ...] \ --n_beads N_BEADS \ --clean_dge {True,true,False,false} \ --detect_tissue {True,true,False,false} \ --polyA_adapter_trimming {True,true,False,false} \ --count_intronic_reads {True,true,False,false} \ --count_mm_reads {True,true,False,false} \ --mesh_data {True,true,False,false} \ --mesh_type {circle,hexagon} \ --mesh_spot_diameter_um MESH_SPOT_DIAMETER_UM \ --mesh_spot_distance_um MESH_SPOT_DISTANCE_UM Update/delete a run-mode ^^^^^^^^^^^^^^^^^^^^^^^^^ The ``spacemake config update-run-mode`` takes the same arguments as above, while ``spacemake config delete-run-mode`` takes only ``--name``. Configure pucks --------------- .. _configure-puck: Each spatial sample is associated with a ``puck``. The ``puck`` variable defines the dimensionality of the underlying spatial structure, which spacemake uses during the automated analysis and plotting, as well as the binning (meshing) of the data when selected in the ``run-mode``. Each puck has the following variables: - ``width_um``: the width of the puck, in microns - ``spot_diameter_um``: the diameter of bead on this puck, in microns. - ``barcodes`` (optional): the path to the barcode file, containing the cell\_barcode and (x,y) position for each. This is handy when several pucks have the same barcodes, such as for 10x Visium. - ``coordinate_system`` (optional): the path to the coordinate system file, containing puck IDs and the (x,y,z) position for each, in global coordinates. This coordinate system is analogous to the global coordinate system for image stitching. When specified, this 'stitching' is automatically performed on ``puck``-s with spatial information. Provided pucks ^^^^^^^^^^^^^^ .. code-block:: yaml default: coordinate_system: '' spot_diameter_um: 10 width_um: 3000 openst: coordinate_system: puck_data/openst_coordinate_system.csv spot_diameter_um: 0.6 width_um: 1200 seq_scope: spot_diameter_um: 1 width_um: 1000 slide_seq: spot_diameter_um: 10 width_um: 3000 visium: barcodes: puck_data/visium_barcode_positions.csv spot_diameter_um: 55 width_um: 6500 The ``visium`` puck comes with a ``barcodes`` variable, which points to ``puck_data/visium_barcode_positions.csv``. Similarly, the ``openst`` puck comes with a ``coordinate_system`` variable, pointing to ``puck_data/openst_coordinate_system.csv``. Upon initiation, these files will automatically placed there by spacemake To list the currently available ``puck``-s, type:: spacemake config list_pucks .. warning:: The command line interface for adding, updating, and deleting pucks will be deprecated in future versions of spacemake. Please consider editing the ``config.yaml`` file directly to manage pucks. Add a new puck ^^^^^^^^^^^^^^ .. code-block:: spacemake config add_puck \ --name NAME \ # name of the puck --width_um WIDTH_UM \ --spot_diameter_um SPOT_DIAMETER_UM \ --barcodes BARCODES \ # path to the barcode file, optional --coordinate_system COORDINATE_SYSTEM # path to the coordinate system file, optional Custom snakemake rules ---------------------- As of version ``0.7`` it is now possible to add custom snakemake rules to your spacemake workflow. Simply add the following line to the ``config.yaml`` in your spacemake root folder: .. code-block:: yaml custom_rules: /path/to/my_own_custom_snakefile.smk Within your custom code, you can import spacemake modules and have access to internal variables. If you need to make spacemake aware of new top-level targets that have to be made, you can register a callback .. code-block:: python register_module_output_hook(get_my_custom_targets, "my_own_custom_snakefile.smk") The function `get_my_custom_targets()` will be called once all other, internal spacemake code has been executed and is expected to return a list of files that will be appended to the ``input:`` dependencies of the top-level rule. Providing rules to make these files is up to your custom rules. The second parameter is more for logging purposes and allows to track which module or part of the code injected which dependencies. By default, it is good practive to use the filename.