Preparing your dataset for processing¶
Raw data files¶
The first step to process your data is to understand what raw data you have.
- Do you have FASTQ or FASTA data?
- Are your sequences already de-multiplexed with one file per sample, or will you need to split sequences by barcodes?
- Do you have unmerged forward and reverse reads that you’ll need to merge?
- Are the primers still in the sequences?
- Are all reads already trimmed to a certain length?
Every sequencing center provides a different kind of “raw data”, so make sure you know what you’re starting with! The pipeline takes many different kinds of inputs and can perform many different processing steps, so it’s important to know what you’ll be needing.
Summary file¶
This file, named summary_file.txt, is a machine-readable, tab-delimited file that must accompany any dataset directory when uploaded to the cloud. It orchestrates all processing that will happen, and is where you specify each of your processing requests. It should be found in the highest directory level for the dataset directory in the S3 bucket. It is a text file with descriptors for the data and paths to all relevant datafiles within the directory. It can include a True or False flag for whether any associated raw 16S/ITS data has already been processed. It is case-sensitive.
The order in which items are listed between the lines #16S_start and #16S_end (for 16S) and between the lines #ITS_start and #ITS_end (for ITS) does not matter.
Note that any white space in the summary file should correspond to a single tab character.
Summary file format and options¶
The first line in your summary file should be:
DATASET_ID myDataset
16S and ITS processing parameters are specified by attributes placed on separate lines between #16S_start and #16S_end and/or #ITS_start and #ITS_end. All white space in the summary file should be tabs.
Required attributes are:
- PRIMERS_FILE, BARCODES_MAP, PROCESSED,
- and one of: RAW_FASTQ_FILE, RAW_FASTQ_FILES, RAW_FASTA_FILE, or RAW_FASTA_FILES.
For processing to occur, PROCESSED should be False (case-sensitive).
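As a rough illustration of the format described above, the following sketch parses summary file text and reports missing required attributes. It is not the pipeline's own parser: the function names `parse_summary` and `missing_required` are hypothetical, and the sketch assumes each attribute line is exactly `NAME<tab>value`.

```python
# Hypothetical helper, not part of the pipeline: parse a summary file and
# check the required attributes listed in this section.
REQUIRED = ("PRIMERS_FILE", "BARCODES_MAP", "PROCESSED")
INPUT_KEYS = ("RAW_FASTQ_FILE", "RAW_FASTQ_FILES",
              "RAW_FASTA_FILE", "RAW_FASTA_FILES")


def parse_summary(text):
    """Return (dataset_id, sections), where sections maps e.g. "16S" to a dict."""
    dataset_id = None
    sections = {}
    current = None  # name of the section currently being read, if any
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#") and line.endswith("_start"):
            current = line[1:-len("_start")]
            sections[current] = {}
        elif line.startswith("#") and line.endswith("_end"):
            current = None
        else:
            # Assumption: attribute name and value are separated by one tab.
            key, _, value = line.partition("\t")
            if current is None:
                if key == "DATASET_ID":
                    dataset_id = value
            else:
                sections[current][key] = value
    return dataset_id, sections


def missing_required(attrs):
    """List the required attributes that are absent from one section."""
    missing = [key for key in REQUIRED if key not in attrs]
    if not any(key in attrs for key in INPUT_KEYS):
        missing.append("one of " + ", ".join(INPUT_KEYS))
    return missing


example = (
    "DATASET_ID\tmyDataset\n"
    "#16S_start\n"
    "RAW_FASTQ_FILE\tmyData.fastq\n"
    "PRIMERS_FILE\tprimers.txt\n"
    "BARCODES_MAP\tNone\n"
    "PROCESSED\tFalse\n"
    "#16S_end\n"
)
dataset_id, sections = parse_summary(example)
print(dataset_id, missing_required(sections["16S"]))
```

Because attributes are stored in a dictionary per section, their order between the start and end lines does not matter, which mirrors the behavior described above.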
The following sections go through all available summary file attributes, which are summarized in List of 16S and ITS attributes. These options are presented in roughly the same order as they are processed.
Input file(s)¶
The pipeline takes as input many kinds of raw data:
- one FASTQ file. This file should contain all sequences for all of your samples. Sequences may still contain barcodes, or they can have had barcodes removed and FASTQ headers replaced with sample identifiers. Specify with RAW_FASTQ_FILE.
- multiple FASTQ files. Each of these files should contain sequences for one sample only. Specify with RAW_FASTQ_FILES.
- one FASTA file. This file should contain all sequences for all of your samples, and should have a sample identifier in the FASTA header line. Specify with RAW_FASTA_FILE.
- multiple FASTA files. Each of these files should contain sequences for one sample only. Specify with RAW_FASTA_FILES.
If you have one sequence file, RAW_FASTQ_FILE or RAW_FASTA_FILE should refer to the FASTQ/A file name.
If you have multiple sequence files, RAW_FASTQ_FILES or RAW_FASTA_FILES should refer to a text file that relates each FASTQ/A file to its sample identifier. This text file is tab-delimited with the FASTQ/A file name in the first column and the corresponding sample ID in the second column. This text file should not contain a header.
An example fastq2sid.txt file mapping each FASTQ file to its sample is:
SRR1324.fastq sample1
SRR1325.fastq sample2
SRR1326.fastq sample3
...
Note that all file paths provided should be relative to the summary file directory. If your FASTQ to sample ID file map is in a subdirectory called filemaps/ and your sequence files are in a subdirectory called datafiles/, then RAW_FASTQ_FILES would be filemaps/fastq_filemap.txt, and the entries in fastq_filemap.txt would include the datafiles/ prefix (e.g. datafiles/SRR1324.fastq, etc.).
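The file map layout and the relative-path rule can be sketched as follows. This is only an illustration: `read_filemap` is a hypothetical helper, and the dataset directory `/data/myDataset` is made up for the example.

```python
import posixpath  # forward-slash paths, as used in the examples above


def read_filemap(filemap_text, summary_dir):
    """Parse a header-less, two-column, tab-delimited file map.

    Returns a list of (path, sample_id) pairs, with each path joined onto
    the directory that holds the summary file.
    """
    pairs = []
    for line in filemap_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        file_name, sample_id = line.split("\t")
        pairs.append((posixpath.join(summary_dir, file_name), sample_id))
    return pairs


filemap_text = (
    "datafiles/SRR1324.fastq\tsample1\n"
    "datafiles/SRR1325.fastq\tsample2\n"
)
# "/data/myDataset" is a made-up location for the summary file directory.
for path, sample_id in read_filemap(filemap_text, "/data/myDataset"):
    print(sample_id, path)
```

A line with more or fewer than two tab-separated fields raises a `ValueError` here, which is a convenient way to catch file maps that use spaces instead of tabs.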
Merging¶
If you have unmerged paired-end reads, you can merge them by setting MERGE_PAIRS to True (case-sensitive). Merging can be performed both when your reads are not de-multiplexed (i.e. you have one file with all of the forward reads for all of your samples, and one file with all of the reverse reads) and when they are (i.e. you have two files per sample: one with the forward reads and one with the reverse reads).
If you need to merge reads, you also need to specify the file name suffixes for both the forward and reverse read files. These are specified in FWD_SUFFIX and REV_SUFFIX and default to _R1.fastq and _R2.fastq, respectively. If you have more complicated file names, either rename them to have consistent suffixes, or talk to Claire to see if you can incorporate more complicated regex matching into the code.
Note that the files specified either in RAW_FASTQ_FILE or in the fastq_filemap.txt (specified in RAW_FASTQ_FILES) should refer to the full name of the FASTQ file(s) containing the forward reads.
See Case 4: multiple demultiplexed raw paired-end FASTQ files of 16S sequences which need merging for an example.
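One way to picture how the suffixes relate the two files of a pair is the sketch below. It assumes the reverse file name is obtained by swapping the forward suffix for the reverse suffix; `reverse_read_name` is a hypothetical helper, and its defaults simply mirror the values stated in this section.

```python
def reverse_read_name(fwd_name, fwd_suffix="_R1.fastq", rev_suffix="_R2.fastq"):
    """Derive the reverse-read file name from the forward-read file name.

    Assumption: the two files differ only in their suffix.
    """
    if not fwd_name.endswith(fwd_suffix):
        raise ValueError(f"{fwd_name!r} does not end with {fwd_suffix!r}")
    return fwd_name[: -len(fwd_suffix)] + rev_suffix


print(reverse_read_name("sample1_R1.fastq"))
print(reverse_read_name("S1_L001_R1.fastq", "_L001_R1.fastq", "_L001_R2.fastq"))
```

This is why the forward file names must end with exactly the suffix given in FWD_SUFFIX: a name that does not match cannot be paired with its reverse file.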
De-multiplexing¶
If your FASTQ file still contains the barcodes in the sequences, you will need to include BARCODES_MAP and BARCODES_MODE in your summary file. Note that BARCODES_MAP is a required attribute. If you do not need to de-multiplex your sequences, BARCODES_MAP should be None (case-sensitive).
BARCODES_MAP refers to a tab-delimited file which has the sample identifiers in the first column and the corresponding barcode sequence in the second column. This file does not have a header.
An example BARCODES_MAP file could be:
C01 AGAGACAT
C03 AGAGATGT
C05 AGATGTAG
C07 AGCGATCT
C09 AGCTCTAG
C11 AGTACGAG
You may also specify a BARCODES_MODE, which specifies where the barcodes are to be found in the FASTQ file. If the barcodes are still in the sequences, BARCODES_MODE should be 2. If the barcodes are in the FASTQ sequence header, BARCODES_MODE should be 1. BARCODES_MODE defaults to 2.
See Case 1: raw FASTQ file of 16S sequences, still includes primers and barcodes for an example.
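To make the barcode map concrete, here is a minimal sketch of assigning a read to a sample when barcodes are still in the sequences (mode 2). It assumes the barcode sits at the very start of the read and must match exactly; the pipeline's actual matching rules may differ, and both function names are hypothetical.

```python
def read_barcodes_map(text):
    """Map barcode sequence -> sample ID from a header-less, tab-delimited map."""
    barcode_to_sample = {}
    for line in text.splitlines():
        if line.strip():
            sample_id, barcode = line.split("\t")
            barcode_to_sample[barcode] = sample_id
    return barcode_to_sample


def assign_sample(sequence, barcode_to_sample):
    """Return (sample_id, sequence without barcode), or (None, sequence).

    Assumption: the barcode is an exact prefix of the read.
    """
    for barcode, sample_id in barcode_to_sample.items():
        if sequence.startswith(barcode):
            return sample_id, sequence[len(barcode):]
    return None, sequence


barcodes = read_barcodes_map("C01\tAGAGACAT\nC03\tAGAGATGT\n")
print(assign_sample("AGAGATGTCCTACGGG", barcodes))
```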
Sometimes, the ’raw’ data has already had primers and barcodes removed but still has all samples in the same FASTQ file. In this case, BARCODES_MAP should be None and the sample IDs must be listed in the sequence header lines of the FASTQ file. If there is text other than the sample ID in the header, you need to specify the first non-sample ID character in BARCODES_SEPARATOR. For example, sequences in these kinds of files are often labeled like:
@sample1_seq1
...<rest of fastq record>
@sample1_seq2
...<rest of fastq record>
@sample2_seq1
...<rest of fastq record>
@sample3_seq1
...<rest of fastq record>
@sample2_seq2
...<rest of fastq record>
In this case, the barcodes separator would be an underscore (_), which is the default.
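The role of the separator can be sketched as follows, assuming the sample ID is everything in the header between the leading "@" and the first occurrence of the separator character. The helper name `sample_id_from_header` is hypothetical.

```python
def sample_id_from_header(header, separator="_"):
    """Extract the sample ID from a FASTQ header line.

    Assumption: the sample ID runs up to the first separator character.
    """
    name = header[1:] if header.startswith("@") else header
    return name.split(separator, 1)[0]


print(sample_id_from_header("@sample1_seq1"))
print(sample_id_from_header("@sample2;17", separator=";"))
```

Under this assumption, a sample ID that itself contains the separator character (e.g. `gut_01` with an underscore separator) would be cut short, so it is worth checking your sample names against the separator you choose.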
Primer trimming¶
If you need to remove primers from your sequences, you can specify PRIMERS_FILE, a text file with your primer sequences. Note that PRIMERS_FILE is a required attribute. If you do not need to remove primers from your sequences, PRIMERS_FILE should be None (case-sensitive).
Your primers file should have each primer on its own line and no header:
CCTACGGGAGGCAGCAG
ATTACCGCGGCTGCT
The pipeline does not currently remove reverse primers. If your sequences still contain reverse primers, you can remove them yourself or trim your sequences to a length shorter than the start of your reverse primer.
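A simplified picture of forward primer removal is given below. It assumes an exact match at the start of the read and does not handle mismatches or degenerate bases; returning `None` when no primer is found is a choice made for this sketch, not documented pipeline behavior.

```python
def trim_forward_primer(sequence, primers):
    """Strip the first listed primer that matches the start of the read.

    Returns the trimmed sequence, or None if no primer matches
    (what to do with such reads is left to the caller in this sketch).
    """
    for primer in primers:
        if sequence.startswith(primer):
            return sequence[len(primer):]
    return None


# Primers as they would be read from a primers file, one per line.
primers = "CCTACGGGAGGCAGCAG\nATTACCGCGGCTGCT\n".split()
print(trim_forward_primer("CCTACGGGAGGCAGCAGTGGGGAAT", primers))
```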
Quality filtering¶
There are two ways to quality filter your sequences. One is based on the number of expected errors in your sequence, and the other truncates reads after a certain quality is encountered. You can learn more about these approaches by reading the USEARCH documentation: http://www.drive5.com/usearch/manual/readqualfiltering.html
To truncate reads after a base with a certain quality is encountered, use the QUALITY_TRIM option. A default value that is often used is 25. This step is performed before length trimming.
To discard reads based on their number of expected errors, use the MAX_ERRORS option. A default value that is often used is 2 (i.e. reads with more than 2 expected errors are discarded). This step is performed after length trimming.
If nothing is specified, the pipeline defaults to QUALITY_TRIM of 25. If both MAX_ERRORS and QUALITY_TRIM are specified, quality filtering by truncation is performed (i.e. MAX_ERRORS is ignored).
You may also need to specify the encoding of the quality scores. ASCII_ENCODING can be either ASCII_BASE_33 (default) or ASCII_BASE_64. You can check the encoding of your file using usearch: usearch -fastq_chars yourFASTQfile.fastq
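The two filtering approaches and the role of the encoding can be illustrated with a short sketch. It assumes standard Phred scores, where a score Q corresponds to an error probability of 10^(-Q/10), so the expected number of errors in a read is the sum of those probabilities. The function names are hypothetical, and the pipeline itself relies on USEARCH rather than code like this.

```python
def decode_quality(quality_string, encoding="ASCII_BASE_33"):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    offset = {"ASCII_BASE_33": 33, "ASCII_BASE_64": 64}[encoding]
    return [ord(char) - offset for char in quality_string]


def truncate_at_quality(sequence, scores, min_quality=25):
    """Cut the read just before the first base scoring below min_quality."""
    for index, score in enumerate(scores):
        if score < min_quality:
            return sequence[:index], scores[:index]
    return sequence, scores


def expected_errors(scores):
    """Sum of per-base error probabilities, 10 ** (-Q / 10) for each score Q."""
    return sum(10 ** (-score / 10) for score in scores)


scores = decode_quality("II5+")  # base-33 encoding: I = 40, 5 = 20, + = 10
print(scores)
print(truncate_at_quality("ACGT", scores, min_quality=25))
print(expected_errors(scores) <= 2)  # would pass a MAX_ERRORS of 2
```

Decoding with the wrong offset shifts every score by 31, which is why a mismatched ASCII_ENCODING can make good reads look unusable or the reverse.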
Length trimming¶
By default, the pipeline trims all reads to 101 base pairs before dereplication and clustering. You can specify a different length by using TRIM_LENGTH. Any reads which are shorter than the specified length are discarded.
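The trimming rule amounts to the following sketch (the function name `trim_to_length` is hypothetical): reads shorter than the target length are dropped, and all remaining reads are cut to exactly that length.

```python
def trim_to_length(sequences, trim_length=101):
    """Drop reads shorter than trim_length and cut the rest to that length."""
    return [seq[:trim_length] for seq in sequences if len(seq) >= trim_length]


# A short length is used here only to keep the example readable.
print(trim_to_length(["ACGTAC", "ACG", "ACGT"], trim_length=4))
```

After this step every read has the same length, which is what allows identical sequences to be collapsed during dereplication.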
Dereplication¶
In the dereplication step, unique sequences are identified and the samples from which they came are tracked (sometimes referred to as “provenancing”). By default, unique sequences which are present fewer than 10 times in the entire dataset are discarded. If you want to change this number, specify it with MIN_COUNT (e.g. if MIN_COUNT is 2, only singleton sequences are discarded).
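Dereplication with provenance tracking can be sketched with nested counters. This is an illustration of the idea rather than the pipeline's implementation; `dereplicate` is a hypothetical name, and reads are assumed to arrive as (sample ID, sequence) pairs.

```python
from collections import Counter, defaultdict


def dereplicate(reads, min_count=10):
    """Collapse identical sequences while tracking per-sample counts.

    reads: iterable of (sample_id, sequence) pairs.
    Returns {sequence: {sample_id: count}} for sequences seen at least
    min_count times across the entire dataset.
    """
    per_sample = defaultdict(Counter)
    for sample_id, sequence in reads:
        per_sample[sequence][sample_id] += 1
    return {
        sequence: dict(counts)
        for sequence, counts in per_sample.items()
        if sum(counts.values()) >= min_count
    }


reads = [("s1", "AAAA"), ("s2", "AAAA"), ("s1", "CCCC")]
# With min_count=2, only the singleton CCCC is discarded.
print(dereplicate(reads, min_count=2))
```

Note that the threshold applies to the total count across all samples, so a sequence seen once in each of ten samples would still be kept under the default.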
OTU calling¶
You can specify the similarity used to define OTUs in the OTU_SIMILARITY attribute. The default value is 97, corresponding to 97% OTUs.
By default, the pipeline clusters OTUs using both de novo and closed-reference approaches. If you specify an OTU similarity that does not have a corresponding Green Genes reference file, closed-reference clustering will not be performed. OTU similarities supported by Green Genes closed-reference mapping are: 61, 64, 67, 70, 73, 76, 79, 82, 85, 88, 91, 94, 97, and 99%. The database files used for this mapping can be found in /home/ubuntu/databases/gg_13_5_otus/rep_set_latin/.
The pipeline assigns taxonomies to de novo OTUs using the naive-Bayes RDP classifier. By default, the confidence cutoff is 0.5. You can specify a different value with the RDP_CUTOFF attribute.
Distribution-based OTU calling¶
The pipeline also performs distribution-based OTU calling [1]. You can set the abundance, distance and p value criteria in the summary file attributes DISTANCE_CRITERIA, ABUNDANCE_CRITERIA, and DBOTU_PVAL.
Distribution-based clustering is not performed by default. You can turn it on by setting the summary file attribute DBOTU to True.
Sample summary files¶
Case 1: raw FASTQ file of 16S sequences, still includes primers and barcodes¶
The simplest case is if you have the following files: a raw FASTQ file; a file specifying the map between barcode sequences and IDs; and a file specifying the primers used. Your summary file would look something like this:
DATASET_ID myDataset
#16S_start
RAW_FASTQ_FILE myData.fastq
ASCII_ENCODING ASCII_BASE_33
PRIMERS_FILE primers.txt
BARCODES_MAP barcodes_map.txt
BARCODES_MODE 2
METADATA_FILE metadata.txt
PROCESSED False
#16S_end
Note that you must also specify the place where barcodes are to be found, i.e. either in the “@” sequence ID lines (mode 1) or in the sequences themselves (mode 2). The PROCESSED flag tells the processing instance that the dataset needs to be processed into OTU tables.
Your barcodes_map.txt file would look something like this:
S1 ATCGCTAGTA
S2 TCGCTATATA
S3 TCTACAGCGT
S4 CGTACTCAGT
And your primers.txt file could be:
CCTACGGGAGGCAGCAG
ATTACCGCGGCTGCT
Case 2: raw FASTQ file of ITS sequences, primers and barcodes have been removed¶
In the case where the ’raw’ data has already had primers and barcodes removed (but is not yet de-multiplexed, i.e. all samples are still in the same FASTQ file), the sample IDs must be listed in the sequence ID lines of the FASTQ file. When the pipeline removes barcodes itself and replaces them with sample IDs, individual sequence reads for a given sampleID are annotated as sampleID;1, sampleID;2, etc. (that is, the BARCODES_SEPARATOR is ’;’). However, in a dataset where the barcodes have previously been removed, you will have to look into the FASTQ file to check the ’separator’ character. Your summary file would look something like this:
DATASET_ID myDataset
#ITS_start
RAW_FASTQ_FILE myData.fastq
ASCII_ENCODING ASCII_BASE_33
PRIMERS_FILE None
BARCODES_MAP None
BARCODES_SEPARATOR ;
METADATA_FILE metadata.txt
PROCESSED False
#ITS_end
Case 3: multiple demultiplexed raw FASTQ or FASTA files of 16S sequences, each file corresponding to a single sample¶
Sometimes sequencing data are available in a demultiplexed form, where the reads for each sample are split into separate files. Many datasets in the SRA, for example, are available in this form. In this case, you can create a two-column, tab-delimited file where the first column lists the filename and the second column lists the corresponding sample ID. Note that paths should be relative paths within the current directory, e.g. datafiles/file1.txt for files in a folder called datafiles within the current directory. In the summary file, the RAW_FASTQ_FILE line becomes RAW_FASTQ_FILES (plural), and instead refers to this filename. If your files are FASTA rather than FASTQ, simply use RAW_FASTA_FILES (also plural). For a filename fastq_filemap.txt, your summary file would look something like this:
DATASET_ID myDataset
#16S_start
RAW_FASTQ_FILES fastq_filemap.txt
ASCII_ENCODING ASCII_BASE_33
METADATA_FILE metadata.txt
PROCESSED False
PRIMERS_FILE None
BARCODES_MAP None
#16S_end
And your fastq_filemap.txt file would look something like this (note that white spaces in the following example correspond to a single tab character):
SRR10001.fastq S1
SRR10002.fastq S2
SRR10003.fastq S3
SRR10004.fastq S4
Case 4: multiple demultiplexed raw paired-end FASTQ files of 16S sequences which need merging¶
If your 16S FASTQ files are split into forward and reverse paired-end reads, the pipeline can merge them for you. Specify MERGE_PAIRS in the summary file, and also include the filename suffixes corresponding to forward and reverse reads. If your forward read fastq files were named sampleID_L001_R1.fastq and your reverse read fastq files were named sampleID_L001_R2.fastq, your summary file would look something like this:
DATASET_ID myDataset
#16S_start
RAW_FASTQ_FILES fastq_filemap.txt
PRIMERS_FILE None
BARCODES_MAP None
MERGE_PAIRS True
FWD_SUFFIX _L001_R1.fastq
REV_SUFFIX _L001_R2.fastq
PROCESSED False
#16S_end
And your fastq_filemap.txt file would look like this:
S1_L001_R1.fastq S1
S2_L001_R1.fastq S2
S3_L001_R1.fastq S3
S4_L001_R1.fastq S4
If you have de-multiplexed files (as in this example), the file names in your fastq_filemap.txt file should be the forward read fastq files.
If instead you have non-demultiplexed sequences (i.e. two fastq files, one containing your forward reads and one containing your reverse reads), RAW_FASTQ_FILE should point to the file containing the forward reads.
List of 16S and ITS attributes¶
Attribute | Description |
---|---|
RAW_FASTQ_FILE | Raw FASTQ file name/path within the dataset directory. If MERGE_PAIRS is True, this should be the full name of the forward read file. |
RAW_FASTA_FILE | Raw FASTA file name/path if raw data is in FASTA format |
RAW_FASTQ_FILES | For demultiplexed datasets where samples are separated into separate FASTQ files. Filename of two column file containing FASTQ filenames in first column and sample IDs in the second column. If MERGE_PAIRS is True, these file names should be the full forward read file names. |
RAW_FASTA_FILES | For demultiplexed datasets where samples are separated into separate FASTA files. Filename of two column file containing FASTA filenames in first column and sample IDs in the second column. |
ASCII_ENCODING | ASCII quality encoding in FASTQ. Supports either ASCII_BASE_33 or ASCII_BASE_64. Defaults to ASCII_BASE_33 if unspecified. |
PRIMERS_FILE | Filename/path to primers file. Required: if primers have already been removed, specify None. |
BARCODES_MAP | Filename/path to barcodes map file. Tab-delimited file containing sample IDs in the first column and barcode sequences in the second column. Required: if barcodes have already been removed, specify None. |
BARCODES_MODE | 1 = barcodes in sequence ID; 2 = barcodes in sequences themselves; 3 = barcodes in separate index file (beta). Required if BARCODES_MAP is not None. |
BARCODES_SEPARATOR | Separator character. See description in De-multiplexing |
METADATA_FILE | Filename/path to metadata file. |
MERGE_PAIRS | If need to merge paired-end reads, set to True . |
FWD_SUFFIX | Filename suffix of files with forward reads. Should include filename extension. If not specified, defaults to _1.fastq |
REV_SUFFIX | Filename suffix of files with reverse reads. Should include filename extension. If not specified, defaults to _2.fastq |
PROCESSED | True/False flag for whether data have already been processed. Required: set to False for processing to proceed. |
TRIM_LENGTH | Length to which all sequences should be trimmed. Defaults to 101 if unspecified. |
QUALITY_TRIM | Minimum quality score allowed. Sequences are truncated at the first base having quality score less than value. Defaults to 25 if unspecified. If set to None, no quality filtering or trimming will be performed. If both QUALITY_TRIM and MAX_ERRORS are included in summary file, MAX_ERRORS will be ignored (even if QUALITY_TRIM = None). |
MAX_ERRORS | Maximum expected errors allowed. After length trimming, sequences with more than MAX_ERRORS expected errors are discarded. If not specified or if a QUALITY_TRIM value is specified, defaults to quality trimming behavior, above. |
MIN_COUNT | Minimum sequence count in dereplication across all samples. Defaults to 10 if unspecified (i.e. sequences with fewer than 10 occurrences in the entire dataset will not be considered downstream). |
OTU_SIMILARITY | Integer specifying the percent similarity desired in OTU clustering. Defaults to 97 if unspecified. |
RDP_CUTOFF | Desired probability cut-off for Ribosomal Database Project assignments. Assignments at each taxonomic level will be evaluated and those with a lower probability than this cutoff will be labeled as unidentified. Defaults to 0.5 if unspecified. |
GG_ALIGN | Specific to 16S sequences. True/False flag for whether GreenGenes alignments are desired. Defaults to True if unspecified. |
UNITE_ALIGN | Specific to ITS sequences. True/False flag for whether UNITE alignments are desired. Defaults to True if unspecified. |
DBOTU | Whether to perform distribution-based OTU calling. Defaults to False if unspecified. |
ABUNDANCE_CRITERIA | Abundance criteria for distribution-based OTU calling. Defaults to 10 if unspecified. |
DISTANCE_CRITERIA | Distance criteria for distribution-based OTU calling. Defaults to 0.1 if unspecified. |
DBOTU_PVAL | P value cutoff for distribution-based OTU calling. Defaults to 0.0005 if unspecified. |
OUTDIR | Full path to processing directory. Defaults to /home/ubuntu/proc/ if not specified. |

[1] http://almlab.mit.edu/dbotu3.html