Quickstart¶
Prepare your data¶
To process your data, put all of your files in the same directory and write your summary file. The summary file contains all of the instructions for processing that the pipeline will use.
Summary file¶
The summary file should be called summary_file.txt and be tab-delimited.
Summary file attributes tell the processing script what to do.
Attributes should be in all-caps in the first column, with the associated values in the second column.
Attributes should be between the #16S_start and #16S_end lines or the #ITS_start and #ITS_end lines.
Everything in the summary file is case-sensitive.
Your summary file must have the following attributes: PRIMERS_FILE, BARCODES_MAP, PROCESSED, and one of RAW_FASTQ_FILE, RAW_FASTQ_FILES, RAW_FASTA_FILE, or RAW_FASTA_FILES.
Your summary file could look like this:
DATASET_ID myDataset
#16S_start
RAW_FASTQ_FILES fastq_filemap.txt
PROCESSED False
PRIMERS_FILE None
BARCODES_MAP None
#16S_end
The script considers everything relative to the summary file, so any relative file paths must be valid with respect to where your summary file is.
All of the available summary file attributes and their associated default values are in List of 16S and ITS attributes.
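Before running the pipeline, it can help to check a summary file against the rules above. The following is a minimal sketch and not the pipeline's own parser: the function names `read_block` and `check_block` are hypothetical, and it assumes that "one of" the raw input attributes means at least one must be present.

```python
# Attributes that the quickstart says every summary file must have.
REQUIRED = {"PRIMERS_FILE", "BARCODES_MAP", "PROCESSED"}
RAW_INPUTS = {"RAW_FASTQ_FILE", "RAW_FASTQ_FILES",
              "RAW_FASTA_FILE", "RAW_FASTA_FILES"}


def read_block(text, start="#16S_start", end="#16S_end"):
    """Return {attribute: value} for the lines between the start and end markers."""
    attrs, inside = {}, False
    for line in text.splitlines():
        if line == start:
            inside = True
        elif line == end:
            inside = False
        elif inside and line.strip():
            # Attribute in the first column, value in the second (tab-delimited).
            key, _, value = line.partition("\t")
            attrs[key] = value
    return attrs


def check_block(attrs):
    """Return a list of problems found; the list is empty if none were found."""
    problems = [f"missing attribute: {name}"
                for name in sorted(REQUIRED - set(attrs))]
    if not RAW_INPUTS & set(attrs):
        problems.append("missing a raw input attribute (RAW_FASTQ_FILE, ...)")
    return problems


# The example summary file from above, written with explicit tabs.
example = "\n".join([
    "DATASET_ID\tmyDataset",
    "#16S_start",
    "RAW_FASTQ_FILES\tfastq_filemap.txt",
    "PROCESSED\tFalse",
    "PRIMERS_FILE\tNone",
    "BARCODES_MAP\tNone",
    "#16S_end",
])
attrs = read_block(example)
print(attrs)
print(check_block(attrs))
```

Because the comparison is exact, a marker or attribute written in the wrong case is reported as missing, which matches the case-sensitivity rule above.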
Data¶
The pipeline can handle FASTQ and FASTA files, either un-demultiplexed (i.e. reads for all samples are in one file) or de-multiplexed (i.e. reads for each sample are in separate files). It can also handle un-merged paired-end data.
If you have de-multiplexed data, you need to specify a file map with two columns. The first column has the data files and the second column has the corresponding sample IDs (see Input file(s) for more).
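A file map of this shape can be generated rather than typed by hand. The sketch below is only an illustration: the helper `write_file_map` is hypothetical, and it assumes that each sample ID is the file name without its extension and that the data files sit in the same directory as the map. Adjust both to match your data.

```python
from pathlib import Path
import tempfile


def write_file_map(data_dir, map_name="fastq_filemap.txt", pattern="*.fastq"):
    """Write 'data file<TAB>sample ID' lines, one per data file, with no header."""
    data_dir = Path(data_dir)
    # Assumption: the sample ID is the file name without its extension.
    rows = [(p.name, p.stem) for p in sorted(data_dir.glob(pattern))]
    # newline="\n" avoids writing carriage returns on any platform.
    with open(data_dir / map_name, "w", newline="\n") as handle:
        for filename, sample_id in rows:
            handle.write(f"{filename}\t{sample_id}\n")
    return rows


# Demonstration on a throwaway directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleB.fastq", "sampleA.fastq"):
        (Path(tmp) / name).write_bytes(b"")
    rows = write_file_map(tmp)
    map_text = (Path(tmp) / "fastq_filemap.txt").read_text()
print(map_text)
```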
If you need to de-multiplex your data (e.g. remove the barcodes), you should provide a barcodes map with sample IDs in the first column and barcodes in the second column (see De-multiplexing for more).
In general, none of your auxiliary files (e.g. primers, barcodes, and data-to-sample maps) should have headers.
Run the pipeline¶
From anywhere, you can run:
python ~/scripts/Master.py -i ~/path/to/summary/file/directory
If your data will take a while to process, we recommend using a screen or tmux session.
Troubleshooting and logs¶
Check both stderr_master.log and stdout_master.log in your current directory and the respective file in ~/logs (e.g. stderr_datasetID_proc_16s.log).
If you used nohup, also look at nohup.out in your current directory.
The most common problem is stray whitespace in one of your files, for example:
- extra tabs in summary files
- ^M (carriage return) characters in your barcodes or FASTQ file map (these come from classic Mac OS or Windows line endings)
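Both kinds of whitespace problem are hard to see in an editor but easy to detect in code. The following is a rough sketch, not part of the pipeline: `find_whitespace_problems` is a hypothetical helper, and it assumes that every non-empty line not starting with `#` should have exactly two tab-separated columns.

```python
def find_whitespace_problems(raw, expected_columns=2):
    """Return (line_number, message) pairs for suspicious lines in raw bytes."""
    problems = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if b"\r" in line:
            problems.append((number, "carriage return (^M)"))
        fields = line.rstrip(b"\r").split(b"\t")
        # Marker lines such as #16S_start have a single column, so skip them.
        is_data_line = line.strip() and not line.startswith(b"#")
        if is_data_line and len(fields) != expected_columns:
            problems.append((number, f"{len(fields)} tab-separated fields"))
    return problems


# Line 2 has a trailing tab; line 3 ends with a carriage return.
sample = b"#16S_start\nPROCESSED\tFalse\t\nPRIMERS_FILE\tNone\r\n#16S_end\n"
found = find_whitespace_problems(sample)
print(found)
```

To check a real file, pass its raw bytes, for example `find_whitespace_problems(Path("summary_file.txt").read_bytes())` after `from pathlib import Path`. Reading bytes rather than text matters here, because text mode can silently translate line endings.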
The next most common problem is typos:
- check your file paths
- check the summary file attributes and values
Sometimes processing fails because your data does something unexpected. An easy way to troubleshoot these problems is to look at the file sizes in ~/proc/yourdataset_proc_16S.
File suffixes indicate what’s been done to each file:
- .pt: primer trimmed
- .qt: quality trimmed
- .sb: split by barcodes
- .lt: length trimmed
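The file-size check can be scripted so that empty or unexpectedly small files stand out. This is a minimal sketch under stated assumptions: `describe_steps` and `report_sizes` are hypothetical helpers, and the code assumes that suffixes are chained onto the file name in processing order (e.g. `reads.fastq.sb.pt`), which the text above does not confirm.

```python
from pathlib import Path
import tempfile

# Suffix meanings taken from the list above.
STEP_SUFFIXES = {
    ".pt": "primer trimmed",
    ".qt": "quality trimmed",
    ".sb": "split by barcodes",
    ".lt": "length trimmed",
}


def describe_steps(filename):
    """List the processing steps named by a file's suffixes, left to right."""
    return [STEP_SUFFIXES[s] for s in Path(filename).suffixes if s in STEP_SUFFIXES]


def report_sizes(proc_dir):
    """Return (name, size_in_bytes, steps) for each file, smallest first."""
    files = [p for p in Path(proc_dir).iterdir() if p.is_file()]
    rows = [(p.name, p.stat().st_size, describe_steps(p.name)) for p in files]
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "reads.fastq.sb").write_bytes(b"ACGT\n")
    (Path(tmp) / "reads.fastq.sb.pt").write_bytes(b"")
    report = report_sizes(tmp)
for name, size, steps in report:
    print(f"{size:>10}  {name}  ({', '.join(steps)})")
```

A zero-byte file sorted to the top of the report suggests that the step named by its last suffix discarded every read or did not finish, which is a reasonable place to start looking in the logs.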
Getting help¶
If you’re still getting an error, you can email Claire and Thomas for help with your problem. You can also email the alm-comp [at] mit.edu email list to see if others have encountered something similar.