Running the pipeline

Processing your data

Once you have placed all relevant files and folders, including the summary file, into a single folder, call the script Master.py with the path to this folder as input. Master.py parses the summary file and launches the appropriate scripts to process your request. For example, if the folder containing the dataset is /home/ubuntu/dataset_folder and its summary file specifies the dataset ID as 'myDataset', you would run the following command from anywhere:

python ~/scripts/Master.py -i /home/ubuntu/dataset_folder

If you prefer to run the program in the background, you can add an & to the end of the command. Alternatively, you can use your favorite terminal multiplexer (e.g. screen or tmux) to run the code while continuing your work. Prefixing the command with nohup ensures that hangups, such as a closed terminal or a dropped SSH connection, do not interrupt the processing (it can be combined with a trailing & to also run in the background):

nohup python ~/scripts/Master.py -i /home/ubuntu/dataset_folder
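
If you would rather start the pipeline from another Python script than from the shell, the following is a minimal sketch of a launch that is similar in effect to the nohup command above. It assumes a POSIX system and that python on your PATH is the interpreter Master.py expects. The function names and the log file name master_launch.log are hypothetical and not part of the pipeline.

```python
import os
import subprocess


def build_command(dataset_folder, master_script="~/scripts/Master.py"):
    """Return the argument list for: python Master.py -i <dataset_folder>."""
    return ["python", os.path.expanduser(master_script), "-i", dataset_folder]


def launch_detached(dataset_folder, log_path="master_launch.log"):
    """Start Master.py in its own session so a terminal hangup does not reach it.

    log_path is a hypothetical file that collects whatever the launched
    process writes to its standard output and standard error.
    """
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            build_command(dataset_folder),
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the controlling terminal
        )
    finally:
        # The child process keeps its own copy of the file descriptor.
        log.close()
```

Calling launch_detached("/home/ubuntu/dataset_folder") returns immediately with a Popen object; its pid attribute can be used to check on the job later.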

Processing happens in a folder within the proc folder, at /home/ubuntu/proc/myDataset_proc_16S or /home/ubuntu/proc/myDataset_proc_ITS, depending on the amplicon being analyzed. Final results are put in /home/ubuntu/processing_results/. Log files are put in the /home/ubuntu/logs folder, with two files created for each dataset: stderr_datasetID.log for error and warning messages and stdout_datasetID.log for progress messages. Certain stdout and stderr messages also go to stdout_master.log and stderr_master.log in the directory from which you called the python Master.py command. Make sure to check both of these places when troubleshooting!
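
When troubleshooting, it can help to compute where a given dataset's files should be. The sketch below only builds the paths described above from a dataset ID and an amplicon; the function name and dictionary keys are hypothetical, and it assumes the /home/ubuntu layout described in this section.

```python
import posixpath

BASE = "/home/ubuntu"  # home directory assumed throughout this section


def expected_paths(dataset_id, amplicon, base=BASE):
    """Return the locations described above for one dataset.

    amplicon must be "16S" or "ITS", matching the proc folder suffixes.
    """
    if amplicon not in ("16S", "ITS"):
        raise ValueError("amplicon must be '16S' or 'ITS'")
    return {
        "proc": posixpath.join(base, "proc", f"{dataset_id}_proc_{amplicon}"),
        "results": posixpath.join(base, "processing_results"),
        "stderr_log": posixpath.join(base, "logs", f"stderr_{dataset_id}.log"),
        "stdout_log": posixpath.join(base, "logs", f"stdout_{dataset_id}.log"),
    }


if __name__ == "__main__":
    for name, path in expected_paths("myDataset", "16S").items():
        print(f"{name}: {path}")
```

The stdout_master.log and stderr_master.log files are not included, since their location depends on the directory from which Master.py was called.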

Please remove your files from the ~/proc folder once you have checked your processing results! The ~/proc folder can easily fill up: it contains the original raw files and additional copies corresponding to each processing step!
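
To see which datasets are taking up space before cleaning up, something like the following sketch can be used. It is not part of the pipeline; the function names are hypothetical, and it only reads file sizes without deleting anything.

```python
import os


def folder_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report_proc_usage(proc_dir="~/proc"):
    """Return {subfolder name: size in bytes} for each folder in proc_dir."""
    proc_dir = os.path.expanduser(proc_dir)
    sizes = {}
    for entry in sorted(os.listdir(proc_dir)):
        full = os.path.join(proc_dir, entry)
        if os.path.isdir(full):
            sizes[entry] = folder_size_bytes(full)
    return sizes


if __name__ == "__main__":
    for name, size in report_proc_usage().items():
        print(f"{name}\t{size / 1e9:.2f} GB")
```

Once you have checked the results for a dataset, its folder can be removed by hand or with shutil.rmtree from the Python standard library.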

Underlying function calls

The following table describes the underlying scripts and functions used in each step of 16S and ITS data processing. The Python scripts are, for the most part, scripts passed down within the Alm lab.

All code is in /home/ubuntu/scripts and we encourage users to take a look!

Processing step                       Function/script call
Merging                               usearch8 -fastq_mergepairs
De-multiplexing                       2.split_by_barcodes.py
Primer trimming                       1.remove_primers.py
Quality filtering (truncation)        usearch8 -fastq_filter -fastq_truncqual
Quality filtering (expected errors)   usearch8 -fastq_filter -fastq_maxee
Length trimming (FASTQ)               usearch8 -fastq_filter -fastq_trunclen
Length trimming (FASTA)               usearch8 -fastx_truncate -trunclen
Dereplication                         3.dereplicate.py
De novo clustering                    usearch8 -cluster_otus -otu_radius_pct
Closed-reference mapping              usearch8 -usearch_global -db GG_database_file -strand both
RDP taxonomy assignment               rdp_classify.py
Distribution-based clustering         dbotu.py call_otus()