Description of pipeline OTU tables¶
OTU calling¶
The pipeline produces various OTU tables. All tables are constructed
from the set of raw dereplicated reads. These are a set of reads in
FASTA format that appear in the file datasetID.raw_dereplicated.fasta
and correspond to the set of unique sequences present in the raw data.
de novo OTU tables¶
The dereplicated reads are clustered to within the specified sequence
similarity percentage cut-off (OTU_SIMILARITY
, default: 97) using
usearch. Individual reads are assigned either as the OTU centroid or as
a match by usearch’s output. Centroids are labeled as OTU_ID.0
and matches count from
OTU_ID.1
onwards. These are separate “oligotypes” within the OTU
OTU_ID
. These are written out to an oligotype table in
datasetID.otu_table.*.denovo_oligotypes
.
Note that this definition of ‘oligotype’ is simply the unique sequences within
each OTU cluster.
All oligotype counts are then collapsed to their respective parent
OTU, resulting in a fully de novo OTU table which can be found in the
filename datasetID.otu_table.*.denovo where
*
gets replaced by the
OTU similarity cut-off.
Distribution-based OTU tables¶
A counts table is made from the dereplicated reads and dereplication
map. The counts table and dereplicated reads are used as inputs to
dbotu.py
’s call_otus()
function. Briefly, sequences are
compared against existing OTU representative sequences in descending
abundance order. If the candidate sequence’s abundance is within the
ABUNDANCE_CRITERIA
of an existing OTU and its count distribution
across samples is not significantly different from the OTU’s, then the
candidate sequence is grouped with that OTU. You can learn more about
the distribution-based clustering algorithm at
http://almlab.mit.edu/dbotu3.html
.
Intermediate files such as the membership file (which lists which
sequences belong to which OTU) and the table of sequence counts (used as
an input to the dbotu.py
function call) are in the respective
/home/ubuntu/proc/
folder for your dataset.
Closed-reference OTU table¶
There are two types of database-referencing that are outputed from the
pipeline by default. The first is assignments from the Ribosomal
Database Project (RDP), which returns probabilistic assignments. The
default probability cut-off below which an assignment is labeled as
unidentified at a given taxonomic level is 0.5, but this can be set
using a summary file parameter RDP_CUTOFF
. This OTU table can be found
in the results sub-folder called RDP
. The full classifications can be
found in /home/ubuntu/proc/
, in RDP_classifications.denovo.txt
or RDP_classifications.dbOTU.txt
.
The dereplicated reads are also aligned to a standard database
(GreenGenes in the case of 16S sequences and UNITE in the case of ITS
sequences). In the case of GreenGenes, the database is determined based
on the specified OTU similarity cut-off: e.g. 97_otus.fasta
for
OTU_SIMILARITY
set to 97). The alignment is performed using usearch,
and considers the top 10 hits. Consensus assignments are then produced
for the top 1, top 3, top 5 and top 10 hits (where a taxonomic level is
only assigned a latin name if the top N hits from GreenGenes agree), and
the corresponding OTU tables are output. Thus, the OTU table called
datasetID.otu_table.*.gg.consensus10
contains latin names which are
formed from a minimum consensus of the top 10 hits for each taxonomic
level, where ’*’ gets replaced by the OTU similarity cut-off. Levels are
left unidentified (e.g. s__
) if the consensus requirement is not
met.
Note that the database referencing process is one of the slower steps in
the pipeline, so if you only care about RDP, you can skip the GreenGenes
assignments steps by setting the parameter GG_ALIGN
to False in the
summary file. Similarly, for ITS sequences, you can skip the UNITE
assignments steps by setting the parameter UNITE_ALIGN
to False.
Open-reference OTU table¶
DEPRECATED: The pipeline no longer produces open-reference OTUs.
If any reads in the dereplicated sequences do not align to
GreenGenes/UNITE to within the specified similarity cut-off, they are
clustered with the desired similarity cut-off using usearch, and
appended to the GreenGenes/UNITE closed-reference table to produce a
full, open-reference OTU table (which combines GreenGenes-referenced and
de novo OTUs) at the desired similarity cut-off, with filename
datasetID.otu_table.gg.*.open_ref
.
Intermediate Files¶
A description of the potentially useful intermediate files that you
can find in ~/proc/
. TODO.