Pipeline output files and directories
The pipeline outputs different OTU tables and corresponding
representative sequences. All final outputs can be found in the
processing_results folder, under a sub-directory labeled
myDataset_results. Files in this directory are labeled systematically,
usually with the format
Files in the top results directory are as follows:
- myDataset.otu_seqs.N.fasta: FASTA file with the representative sequences for the denovo OTUs, clustered at N%.
- myDataset.otu_seqs.dbOTU.fasta: FASTA file with the representative sequences for the distribution-based OTUs.
- myDataset.otu_table.N.denovo: OTU table with N% denovo OTUs labeled denovo1, denovo2, … in the rows and samples in the columns.
- myDataset.otu_table.N.dbOTU: OTU table with distribution-based OTUs labeled dbotu1, dbotu2, … in the rows and samples in the columns.
- myDataset.otu_table.N.denovo_oligotypes: OTU table with N% denovo OTUs separated into unique oligotypes. Each OTU is labeled denovo1.1, denovo1.2, denovo2.1, denovo2.2, denovo2.3, etc. The first number corresponds to the parent denovo OTU number; the second is the oligotype number. Each unique sequence within an OTU cluster is treated as a separate oligotype.
- myDataset.raw_dereplicated.fasta: FASTA file with all unique
sequences in the dataset. Only sequences that appear more than
MIN_COUNT times, as specified in the summary file, are included (default value for
- summary_file.txt: Updated summary file containing original processing request and resulting file names.
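The oligotype labeling scheme above (denovo1.1, denovo1.2, ...) can be split back into a parent OTU and an oligotype number. The following is a minimal sketch, not part of the pipeline: it assumes the OTU tables are tab-delimited text with a header row of sample names, which this page does not confirm, and the helper names `read_otu_table`, `split_oligotype_label`, and `collapse_to_parent_otus` are hypothetical.

```python
import csv
import io


def read_otu_table(handle, delimiter="\t"):
    """Read an OTU table with OTUs in rows and samples in columns.

    Assumption: the first row is a header whose first cell labels the
    OTU column and whose remaining cells are sample names.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [float(value) for value in row[1:]]
    return samples, table


def split_oligotype_label(label):
    """Split a label such as 'denovo1.2' into ('denovo1', 2)."""
    parent, _, oligotype = label.rpartition(".")
    return parent, int(oligotype)


def collapse_to_parent_otus(table):
    """Sum the rows of all oligotypes that share a parent denovo OTU."""
    collapsed = {}
    for label, counts in table.items():
        parent, _ = split_oligotype_label(label)
        if parent in collapsed:
            collapsed[parent] = [a + b for a, b in zip(collapsed[parent], counts)]
        else:
            collapsed[parent] = list(counts)
    return collapsed


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = "OTU_ID\ts1\ts2\ndenovo1.1\t3\t0\ndenovo1.2\t1\t2\ndenovo2.1\t5\t5\n"
    samples, table = read_otu_table(io.StringIO(example))
    print(samples)
    print(collapse_to_parent_otus(table))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, and adjust `delimiter` if the tables turn out not to be tab-separated.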
Also within each dataset’s
processing_results directory, 3
sub-directories are created:
Within the results folder, there is a subfolder called
which contains various plots diagnostic of dataset quality. Currently,
the pipeline outputs:
- Histogram showing distribution of read lengths, taken from the first 100,000 reads in the raw FASTQ file.
- Bar chart showing number of reads per sample.
- File showing percentage of reads thrown out at each processing step.
Note that the information in these files is not always accurate. Treat them as a first look at the data and perform more thorough quality control yourself.
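If you want to cross-check the read-length histogram yourself, the counts can be recomputed directly from the raw FASTQ file. The sketch below is an illustration rather than the pipeline's own code: it assumes a standard four-line-per-record FASTQ layout, and the function name `read_length_counts` is hypothetical.

```python
from collections import Counter
from itertools import islice


def read_length_counts(lines, max_reads=100_000):
    """Count read lengths over the first `max_reads` FASTQ records.

    `lines` is any iterable of text lines, such as an open file.
    Assumption: every record spans exactly four lines
    (header, sequence, '+', quality).
    """
    counts = Counter()
    # The sequence is the second line of each four-line record.
    sequence_lines = islice(lines, 1, None, 4)
    for sequence in islice(sequence_lines, max_reads):
        counts[len(sequence.strip())] += 1
    return counts


if __name__ == "__main__":
    # Made-up reads for illustration only.
    example = [
        "@read1\n", "ACGT\n", "+\n", "IIII\n",
        "@read2\n", "ACGTAC\n", "+\n", "IIIIII\n",
    ]
    for length, n_reads in sorted(read_length_counts(example).items()):
        print(length, n_reads)
```

Raising `max_reads` above the default of 100,000 lets you check whether the first reads in the file are representative of the whole run.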
This folder contains the RDP-assigned OTU tables. It has two files:
myDataset.otu_table.N.denovo.rdp_assigned: OTU table with denovo OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
where the denovoID corresponds to the respective sequence in
myDataset.otu_table.dbOTU.rdp_assigned: OTU table with distribution-based OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
where the dbotuID corresponds to the respective sequence in
This folder contains the Green Genes-assigned OTU table. It has multiple files, which are described further in the following section.
myDataset.otu_table.N.gg.consensusM: closed-reference OTU table with dereplicated sequences mapped to Green Genes using usearch. OTUs are in rows and samples are in columns. OTU names are of the format:
where the derepID corresponds to the sequence in
../myDataset.raw_dereplicated.fasta with the same ID number.
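To recover the sequence behind a given ID, the dereplicated FASTA file can be loaded into a dictionary keyed by sequence ID. This is a minimal sketch under stated assumptions: the ID is taken to be the first whitespace-delimited token of each header line, which may not match the pipeline's actual header format, and `read_fasta` is a hypothetical helper name.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines.

    Assumption: the ID is the first whitespace-delimited token
    following '>' on each header line.
    """
    records = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            chunks = []
        elif current_id is not None:
            # Sequences may wrap over several lines; collect them all.
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [">7\n", "ACGT\n", "ACGT\n", ">8\n", "TTGA\n"]
    sequences = read_fasta(example)
    print(sequences["7"])
```

With a real file, `read_fasta(open(path))` returns the full mapping, and the ID extracted from an OTU name can then be used as the dictionary key.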