Pipeline output files and directories
The pipeline outputs several OTU tables and their corresponding representative sequences. All final outputs can be found in the processing_results folder, under a sub-directory labeled myDataset_results. Files in this directory are labeled systematically, usually with the format myDataset.file_description.otu_similarity.file_type.
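The naming convention above can be split back into its parts when scripting against the results folder. The following is a minimal sketch, not part of the pipeline: the function and class names are hypothetical, and 97 is only an assumed example value for N. Because the convention holds "usually" rather than always, names with a different number of dot-separated parts return None instead of being guessed at.

```python
from typing import NamedTuple, Optional


class OutputName(NamedTuple):
    """Parts of a name shaped like myDataset.file_description.otu_similarity.file_type."""
    dataset: str
    description: str
    similarity: str
    file_type: str


def parse_output_name(filename: str) -> Optional[OutputName]:
    """Split a results file name into its four parts (hypothetical helper).

    Returns None when the name does not have exactly four dot-separated
    parts, which is the case for files such as
    myDataset.raw_dereplicated.fasta.
    """
    parts = filename.split(".")
    if len(parts) != 4:
        return None
    return OutputName(*parts)


# 97 is an assumed example value for N, not a documented default.
parsed = parse_output_name("myDataset.otu_seqs.97.fasta")
print(parsed)
```

A dataset name that itself contains dots would defeat this simple split, so treat the result as a convenience rather than a guarantee.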
Top directory
Files in the top results directory are as follows:
- myDataset.otu_seqs.N.fasta: FASTA file with the representative sequences for the denovo OTUs, clustered at N%.
- myDataset.otu_seqs.dbOTU.fasta: FASTA file with the representative sequences for the distribution-based OTUs.
- myDataset.otu_table.N.denovo: OTU table with N% denovo OTUs labeled denovo1, denovo2, … in the rows and samples in the columns.
- myDataset.otu_table.N.dbOTU: OTU table with distribution-based OTUs labeled dbotu1, dbotu2, … in the rows and samples in the columns.
- myDataset.otu_table.N.denovo_oligotypes: OTU table with N% denovo OTUs separated into unique oligotypes. Each OTU is labeled denovo1.1, denovo1.2, denovo2.1, denovo2.2, denovo2.3, etc. The first number corresponds to the parent denovo OTU number; the second is the oligotype number. Oligotypes are calculated as each unique sequence within an OTU cluster.
- myDataset.raw_dereplicated.fasta: FASTA file with all unique sequences in the dataset. Only sequences that appear more times than the MIN_COUNT specified in the summary file are included (the default value for MIN_COUNT is 10).
- summary_file.txt: Updated summary file containing the original processing request and the resulting file names.
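The oligotype labels in the denovo_oligotypes table (denovo1.1, denovo1.2, denovo2.1, and so on) encode the parent OTU and the oligotype number. A minimal sketch of recovering that relationship from a list of row labels is shown below; the helper names are hypothetical, and the sketch assumes only the label format described in the list above, not any particular table file layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_oligotype_label(label: str) -> Tuple[str, int]:
    """Split 'denovo2.3' into the parent OTU 'denovo2' and oligotype number 3."""
    parent, sep, oligo = label.rpartition(".")
    if not sep or not parent or not oligo.isdigit():
        raise ValueError(f"not an oligotype label: {label!r}")
    return parent, int(oligo)


def group_by_parent(labels: Iterable[str]) -> Dict[str, List[int]]:
    """Collect oligotype numbers under their parent denovo OTU."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for label in labels:
        parent, oligo = split_oligotype_label(label)
        groups[parent].append(oligo)
    return dict(groups)


labels = ["denovo1.1", "denovo1.2", "denovo2.1", "denovo2.2", "denovo2.3"]
print(group_by_parent(labels))
# {'denovo1': [1, 2], 'denovo2': [1, 2, 3]}
```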
Also within each dataset’s processing_results directory, three sub-directories are created:
Quality control
Within the results folder, there is a subfolder called quality_control, which contains various diagnostic plots of dataset quality. Currently, the pipeline outputs:
- Histogram showing distribution of read lengths, taken from the first 100,000 reads in the raw FASTQ file.
- Bar chart showing number of reads per sample.
- File showing percentage of reads thrown out at each processing step (processing_summary.txt).
Note that the information in these files is not always accurate, so you should do more thorough quality control yourself.
RDP
This folder contains the RDP-assigned OTU tables. It has two files:
- myDataset.otu_table.N.denovo.rdp_assigned: OTU table with denovo OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
  k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species; d__denovoID
  where the denovoID corresponds to the respective sequence in ../myDataset.otu_seqs.N.fasta.
- myDataset.otu_table.dbOTU.rdp_assigned: OTU table with distribution-based OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
  k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species; d__dbotuID
  where the dbotuID corresponds to the respective sequence in ../myDataset.otu_seqs.dbOTU.fasta.
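The RDP-assigned OTU names pack the taxonomy and the OTU ID into one semicolon-separated string. As a rough sketch, such a name can be broken into a rank-to-value mapping as follows. The function name is hypothetical, and the example string is invented for illustration; in particular, how the pipeline writes unassigned ranks is an assumption here (shown as an empty value after the prefix).

```python
from typing import Dict


def parse_rdp_name(name: str) -> Dict[str, str]:
    """Map each rank prefix (k, p, c, o, f, g, s, d) to its value.

    Fields are separated by ';' and each field looks like 'prefix__value'.
    Whitespace around a field, such as the space before 'd__', is ignored.
    """
    fields: Dict[str, str] = {}
    for part in name.split(";"):
        part = part.strip()
        if not part:
            continue
        prefix, sep, value = part.partition("__")
        if not sep:
            raise ValueError(f"field without '__' separator: {part!r}")
        fields[prefix] = value
    return fields


# Invented example name; the taxa and the empty species are assumptions.
example = ("k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;"
           "f__Lachnospiraceae;g__Blautia;s__; d__denovo12")
taxonomy = parse_rdp_name(example)
print(taxonomy["g"], taxonomy["d"])
```

The value under "d" is the ID to look up in the matching FASTA file of representative sequences.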
GG
This folder contains the Green Genes-assigned OTU table. It has multiple files, which are described further in the following section.
- myDataset.otu_table.N.gg.consensusM: closed-reference OTU table with dereplicated sequences mapped to Green Genes using usearch. OTUs are in rows and samples are in columns. OTU names are of the format:
  k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species; d__derepID--GGggid
  where the derepID corresponds to the sequence in ../myDataset.raw_dereplicated.fasta with the same ID number.
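To look up a Green Genes-assigned OTU in the dereplicated FASTA file, the final d__ field has to be split at the "--" separator. The sketch below does only that; the function name is hypothetical, the example name is invented, and the part after "--" is returned unchanged because the exact shape of the Green Genes identifier is not specified here.

```python
from typing import Tuple


def split_gg_otu_id(name: str) -> Tuple[str, str]:
    """Return (derepID, Green Genes part) from the trailing d__ field."""
    last_field = name.rsplit(";", 1)[-1].strip()
    prefix, sep, value = last_field.partition("__")
    if prefix != "d" or not sep:
        raise ValueError(f"name does not end with a d__ field: {name!r}")
    derep_id, sep, gg_part = value.partition("--")
    if not sep:
        raise ValueError(f"d__ field has no '--' separator: {last_field!r}")
    return derep_id, gg_part


# Invented example; the IDs are placeholders, not real pipeline output.
example = ("k__kingdom;p__phylum;c__class;o__order;"
           "f__family;g__genus;s__species; d__derep5--GG1234")
derep_id, gg_part = split_gg_otu_id(example)
print(derep_id, gg_part)
```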