Pipeline output files and directories

The pipeline outputs several OTU tables and their corresponding representative sequences. All final outputs can be found in the processing_results folder, under a sub-directory labeled myDataset_results. Files in this directory are labeled systematically, usually in the format myDataset.file_description.otu_similarity.file_type.
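
The naming pattern above can be sketched in a few lines of Python. This is only an illustration: the helper names `results_dir` and `output_name`, the `root` argument, and the example similarity value 97 are assumptions made here, not part of the pipeline.

```python
from pathlib import Path


def results_dir(root, dataset):
    """Return the assumed results location: <root>/processing_results/<dataset>_results."""
    return Path(root) / "processing_results" / f"{dataset}_results"


def output_name(dataset, description, similarity, file_type):
    """Join the four dot-separated fields of the usual naming pattern."""
    return ".".join([dataset, description, similarity, file_type])


# Hypothetical example: an OTU table clustered at 97% similarity.
print(output_name("myDataset", "otu_table", "97", "denovo"))
# prints: myDataset.otu_table.97.denovo
```

Not every file follows the four-field pattern (for example summary_file.txt), so treat a helper like this as a convenience rather than a rule.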

Top directory

Files in the top results directory are as follows:

myDataset.otu_seqs.N.fasta: FASTA file with the representative sequences for the denovo OTUs, clustered at N%.
myDataset.otu_seqs.dbOTU.fasta: FASTA file with the representative sequences for the distribution-based OTUs.
myDataset.otu_table.N.denovo: OTU table with N% denovo OTUs labeled denovo1, denovo2, … in the rows and samples in the columns.
myDataset.otu_table.N.dbOTU: OTU table with distribution-based OTUs labeled dbotu1, dbotu2, … in the rows and samples in the columns.
myDataset.otu_table.N.denovo_oligotypes: OTU table with N% denovo OTUs separated into unique oligotypes. Each OTU is labeled denovo1.1, denovo1.2, denovo2.1, denovo2.2, denovo2.3, etc. The first number is the parent denovo OTU number; the second is the oligotype number. Each unique sequence within an OTU cluster is treated as a separate oligotype.
myDataset.raw_dereplicated.fasta: FASTA file with all unique sequences in the dataset. Only sequences that appear more than MIN_COUNT times are included; MIN_COUNT is set in the summary file (default: 10).
summary_file.txt: Updated summary file containing the original processing request and the resulting file names.
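
The oligotype labels in the denovo_oligotypes table can be mapped back to their parent OTUs with a small helper. The sketch below is illustrative only: the function names are made up here, and it assumes labels follow the denovoX.Y form described above exactly.

```python
import re
from typing import Dict, List, Tuple

# Matches labels such as "denovo2.3": parent OTU number, then oligotype number.
OLIGOTYPE_LABEL = re.compile(r"^denovo(\d+)\.(\d+)$")


def split_oligotype_label(label: str) -> Tuple[int, int]:
    """Return (parent_otu_number, oligotype_number) for a label like 'denovo2.3'."""
    match = OLIGOTYPE_LABEL.match(label)
    if match is None:
        raise ValueError(f"not an oligotype label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def group_by_parent(labels: List[str]) -> Dict[str, List[str]]:
    """Map each parent OTU name (e.g. 'denovo2') to its oligotype labels."""
    groups: Dict[str, List[str]] = {}
    for label in labels:
        parent, _ = split_oligotype_label(label)
        groups.setdefault(f"denovo{parent}", []).append(label)
    return groups


print(split_oligotype_label("denovo2.3"))
# prints: (2, 3)
```

Grouping by parent makes it straightforward to sum oligotype rows back into their denovo OTU if that is needed for comparison with the plain denovo table.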

Also within each dataset’s processing_results directory, three sub-directories are created:

Quality control

Within the results folder, the quality_control subfolder contains diagnostic plots of dataset quality. Currently, the pipeline outputs:

Histogram showing distribution of read lengths, taken from the first 100,000 reads in the raw FASTQ file.
Bar chart showing number of reads per sample.
File showing percentage of reads thrown out at each processing step (processing_summary.txt).

Note that the information in these files is not always accurate; treat it as a first look and run more thorough quality control yourself.
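
As one starting point for your own checks, the read-length distribution can be recomputed directly from the raw FASTQ file. The sketch below is not the pipeline's code. It assumes the common four-line FASTQ layout (header, sequence, `+`, quality), and the function name `read_length_counts` is made up for this example.

```python
from collections import Counter
from itertools import islice


def read_length_counts(fastq_path, max_reads=100_000):
    """Count read lengths among the first `max_reads` records of a FASTQ file.

    Assumes four lines per record; multi-line records are not handled.
    """
    counts = Counter()
    with open(fastq_path) as handle:
        # Line index 1, 5, 9, ... of the file holds the sequence of each record.
        sequence_lines = islice(handle, 1, None, 4)
        for seq in islice(sequence_lines, max_reads):
            counts[len(seq.strip())] += 1
    return counts
```

Passing `max_reads=None` scans the whole file, which shows whether the first 100,000 reads are representative of the rest.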

RDP

This folder contains the RDP-assigned OTU tables. It has two files:

myDataset.otu_table.N.denovo.rdp_assigned: OTU table with denovo OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
```
k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species;d__denovoID
```
where the denovoID corresponds to the respective sequence in ../myDataset.otu_seqs.N.fasta.
myDataset.otu_table.dbOTU.rdp_assigned: OTU table with distribution-based OTUs assigned Latin names with RDP. OTUs are in rows and samples are in columns. OTU names are of the format:
```
k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species;d__dbotuID
```
where the dbotuID corresponds to the respective sequence in ../myDataset.otu_seqs.dbOTU.fasta.
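
The RDP-assigned OTU names can be split into their taxonomic ranks and sequence ID with a short parser. This is a minimal sketch, reading the name as one semicolon-separated string of prefix__value fields; the function name `parse_otu_name` is an assumption for illustration.

```python
def parse_otu_name(name):
    """Split an OTU name into a dict keyed by its rank prefix (k, p, c, o, f, g, s, d)."""
    fields = {}
    for part in name.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing ';' or stray whitespace
        prefix, sep, value = part.partition("__")
        if not sep:
            raise ValueError(f"field without '__' separator: {part!r}")
        fields[prefix] = value
    return fields


# Using the placeholder name from the format description above.
template = "k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species;d__denovoID"
print(parse_otu_name(template)["d"])
# prints: denovoID
```

The value under the `d` key is the ID to look up in the matching representative-sequence FASTA file.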

GG

This folder contains the Greengenes-assigned OTU tables. It has multiple files, described below.

myDataset.otu_table.N.gg.consensusM: Closed-reference OTU table with dereplicated sequences mapped to Greengenes using usearch. OTUs are in rows and samples are in columns. OTU names are of the format:
```
k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species;d__derepID--GGggid
```
where the derepID corresponds to the sequence in ../myDataset.raw_dereplicated.fasta with the same ID number.
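
To recover the dereplicated sequence ID from one of these names, the final `d__` field can be split at the `--` separator. The sketch below assumes only what the format string shows: the last field starts with `d__` and contains one `--`. It does not interpret the second half (for instance, whether `GG` is a literal prefix), and the function name `split_gg_label` is made up here.

```python
def split_gg_label(name):
    """Return (derep_part, gg_part) from the final 'd__<derep>--<gg>' field of an OTU name."""
    # Drop surrounding whitespace and any trailing ';' before taking the last field.
    last_field = name.strip().rstrip(";").split(";")[-1].strip()
    if not last_field.startswith("d__"):
        raise ValueError(f"last field does not start with 'd__': {last_field!r}")
    derep_part, sep, gg_part = last_field[len("d__"):].partition("--")
    if not sep:
        raise ValueError(f"no '--' separator in: {last_field!r}")
    return derep_part, gg_part


# Using the placeholder name from the format description above.
template = "k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species;d__derepID--GGggid"
print(split_gg_label(template))
# prints: ('derepID', 'GGggid')
```

The first element is the ID to match against ../myDataset.raw_dereplicated.fasta.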