Assembly, gene prediction, de-redundancy
Assembly
Assembly joins the short reads produced by sequencing into longer contiguous sequences (contigs) according to a given algorithm. Reads from next-generation sequencing are generally short, and longer sequences improve the efficiency and accuracy of downstream analysis and make better use of the reads.
Assembly strategy
MEGAHIT [1] is an efficient assembler based on k-mers and a de Bruijn graph strategy. It copes well with the uneven sequencing depth across different regions of a genome (or across genomes from different species) that is typical of metagenomic sequencing.
Evaluating assembly quality
N50 is calculated by sorting the contigs/scaffolds from longest to shortest and accumulating their lengths. When the cumulative sum first reaches 50% of the total contig/scaffold length, the length of the contig/scaffold just added is the contig/scaffold N50. A larger N50 is generally taken to indicate a better (more contiguous) assembly.
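The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the pipeline: the function name `n50` and the contig lengths are made up for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig/scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to at least
        # 50% of the total length defines the N50.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), total = 300.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 80 + 70 = 150 reaches half of 300, so this prints 70
```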
Software | Version | Command |
---|---|---|
MEGAHIT [1] | 1.2.9 | megahit --min-count 2 --k-min 93 --k-max 133 --k-step 10 --no-mercy --min-contig-len 200 --continue |

Note: k-min and k-max depend on read length. For PE100, use --k-min=53 and --k-max=93; for PE150, use --k-min=93 and --k-max=133.
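As a rough sketch of how the note on k-min and k-max above could be applied, the helper below picks the k-mer range from the read length and assembles the command line. The function name `megahit_command`, the file names, and the output directory are hypothetical; `-1`, `-2`, and `-o` are MEGAHIT's paired-end input and output options, and the remaining flags are copied from the command shown above.

```python
import shlex

# k-mer ranges from the note above, keyed by paired-end read length (bp).
K_RANGES = {100: (53, 93), 150: (93, 133)}


def megahit_command(read_length, reads_1, reads_2, out_dir):
    """Build a MEGAHIT argument list for PE100 or PE150 data."""
    if read_length not in K_RANGES:
        raise ValueError(f"no k-mer range documented for PE{read_length}")
    k_min, k_max = K_RANGES[read_length]
    return [
        "megahit",
        "-1", reads_1, "-2", reads_2, "-o", out_dir,
        "--min-count", "2",
        "--k-min", str(k_min), "--k-max", str(k_max), "--k-step", "10",
        "--no-mercy", "--min-contig-len", "200", "--continue",
    ]


# Hypothetical sample files; print the command rather than running it.
cmd = megahit_command(150, "sample_1.fq.gz", "sample_2.fq.gz", "assembly_out")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where MEGAHIT is installed.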
Gene prediction
The signal sites of prokaryotic genes (such as promoters and terminators) are highly specific and easy to identify. We use MetaGeneMark [2] for de novo prediction of metagenomic genes. De novo prediction relies only on features of the sequence itself: it describes the statistical differences between coding and non-coding regions and builds a probabilistic model to tell them apart. As a result, it can predict both known and unknown genes.
Software | Version | Command |
---|---|---|
MetaGeneMark [2] | 3.38 | gmhmmp -a -d -f G -m MetaGeneMark_v1.mod |
Remove redundant genes
Redundancy must be removed from the gene predictions of each sample. CD-HIT [3] uses a greedy incremental clustering method. It first sorts the input sequences from longest to shortest. The longest sequence becomes the representative of the first cluster. Each remaining sequence is then compared with the representatives found before it: if it is similar enough to one of them (the identity threshold is generally set to 95% and the coverage threshold to 90%), it joins that cluster; otherwise it becomes the representative of a new cluster. Clustering is complete once every sequence has been processed.
Software | Version | Command |
---|---|---|
CD-HIT [3] | 4.8.1 | cd-hit-est -aS 0.9 -c 0.95 -d 0 -g 1 |
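The greedy incremental clustering described above can be illustrated with a toy Python sketch. It is not CD-HIT's implementation: the similarity function `toy_alignment` is a crude stand-in built on `difflib`, and all function names and sequences are made up. Identity and coverage are both measured against the shorter sequence, and each sequence joins the most similar representative that passes both thresholds, which is the behavior we assume for `-g 1`.

```python
from difflib import SequenceMatcher


def toy_alignment(short_seq, rep_seq):
    """Crude stand-in for an alignment: (identity, coverage) of short_seq vs rep_seq."""
    matcher = SequenceMatcher(None, short_seq, rep_seq, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    matched = sum(b.size for b in blocks)
    # Span of short_seq between the first and last matching block.
    span = blocks[-1].a + blocks[-1].size - blocks[0].a
    return matched / len(short_seq), span / len(short_seq)


def greedy_cluster(seqs, min_identity=0.95, min_coverage=0.9):
    """Cluster sequences longest-first; returns {representative: [members]}."""
    clusters = {}
    ordered = sorted(seqs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in ordered:
        best_rep, best_identity = None, 0.0
        for rep in clusters:
            identity, coverage = toy_alignment(seq, seqs[rep])
            passes = identity >= min_identity and coverage >= min_coverage
            if passes and identity > best_identity:
                best_rep, best_identity = rep, identity
        if best_rep is None:
            clusters[name] = [name]  # no match: start a new cluster
        else:
            clusters[best_rep].append(name)
    return clusters


# Made-up sequences: geneB is a truncated copy of geneA, geneC is unrelated.
GENE_A = "ATGGCTAGCTAGGCTAACGTTAGCTAGGATCCGATCGTAA"
genes = {
    "geneA": GENE_A,
    "geneB": GENE_A[:38],
    "geneC": "ATG" + "T" * 9 + "C" * 10 + "G" * 10 + "TAA",
}
print(greedy_cluster(genes))
```

In this example geneB falls into the cluster represented by geneA, while geneC starts its own cluster.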
Gene abundance determination
After constructing the non-redundant gene set, we use TPM (Transcripts Per Million) to measure the abundance of each gene. Unlike raw read counts, TPM is normalized for both gene length and sequencing depth. The calculation formula is as follows:

$$
\mathrm{TPM}_i = \frac{r_i / L_i}{\sum_j r_j / L_j} \times 10^6
$$

- $i$: the $i$-th gene (the sum over $j$ runs over all genes in the sample)
- $L_i$: length of the $i$-th gene
- $r_i$: number of reads aligned to the $i$-th gene
TPM calculation for a gene in a sample:
- Divide the number of reads aligned to the gene by the length of the gene (the length of the exon region, in kb) to get the number of reads per kilobase, that is, RPK (Reads Per Kilobase);
- Sum the RPK values of all genes in the sample and divide the total by 10^6 to get a per-sample scaling factor;
- Divide the gene's RPK by the scaling factor from the previous step to get its TPM.
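The three steps above can be written out as a short Python sketch. The function name `tpm`, the gene names, and the counts are hypothetical. Note that Salmon derives its TPM values from estimated read counts and effective gene lengths, so its output will not match this simplified calculation exactly.

```python
def tpm(read_counts, gene_lengths_bp):
    """Compute TPM per gene from aligned read counts and gene lengths (bp)."""
    # Step 1: reads per kilobase (RPK) for each gene.
    rpk = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000)
        for gene in read_counts
    }
    # Step 2: per-sample scaling factor = total RPK / 10^6.
    scaling = sum(rpk.values()) / 1e6
    if scaling == 0:
        raise ValueError("no reads aligned to any gene")
    # Step 3: divide each RPK by the scaling factor.
    return {gene: value / scaling for gene, value in rpk.items()}


# Made-up counts and lengths for three genes in one sample.
counts = {"gene1": 100, "gene2": 200, "gene3": 300}
lengths = {"gene1": 1000, "gene2": 2000, "gene3": 1000}
for gene, value in tpm(counts, lengths).items():
    print(gene, round(value))
```

Here gene1 and gene2 have the same RPK (100) despite different read counts, so they receive the same TPM, and the TPM values of the sample sum to 10^6.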
Gene abundance is obtained with Salmon [4].
Software | Version | Command |
---|---|---|
Salmon [4] | 1.6.0 | salmon quant -l A --validateMappings |
Info
Gene abundance tables are delivered with the Clean Data. Please note that the sample names in these tables are the names you supplied when delivering the samples, or the names you reconfirmed at the client manager's request. The file paths are:
- GeneAnalysis/Abundance/gene.relative.xls: relative abundance.
- GeneAnalysis/Abundance/gene.absolute.xls: absolute abundance.
FAQ
Q: Can TPM be used for comparisons between different samples?
A: Yes. By construction, the TPM values of all genes in a sample sum to the same total (10^6) in every sample. TPM is therefore similar to the relative abundance in the species annotation results and can be compared between different samples/groups.
References
[1] Li, D., Liu, C.-M., Luo, R., Sadakane, K., & Lam, T.-W. (2015). MEGAHIT: An Ultra-Fast Single-Node Solution for Large and Complex Metagenomics Assembly Via Succinct De Bruijn Graph. Bioinformatics, 31(10), 1674–1676. https://doi.org/10.1093/bioinformatics/btv033
[2] Zhu, W., Lomsadze, A., & Borodovsky, M. (2010). Ab Initio Gene Identification in Metagenomic Sequences. Nucleic Acids Research, 38(12), e132. https://doi.org/10.1093/nar/gkq275
[3] Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases. Bioinformatics (Oxford, England), 17(3), 282–283. https://doi.org/10.1093/bioinformatics/17.3.282
[4] Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197