Analysis Pipeline

Dr.Tom

Data Filtering

The sequencing data were filtered with SOAPnuke [1] by (1) removing reads containing sequencing adapters; (2) removing reads in which more than 20% of bases are low quality (base quality ≤ 15); and (3) removing reads in which more than 5% of bases are unknown ('N'). The remaining clean reads were stored in FASTQ format.
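Criteria (2) and (3) can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation: the function name `keep_read` and its parameters are hypothetical, Phred+33 quality encoding is assumed, and adapter detection (criterion 1) is left out because the adapter sequences are not given here.

```python
def keep_read(seq: str, qual: str, *, low_q: int = 15,
              max_low_q_ratio: float = 0.2, max_n_ratio: float = 0.05,
              phred_offset: int = 33) -> bool:
    """Return True if a read passes the low-quality and 'N' ratio checks.

    Assumes Phred+33 encoded quality strings (an assumption, not stated
    in the text). Adapter screening is not covered by this sketch.
    """
    if not seq or len(seq) != len(qual):
        return False  # empty or malformed record
    n = len(seq)
    # Bases whose quality is less than or equal to the cutoff (15).
    low_quality = sum(1 for ch in qual if ord(ch) - phred_offset <= low_q)
    # Unknown bases.
    unknown = sum(1 for base in seq if base in "Nn")
    # A read is removed only when a ratio is strictly MORE than its limit.
    return low_quality / n <= max_low_q_ratio and unknown / n <= max_n_ratio
```

The defaults mirror the thresholds stated above (quality ≤ 15, 20%, 5%); a read exactly at a limit is kept, since only reads above the limit are removed.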

Structural Variation Detection

The clean reads were mapped to the reference genome using HISAT2 [2]. EricScript (v0.5.5-5) [3] and rMATS (v4.1.1) [4] were then used to detect fusion genes and differentially spliced genes (DSGs), respectively.
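As a rough illustration of the mapping step, the sketch below assembles a paired-end HISAT2 command line using the parameters listed for HISAT2 in the software table. The helper name `hisat2_command` and all file names are hypothetical placeholders; paired-end input is an assumption inferred from the `--no-mixed` and `--no-discordant` options.

```python
import shlex


def hisat2_command(index_prefix: str, reads_1: str, reads_2: str,
                   out_sam: str, threads: int = 8) -> list[str]:
    """Build a HISAT2 argument list (hypothetical helper, paired-end assumed)."""
    return [
        "hisat2",
        # Options taken from the software table of this document.
        "--sensitive", "--no-discordant", "--no-mixed",
        "-I", "1", "-X", "1000",
        "-p", str(threads),
        # Standard HISAT2 input/output options; the paths are placeholders.
        "-x", index_prefix,
        "-1", reads_1,
        "-2", reads_2,
        "-S", out_sam,
    ]


if __name__ == "__main__":
    cmd = hisat2_command("ref_index", "sample_1.fq.gz", "sample_2.fq.gz", "sample.sam")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to `subprocess.run`.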

RNA Identification

Bowtie2 [5] was used to align the clean reads to the gene set.

Gene Quantification and Differential Expression Analysis

Gene expression levels were calculated with RSEM (v1.2.28) [6] to obtain read counts, FPKM, and TPM. Differential expression gene (DEG) analysis was performed using DESeq2 [7] (or DEGseq [8] or PoissonDis [9]) with a threshold of Q value ≤ 0.05. The DEG heatmap was drawn with pheatmap [10] according to the DEG analysis results.
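The relationship between read counts, FPKM, and TPM can be sketched as follows. This uses the textbook definitions with plain gene lengths; RSEM itself estimates expected counts and uses effective lengths, so its values will differ. The function name `fpkm_and_tpm` and the toy inputs are hypothetical.

```python
def fpkm_and_tpm(counts: dict[str, float], lengths: dict[str, float]):
    """Compute FPKM and TPM from per-gene counts and lengths in bases.

    Simplified sketch: uses raw gene length, not RSEM's effective length.
    """
    total = sum(counts.values())
    # Length-normalised rate per gene (fragments per base).
    rates = {g: counts[g] / lengths[g] for g in counts}
    rate_sum = sum(rates.values())
    if total == 0 or rate_sum == 0:
        zeros = {g: 0.0 for g in counts}
        return zeros, dict(zeros)
    # FPKM: fragments per kilobase of gene per million mapped fragments.
    fpkm = {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}
    # TPM: each gene's share of the summed rates, scaled to one million.
    tpm = {g: rates[g] / rate_sum * 1e6 for g in counts}
    return fpkm, tpm


if __name__ == "__main__":
    fpkm, tpm = fpkm_and_tpm({"geneA": 100, "geneB": 300},
                             {"geneA": 1000, "geneB": 3000})
    print(fpkm, tpm)
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.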

Gene Annotation

To gain insight into the phenotypic changes, GO (http://www.geneontology.org/) and KEGG (https://www.kegg.jp/) enrichment analysis of the annotated differentially expressed genes was performed with Phyper (https://en.wikipedia.org/wiki/Hypergeometric_distribution), based on the hypergeometric test. The significance levels of terms and pathways were corrected to Q values, with a rigorous threshold of Q value ≤ 0.05 [11].
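The hypergeometric enrichment test can be sketched in a few lines of Python. The function name `enrichment_p_value` and its argument names are hypothetical; the upper-tail probability computed here corresponds to R's `phyper(k - 1, K, N - K, n, lower.tail = FALSE)`. The Q-value correction step is not covered.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of annotated background genes
    K: background genes annotated to the term or pathway
    n: number of DEGs (the draw)
    k: DEGs annotated to the term or pathway
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k):
        raise ValueError("inconsistent gene counts")
    upper = min(n, K)  # at most this many hits are possible
    # Sum the probability mass for k, k+1, ..., upper hits.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 4 in the term, 3 DEGs, 2 of them in the term.
    print(enrichment_p_value(k=2, n=3, K=4, N=10))
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for large gene sets, at the cost of speed; a statistics library would be the usual choice in practice.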

Software

Software | Parameter | References | Source
SOAPnuke (v1.5.6) | -l 15 -q 0.2 -n 0.05 | [1] | https://github.com/BGI-flexlab/SOAPnuke
HISAT2 (v2.2.1) | --sensitive --no-discordant --no-mixed -I 1 -X 1000 -p 8 | [2] | http://www.ccb.jhu.edu/software/hisat
EricScript (v0.5.5-5) | Default | [3] | http://ericscript.sourceforge.net/
rMATS (v4.1.1) | Default | [4] | http://rnaseq-mats.sourceforge.net
Bowtie2 (v2.4.4) | -q --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 1 -k 200 | [5] | http://bowtie-bio.sourceforge.net/index.shtml
RSEM (v1.2.28) | --forward-prob 0 | [6] | http://deweylab.biostat.wisc.edu/rsem
DESeq2 | Default | [7] | http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
DEGseq | Default | [8] | http://bioinfo.au.tsinghua.edu.cn/software/degseq/
pheatmap | Default | [10] | https://cran.r-project.org/web/packages/pheatmap/
qvalue | Default | [11] | https://bioconductor.org/packages/release/bioc/html/qvalue.html

References


  1. Li R, Li Y, Kristiansen K, Wang J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics. 24(5):713-4.

  2. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357-360 (2015).

  3. Matteo Benelli, Chiara Pescucci, Giuseppina Marseglia, Marco Severgnini, Francesca Torricelli, Alberto Magi, Discovering chimeric transcripts in paired-end RNA-seq data by using EricScript, Bioinformatics, Volume 28, Issue 24, December 2012, Pages 3232–3239.

  4. Shen, S. et al. rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc. Natl Acad. Sci. USA 111, E5593-E5601 (2014).

  5. Langmead, B. et al. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-359 (2012).

  6. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).

  7. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  8. Wang L. et al. (2010). DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics, Jan 1;26(1):136-8.

  9. Audic, S. & Claverie, J. M. The significance of digital gene expression profiles. Genome Res. 7, 986-995 (1997).

  10. Raivo Kolde. Package ‘pheatmap’. 2019-01-04 13:50:12 UTC.

  11. Storey JD, Bass AJ, Dabney A, Robinson D (2021). qvalue: Q-value estimation for false discovery rate control. R package version 2.26.0, http://github.com/jdstorey/qvalue