Data Source
Reference gene
mRNA: collected from the NCBI RefSeq annotation or other databases.
lncRNA: collected from the NCBI RefSeq annotation or other databases, and from the RNAcentral database.
lncRNAs from the RNAcentral database have an RNA ID, but no gene ID or positional information is provided. The positional information is determined by aligning the sequences to the genome with BLAST. The gene ID is then determined in one of two ways. The lncRNA is first compared with the known RNAs from NCBI. If it overlaps a known RNA, the gene ID corresponding to that region is used. If it overlaps no known RNA, it is treated as a new gene and assigned a gene ID starting with ‘BGIG’.
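The overlap rule described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Interval` class, the same-strand requirement, and the zero-padded numbering after ‘BGIG’ are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Interval:
    """A genomic interval with 1-based, inclusive coordinates."""
    chrom: str
    start: int
    end: int
    strand: str


def overlaps(a, b):
    # Assumption: two features overlap only if they are on the same
    # chromosome and strand and their coordinate ranges intersect.
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def assign_gene_id(hit, known, counter, prefix="BGIG"):
    """Return the gene ID of the first known RNA overlapping `hit`,
    or a new ID with the given prefix if nothing overlaps.

    `known` is a list of (gene_id, Interval) pairs from the reference
    annotation; `counter` yields the numbers used for new IDs.
    The ID number format is hypothetical.
    """
    for gene_id, interval in known:
        if overlaps(hit, interval):
            return gene_id
    return f"{prefix}{next(counter):06d}"


known_rnas = [("GENE1", Interval("chr1", 100, 200, "+"))]
new_ids = count(1)
print(assign_gene_id(Interval("chr1", 150, 250, "+"), known_rnas, new_ids))
print(assign_gene_id(Interval("chr1", 300, 400, "+"), known_rnas, new_ids))
```

A linear scan is enough for a sketch; with a full annotation, an interval tree or a sorted-coordinate lookup would be the more practical choice.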
miRNA: collected from miRBase 22. Some miRNAs are additionally predicted from BGI internal data, using miRDeep for animals and miRDeep-P2 for plants. Each predicted miRNA is assigned a new gene ID starting with ‘novel’.
miRNA target gene prediction: several tools are used for prediction, and their results are filtered on criteria such as free energy and score. In general, RNAhybrid, miRanda, and TargetScan are used to predict animal target genes, and Tapir and TargetFinder are used to predict plant target genes. The default parameters of the target gene prediction software are as follows:
- miRanda: -en -20 -strict
- RNAhybrid: -b 100 -c -f 2,8 -m 100000 -v 3 -u 3 -e -20 -p 1 -s 3utr_human
- TargetScan: Default
- Tapir: --score 5 --mfe_ratio 0.6
- TargetFinder: -c 4
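As an illustration of combining results from several prediction tools with a free-energy filter, here is a minimal sketch. The text does not say how the tools' results are merged, so the rule used here (keep a miRNA–target pair supported by at least `min_tools` tools) is an assumption, as are the function name and the `(miRNA, target, energy)` tuple layout. The -20 cutoff mirrors the energy thresholds listed for miRanda and RNAhybrid.

```python
from collections import defaultdict


def combine_predictions(predictions, max_energy=-20.0, min_tools=2):
    """Merge per-tool target predictions.

    `predictions` maps a tool name to a list of
    (mirna, target, free_energy) tuples (hypothetical layout).
    A pair is kept if at least `min_tools` tools report it with a
    free energy at or below `max_energy` (assumed merging rule).
    Returns {(mirna, target): sorted list of supporting tools}.
    """
    support = defaultdict(set)
    for tool, hits in predictions.items():
        for mirna, target, energy in hits:
            if energy <= max_energy:
                support[(mirna, target)].add(tool)
    return {pair: sorted(tools)
            for pair, tools in support.items()
            if len(tools) >= min_tools}


example = {
    "miRanda": [("miR-1", "geneA", -25.0), ("miR-1", "geneB", -10.0)],
    "RNAhybrid": [("miR-1", "geneA", -22.0), ("miR-1", "geneB", -30.0)],
}
print(combine_predictions(example))
```

Setting `min_tools=1` gives the union of all filtered predictions, and setting it to the number of tools gives their intersection.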
circRNA: collected from circBase. The positional information is determined by aligning the sequences to the genome with BLAST.
Annotation
KEGG: the newest version is v102.0
GO: from three databases:
Annotation of UniProt proteins:
http://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz
Gene2GO from NCBI:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
idmapping from Gene Ontology (downloaded in May 2020):
ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz
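For readers who want to inspect one of these sources directly, the sketch below reads the NCBI gene2go file into a gene-to-GO-term mapping. It assumes the file's tab-separated layout with `tax_id`, `GeneID`, and `GO_ID` as the first three columns and a header line starting with `#`; the function names are made up for the example and this is not the pipeline's own loader.

```python
import gzip
from collections import defaultdict


def read_gene2go(lines, tax_id=None):
    """Build {GeneID: set of GO IDs} from gene2go lines.

    `lines` is any iterable of text lines. If `tax_id` is given
    (as a string, e.g. "9606"), only that species is kept.
    """
    gene_to_go = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        if tax_id is not None and fields[0] != tax_id:
            continue
        gene_to_go[fields[1]].add(fields[2])
    return dict(gene_to_go)


def load_gene2go(path, tax_id=None):
    """Read a local gzip-compressed copy of gene2go."""
    with gzip.open(path, "rt") as handle:
        return read_gene2go(handle, tax_id)
```

For example, `load_gene2go("gene2go.gz", tax_id="9606")` would return the mapping for human genes from a locally downloaded copy of the file.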
Transcription factors annotation (TF Desc):
Animal: http://bioinfo.life.hust.edu.cn/AnimalTFDB/#!/
Plant: http://planttfdb.gao-lab.org/
Transcription cofactors annotation (TF Cofactors Desc):
Animal: AnimalTFDB v3.0
MSigDB annotation: v7.1
http://software.broadinstitute.org/gsea/msigdb/
GenBank (GeneBank Desc): collected from NCBI
InterPro (InterPro Desc), Pfam (Pfam Desc), EggNOG (EggNOG Desc) annotation:
idmapping from Gene Ontology (downloaded in May 2020)
ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz
Reactome (Reactome Desc): extracted using the official mapping file NCBI2Reactome_PE_All_Levels.txt.
https://reactome.org/download-data (downloaded in June 2020)
CR2Cancer (CR2Cancer Desc): http://cis.hku.hk/CR2Cancer/
CellMarker (CellMarker Desc): http://biocc.hrbmu.edu.cn/CellMarker/