Species Distribution

Species distribution is a basic feature of microbial communities. Appropriate visualization displays the species composition of samples intuitively and makes it easier to compare samples or groups. Common visualization methods include stacked charts, abundance heatmaps, GraPhlAn charts, Circos charts, Krona charts, and pie charts.

Stacked Charts and Pie Charts

Stacked charts and pie charts are traditional data visualization tools. Pie charts can only display data for one sample/group, while stacked charts can display data for multiple samples/groups.

Abundance Heatmap

Heatmaps display the distribution of species across samples/groups. Because all species and all samples/groups are placed in one chart, differences between species and between samples are easy to compare.

GraPhlAn Chart

GraPhlAn [1] is a circular visualization for multi-level data. It displays hierarchical relationships as multiple rings, and heatmap information can be added on the outer ring, making it well suited to visualizing species classification.

Circos Chart

The Circos chart [2] (also called a chord diagram) displays groups and species on the same circle and connects species to groups, showing both the microbial composition of each group/sample and the abundance differences of microbes between samples/groups.

Krona Chart

The Krona chart [3] is a visualization of multi-level data that uses the Krona software to generate an interactive HTML result. Each ring in the Krona chart is a discrete pie chart. The HTML interaction lets users explore the data interactively.

FAQ

Q: How is species data processed for the distribution statistics?

A: Showing too many elements in a single image makes it crowded, and it becomes difficult to extract useful information. For this reason, the species abundance data used for plotting needs to be screened in one of two ways:

  • Classify to others: Keep the data that meets the filter criteria, and merge the data that does not meet them into an "others" category.
  • Filter: Keep the data that meets the filter criteria, and discard the data that does not meet them.

Filter criteria:

  • The top N species by relative abundance
  • Species with a relative abundance greater than m

Tips

Both 'Classify to others' and 'Filter' use the parameters set in the analysis scheme. However, for display reasons, the maximum number of species shown is limited for some graphics. For details, please check the description on the corresponding page.
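The two screening modes and the two filter criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Dr.Tom implementation: the function names, the "Others" label, and the sample abundances are all hypothetical.

```python
# Minimal sketch of the screening step. All names and values are
# hypothetical; this is not the Dr.Tom implementation.

def top_n_species(abundance, n):
    """Criterion 1: names of the n species with the highest relative abundance."""
    ranked = sorted(abundance, key=abundance.get, reverse=True)
    return set(ranked[:n])

def above_threshold(abundance, m):
    """Criterion 2: names of species whose relative abundance is greater than m."""
    return {name for name, value in abundance.items() if value > m}

def classify_to_others(abundance, keep):
    """Keep the selected species and merge everything else into 'Others'."""
    kept = {name: value for name, value in abundance.items() if name in keep}
    others = sum(value for name, value in abundance.items() if name not in keep)
    if others > 0:
        kept["Others"] = others
    return kept

def filter_species(abundance, keep):
    """Keep the selected species and discard everything else."""
    return {name: value for name, value in abundance.items() if name in keep}

# Hypothetical relative abundances for one sample (they sum to 1).
sample = {
    "Bacteroides": 0.40,
    "Prevotella": 0.30,
    "Faecalibacterium": 0.15,
    "Roseburia": 0.10,
    "Blautia": 0.05,
}

keep = top_n_species(sample, 2)
merged = classify_to_others(sample, keep)   # two taxa plus an "Others" entry
filtered = filter_species(sample, keep)     # two taxa only
```

With 'Classify to others' the plotted values still sum to the original total, whereas with 'Filter' the discarded fraction simply disappears from the chart.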

Q: What are the clustering distance and clustering method of the heatmap? How do different clustering distances and clustering methods differ? What are the commonly used clustering distances and clustering methods?

A: The purpose of clustering is to identify discontinuous subsets of objects, that is, to partition the data into groups. The result of microbial clustering is a hierarchical clustering tree with a nested structure. Most clustering is based on distance. The Dr.Tom system provides six clustering distances: euclidean, maximum, manhattan, canberra, binary, and correlation. Their formulas and differences are as follows:

| Method | Formula | Introduction |
| --- | --- | --- |
| euclidean | $d_{euc}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$ | Euclidean distance: the square root of the sum of the squared differences between the coordinates of objects in two groups. |
| maximum | $d_{che}(x,y) = \max_{i} \vert x_i - y_i \vert$ | Chebyshev distance: the maximum absolute difference between the coordinates of objects in two groups. |
| manhattan | $d_{man}(x,y) = \sum_{i=1}^{n} \vert x_i - y_i \vert$ | Manhattan distance: the sum of absolute differences. This distance can be used when the group has several data types, such as age, gender, and height. |
| canberra | $d_{can}(x,y) = \sum_{i=1}^{n} \frac{\vert x_i - y_i \vert}{\vert x_i \vert + \vert y_i \vert}$ | Canberra distance: this distance can be used when the samples are relatively similar. |
| binary | $d_{bin}(x,y) = 1 - \frac{a}{a+b+c}$ | Jaccard dissimilarity: $a$ is the number of species present in both samples; $b$ is the number of species present only in the first sample; $c$ is the number of species present only in the second sample. Species present in neither sample ($d$) do not enter the formula. |
| correlation | $r_{x,y} = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$ | Pearson correlation: used in correlation heatmaps. |
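As a worked illustration of the six formulas, here is a minimal Python sketch using only the standard library. The function names and the two abundance vectors are hypothetical, and skipping 0/0 terms in the Canberra sum is an assumption made here so the example runs on sparse data; the Dr.Tom system may handle these cases differently.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def maximum(x, y):
    # Chebyshev distance: largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Assumption: terms where both coordinates are zero (0/0) are skipped.
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def binary(x, y):
    # Jaccard dissimilarity on presence/absence: 1 - a / (a + b + c).
    both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    only_x = sum(1 for a, b in zip(x, y) if a > 0 and b == 0)
    only_y = sum(1 for a, b in zip(x, y) if a == 0 and b > 0)
    union = both + only_x + only_y
    return 1 - both / union if union else 0.0

def pearson(x, y):
    # Pearson correlation r = cov(x, y) / (sigma_x * sigma_y).
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical abundances of four species in two samples.
x = [3, 0, 4, 1]
y = [1, 2, 0, 1]

results = {
    "euclidean": euclidean(x, y),   # sqrt(24)
    "maximum": maximum(x, y),       # 4
    "manhattan": manhattan(x, y),   # 8
    "canberra": canberra(x, y),     # 2.5
    "binary": binary(x, y),         # 0.5
    "correlation": pearson(x, y),   # -4 / sqrt(20)
}
```

Note that the correlation entry is a similarity in the range [-1, 1], not a distance, so the same pair of samples can rank very differently depending on which measure is chosen.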

References:

  • Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
  • Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.
  • Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.

Q: What are the commonly used methods for heatmap clustering? What is the application of each method?

A: The system provides three types of methods: connection-based hierarchical clustering, average aggregation clustering, and minimum variance clustering.

  • Connection-based hierarchical clustering: objects are joined according to the shortest or longest distance between objects in two groups. It has two types: single linkage (shortest distance) and complete linkage (longest distance).
  • Average aggregation clustering: these methods differ in whether groups are weighted (whether the number of objects in each group is taken into account) and in how the distance is calculated (average distance: the mean distance between the object being added and the objects already in the group; centroid distance: the distance between the geometric centers of the groups). This gives four types: UPGMA, UPGMC, WPGMA, and WPGMC.
  • Minimum variance clustering: based on the least squares linear model, it minimizes the within-group sum of squares.

Each type includes at least two specific methods, as shown in the table below. Methods marked with an asterisk are commonly used in metagenomics.

| Type | Method | Feature |
| --- | --- | --- |
| Connection-based hierarchical clustering | single * | |
| | complete * | |
| Average aggregation clustering | UPGMA * | Arithmetic average - equal weights |
| | UPGMC | Centroid clustering - equal weights |
| | WPGMA | Arithmetic average - unequal weights |
| | WPGMC | Centroid clustering - unequal weights |
| Minimum variance clustering | ward.D | |
| | ward.D2 | |
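To make the difference between the starred methods concrete, the following is a naive Python sketch of agglomerative clustering with single, complete, and equal-weight average (UPGMA) linkage. It is an illustration only: the function name, the distance dictionary layout, and the four-sample distance matrix are hypothetical, and the centroid, weighted, and Ward variants are not covered.

```python
def agglomerate(names, dist, method="average"):
    """Naive agglomerative clustering (illustrative sketch).

    dist maps frozenset({a, b}) to the distance between objects a and b.
    Returns the merges in order as (cluster_1, cluster_2, height) tuples.
    """
    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        if method == "single":      # shortest distance between the groups
            return min(pairs)
        if method == "complete":    # longest distance between the groups
            return max(pairs)
        if method == "average":     # UPGMA: every object has equal weight
            return sum(pairs) / len(pairs)
        raise ValueError(f"unsupported method: {method}")

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical pairwise distances between four samples.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 1, frozenset(("A", "C")): 4,
    frozenset(("A", "D")): 6, frozenset(("B", "C")): 3,
    frozenset(("B", "D")): 7, frozenset(("C", "D")): 2,
}

heights = {
    method: [height for _, _, height in agglomerate(names, dist, method)]
    for method in ("single", "complete", "average")
}
```

In this example all three methods join A with B and C with D first; they differ only in the height at which the two pairs are finally merged (3 for single, 7 for complete, 5 for average), which is what changes the shape of the clustering tree.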

References

  1. Asnicar, F., Weingart, G., Tickle, T. L., Huttenhower, C., & Segata, N. (2015). Compact Graphical Representation of Phylogenetic Data and Metadata with GraPhlAn. PeerJ, 3, e1029. https://doi.org/10.7717/peerj.1029

  2. Krzywinski, M., Schein, J., Birol, İ., Connors, J., Gascoyne, R., Horsman, D., Jones, S. J., & Marra, M. A. (2009). Circos: An Information Aesthetic for Comparative Genomics. Genome Research, 19(9), 1639–1645. https://doi.org/10.1101/gr.092759.109

  3. Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive Metagenomic Visualization in a Web Browser. BMC Bioinformatics, 12, 385. https://doi.org/10.1186/1471-2105-12-385