Overview

The human microbiome, particularly in the large intestine, is gaining attention for its role in health and disease, including conditions like colorectal cancer and inflammatory bowel disease. Various factors that influence the human microbiome, including genetics and ethnicity, are often tied to geographic region. As a result, individuals from different global regions tend to have distinct gut microbiomes. While research has shown significant variation in the gut microbiome across countries and social groups, much of the microbiome literature has focused on a limited range of subjects, raising questions about the generalizability of these findings.

To address these limitations, researchers have turned to publicly available microbiome data, revealing an overrepresentation of high-income countries like the United States and Finland in major microbiome research repositories. However, significant variations from the Western baseline are observed when other communities are studied. This suggests that the current understanding of the microbiome may be limited due to geographic gaps in research.

To bridge this gap and enhance the field’s bioinformatic capabilities, the Human Microbiome Compendium is introduced. It comprises over 168,000 human gut microbiome samples from 68 countries, processed using advanced tools and made accessible in various formats, including the MicroBioMap R package and a dedicated website. This dataset is used to analyze global patterns in microbiome composition and identify areas where our knowledge of the human gut microbiome is lacking.

Primary data processing

Amplicon data were processed as follows. If the number of forward read files matched the number of reverse read files, the project was processed as paired-end data. If the counts did not match, or reverse reads were absent, the project was treated as single-end data. In both cases, we used DADA2 version 1.14.0 for data processing (B. J. Callahan et al. 2016).

We chose broad parameter settings intended to maximize the inclusion of samples while excluding low-quality data. We did not apply fixed trimming from either end or impose a maximum read length, and quality-based truncation of reads was disabled. Reads shorter than 20 nucleotides, reads containing ambiguous (“N”) base calls, and reads aligning to the phiX genome (typically present as a control in Illumina sequencing runs) were removed.

Paired-end reads were merged with a minimum overlap of 20 bases. In some cases merging failed, with few or no forward and reverse reads merged; this was often associated with sequencing strategies that produce non-overlapping reads. For studies in which fewer than 50 percent of forward reads merged successfully, we discarded the reverse reads rather than concatenating them, and reprocessed the project as single-end data. This decision aimed to prevent failures in merging due to low-quality calls or discrepancies between forward and reverse read files. When the number of forward reads did not match the number of reverse reads within a sample, we attempted to use DADA2 to identify the sequence identifier field in the FASTQ file, which allowed salvageable samples to be aligned. If this attempt failed, the reverse reads were removed and the data were reprocessed as single-end.

Taxonomic assignment was carried out by DADA2, using the SILVA database release 138.1 as a reference (Quast et al. 2013; McLaren and Callahan 2021).
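
The choice between paired-end and single-end processing described above can be summarized as a small rule. The following is a minimal sketch in base R, not the pipeline's actual code: the function name choose_read_mode and its arguments are hypothetical, and only the file-count comparison and the 50 percent merge threshold come from the text.

```r
# Hypothetical helper illustrating the decision rule described in the text.
# It does not call DADA2; it only encodes the paired/single-end choice.
choose_read_mode <- function(n_forward_files, n_reverse_files,
                             merged_fraction = NA_real_) {
  # No reverse reads, or a mismatch in file counts: treat as single-end.
  if (n_reverse_files == 0 || n_forward_files != n_reverse_files) {
    return("single")
  }
  # Paired files exist, but fewer than half of the forward reads merged:
  # drop the reverse reads and reprocess as single-end.
  if (!is.na(merged_fraction) && merged_fraction < 0.5) {
    return("single")
  }
  "paired"
}

choose_read_mode(10, 10)                         # file counts match
choose_read_mode(10, 0)                          # no reverse reads
choose_read_mode(10, 10, merged_fraction = 0.3)  # poor merge rate
```

The merge rate is only known after a first paired-end pass, which is why it is an optional argument here.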

Getting started

Installation

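# install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("blekhmanlab/MicroBioMap")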

Basic usage

library(MicroBioMap)
# this operation requires about 4GB of RAM
cpd <- getCompendium()
## [1] "Retrieving compendium version 1.1.0"
cpd
## class: TreeSummarizedExperiment 
## dim: 4680 168464 
## metadata(0):
## assays(1): counts
## rownames(4680):
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella
##   ...
##   Bacteria.Bacteroidota.Bacteroidia.Flavobacteriales.Flavobacteriaceae.Aurantiacicella
##   Archaea.Halobacteriota.Halobacteria.Halobacterales.Halobacteriaceae.Halocalculus
## rowData names(8): kingdom phylum ... species strain
## colnames(168464): PRJDB10485_DRR243823 PRJDB10485_DRR243824 ...
##   PRJNA764337_SRR15959700 PRJNA764337_SRR15959701
## colData names(11): srs project ... iso region
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Sample metadata

The colData slot of the cpd object contains the information about samples. We can examine the names of the available metadata:

names(colData(cpd))
##  [1] "srs"              "project"          "srr"              "library_strategy"
##  [5] "library_source"   "pubdate"          "total_bases"      "instrument"      
##  [9] "geo_loc_name"     "iso"              "region"

We can tabulate the samples by several of these metadata fields:

  • Region

    table(colData(cpd)$region)
    ## 
    ##            Australia/New Zealand        Central and Southern Asia 
    ##                             3916                             5602 
    ##   Eastern and South-Eastern Asia      Europe and Northern America 
    ##                            18461                           103564 
    ##  Latin America and the Caribbean Northern Africa and Western Asia 
    ##                             1215                             1557 
    ##                          Oceania               Sub-Saharan Africa 
    ##                                4                             5953 
    ##                          unknown 
    ##                            28192
  • Library strategy

    table(colData(cpd)$library_strategy)
    ## 
    ## AMPLICON 
    ##   168464
  • Instrument

    table(colData(cpd)$instrument)
    ## 
    ##                454 GS   Illumina HiSeq 1500   Illumina HiSeq 2000 
    ##                  8790                   793                  2447 
    ##   Illumina HiSeq 2500   Illumina HiSeq 3000   Illumina HiSeq 4000 
    ##                 10975                   701                  2651 
    ##      Illumina MiniSeq        Illumina MiSeq Illumina NovaSeq 6000 
    ##                   438                134353                   920 
    ##           NextSeq 500           NextSeq 550           unspecified 
    ##                   820                   315                  5261
  • Country

    head(sort(table(colData(cpd)$geo_loc_name), decreasing = TRUE))
    ## 
    ##        usa:new york                                     usa             missing 
    ##               14142               12880               12510                5532 
    ## denmark: copenhagen      not applicable 
    ##                5289                4317

Microbe (row) metadata

Each row in cpd represents one microbe. The rowData contains a data.frame-like set of metadata with columns kingdom, phylum, etc.

head(rowData(cpd))
## DataFrame with 6 rows and 8 columns
##                                                                                                kingdom
##                                                                                            <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter    Bacteria
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella         Bacteria
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia         Bacteria
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia      Bacteria
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                Bacteria
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                          Bacteria
##                                                                                                    phylum
##                                                                                               <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter Actinomycetota
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella      Actinomycetota
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia      Actinomycetota
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia   Actinomycetota
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                  Bacillota
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                            Bacillota
##                                                                                                     class
##                                                                                               <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter Coriobacteriia
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella      Coriobacteriia
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia      Coriobacteriia
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia   Coriobacteriia
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                    Bacilli
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                              Bacilli
##                                                                                                         order
##                                                                                                   <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter   Coriobacteriales
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella        Coriobacteriales
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia        Coriobacteriales
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia     Coriobacteriales
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella             Erysipelotrichales
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                       Erysipelotrichales
##                                                                                                         family
##                                                                                                    <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter        Atopobiaceae
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella        Coriobacteriaceae
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia          Eggerthellaceae
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia       Eggerthellaceae
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella             Erysipelotrichaceae
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                       Erysipelotrichaceae
##                                                                                                            genus
##                                                                                                      <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter Tractidigestivibacter
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella                Collinsella
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia              Adlercreutzia
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia        Senegalimassilia
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                      Holdemanella
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                                          NA
##                                                                                                species
##                                                                                            <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter          NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella               NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia               NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia            NA
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                      NA
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                                NA
##                                                                                                 strain
##                                                                                            <character>
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter          NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella               NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Adlercreutzia               NA
## Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Eggerthellaceae.Senegalimassilia            NA
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.Holdemanella                      NA
## Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA                                NA
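
The row names shown above appear to be the taxonomic ranks from kingdom to genus joined by periods. As a sketch under that assumption (and assuming no rank name itself contains a period), a row name can be split back into its ranks with base R. The helper name split_taxon_name is hypothetical and not part of MicroBioMap.

```r
# Hypothetical helper: split a compendium-style row name into named ranks.
split_taxon_name <- function(taxon_name) {
  ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")
  parts <- strsplit(taxon_name, ".", fixed = TRUE)[[1]]
  # Force exactly six elements: pads with NA if shorter, truncates if longer.
  length(parts) <- length(ranks)
  # Unassigned ranks appear as the literal string "NA" in the row names.
  parts[parts %in% "NA"] <- NA_character_
  setNames(parts, ranks)
}

split_taxon_name(
  "Bacteria.Bacillota.Bacilli.Erysipelotrichales.Erysipelotrichaceae.NA"
)
```

In practice the rowData columns already hold these values, so a helper like this is mainly useful when only the row names are at hand.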

Use cases

Samples in specific Bioprojects

To select samples that belong to specific Bioprojects, we can filter using the “project” metadata column in the colData.

head(unique(colData(cpd)$project))
## [1] "PRJDB10485" "PRJDB10527" "PRJDB10528" "PRJDB10612" "PRJDB11845"
## [6] "PRJDB11894"

For example, create a subset of data belonging to Bioprojects:

  • PRJDB10485
  • PRJDB10527
  • PRJDB10528
  • PRJDB10612
project_ids <- c("PRJDB10485", "PRJDB10527", "PRJDB10528", "PRJDB10612")
cpd_sub_by_project_ids <- cpd[, colData(cpd)$project %in% project_ids]
cpd_sub_by_project_ids
## class: TreeSummarizedExperiment 
## dim: 4680 637 
## metadata(0):
## assays(1): counts
## rownames(4680):
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella
##   ...
##   Bacteria.Bacteroidota.Bacteroidia.Flavobacteriales.Flavobacteriaceae.Aurantiacicella
##   Archaea.Halobacteriota.Halobacteria.Halobacterales.Halobacteriaceae.Halocalculus
## rowData names(8): kingdom phylum ... species strain
## colnames(637): PRJDB10485_DRR243823 PRJDB10485_DRR243824 ...
##   PRJDB10612_DRR254710 PRJDB10612_DRR254711
## colData names(11): srs project ... iso region
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Samples from a specific world region

To select a subset of samples from a specific world region, subset by the “columns” of the dataset. The following code summarizes the available regions and the number of samples in each region.

sort(table(colData(cpd)$region), decreasing = TRUE)
## 
##      Europe and Northern America                          unknown 
##                           103564                            28192 
##   Eastern and South-Eastern Asia               Sub-Saharan Africa 
##                            18461                             5953 
##        Central and Southern Asia            Australia/New Zealand 
##                             5602                             3916 
## Northern Africa and Western Asia  Latin America and the Caribbean 
##                             1557                             1215 
##                          Oceania 
##                                4

Now, select only those samples from, for example, “Sub-Saharan Africa”.

cpd_africa <- cpd[, colData(cpd)$region == "Sub-Saharan Africa"]
cpd_africa
## class: TreeSummarizedExperiment 
## dim: 4680 5953 
## metadata(0):
## assays(1): counts
## rownames(4680):
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Atopobiaceae.Tractidigestivibacter
##   Bacteria.Actinomycetota.Coriobacteriia.Coriobacteriales.Coriobacteriaceae.Collinsella
##   ...
##   Bacteria.Bacteroidota.Bacteroidia.Flavobacteriales.Flavobacteriaceae.Aurantiacicella
##   Archaea.Halobacteriota.Halobacteria.Halobacterales.Halobacteriaceae.Halocalculus
## rowData names(8): kingdom phylum ... species strain
## colnames(5953): PRJDB10485_DRR243823 PRJDB10485_DRR243824 ...
##   PRJNA706727_SRR13896130 PRJNA706727_SRR13896131
## colData names(11): srs project ... iso region
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

Samples from a specific country

To select a subset of samples from a specific country, subset by the “columns” of the dataset. The following code plots the available countries and the number of samples in each country.

library(forcats)
library(ggplot2)
ggplot(as.data.frame(colData(cpd)),
    aes(x = fct_infreq(iso))) +
    geom_bar(stat='count') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(x = "Country", y = "Number of samples") +
    scale_y_log10()
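
Selecting the samples from one country follows the same column-indexing pattern used for regions. The sketch below uses a small made-up count matrix and metadata table so that it runs without downloading the compendium. The sample names and values are invented, and the code "US" assumes that the iso column holds two-letter ISO codes, as the countrycode call in the next section suggests.

```r
# Toy stand-ins for the counts assay and colData; all values are invented.
toy_counts <- matrix(
  c(5, 0, 12, 3,
    0, 7,  1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("taxon_a", "taxon_b"), c("s1", "s2", "s3", "s4"))
)
toy_meta <- data.frame(
  iso = c("US", "DK", NA, "US"),
  row.names = colnames(toy_counts)
)

# which() drops NA comparisons, so samples with a missing code are excluded
# rather than producing NA column indices.
keep <- which(toy_meta$iso == "US")
toy_us <- toy_counts[, keep, drop = FALSE]
colnames(toy_us)
```

With the real object, the analogous call would presumably be cpd[, which(colData(cpd)$iso == "US")], again assuming the code is stored as "US".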

Enrich data with country names

Using the countrycode package, we can convert the iso codes to country names.

library(countrycode)
colData(cpd)$country <- countrycode(colData(cpd)$iso, "iso2c", "country.name")
## Warning: Some values were not matched unambiguously: UNKNOWN
# or add the iso3c code
colData(cpd)$iso3c <- countrycode(colData(cpd)$iso, "iso2c", "iso3c")
## Warning: Some values were not matched unambiguously: UNKNOWN

Histogram of total bases per sample

library(ggplot2)
ggplot(as.data.frame(colData(cpd)), aes(x = log10(total_bases))) +
    geom_histogram()

Samples filtered by presence of microbe

There are over 4000 microbes represented in the compendium. Some are fairly rare, and we may want to subset the compendium to only those samples that contain a specific microbe. Here, we pick an arbitrary exemplar to show just the mechanics.

# per-sample counts for one taxon (an arbitrary example)
taxon_counts <- counts(cpd)["Bacteria.Bacillota.Clostridia.Eubacteriales.Alkalibacteraceae.Alkalibaculum", ]

Examine the distribution of abundance across all samples in the compendium.

hist(log10(taxon_counts))

Picking an arbitrary threshold (after all, this is only an example), we can limit samples to those with plenty of our bacterium of interest present.

taxon_cpd <- cpd[, log10(taxon_counts) > 2]
dim(taxon_cpd)
## [1] 4680    0

With this taxon and threshold, no sample qualifies, so the subset has zero columns.
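
Because log10(0) is -Inf, samples with zero counts never pass a log-scale threshold, and a threshold of 2 keeps only samples with more than 100 reads of the taxon. The sketch below illustrates this on an invented count vector and shows a plain presence filter (count greater than zero) as an alternative. None of the values come from the compendium.

```r
# Invented per-sample counts for a single taxon.
toy_taxon_counts <- c(s1 = 0, s2 = 3, s3 = 150, s4 = 1000)

# Log-scale threshold: TRUE only where the count exceeds 100.
abundant <- names(toy_taxon_counts)[log10(toy_taxon_counts) > 2]

# Presence filter: TRUE wherever the taxon was observed at all.
present <- names(toy_taxon_counts)[toy_taxon_counts > 0]

abundant
present
```

Which filter is appropriate depends on the question; for a rare taxon, a presence filter may be the only one that retains any samples.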

sessionInfo

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] countrycode_1.6.0               ggplot2_3.5.1                  
##  [3] forcats_1.0.0                   MicroBioMap_0.99.13            
##  [5] ape_5.8-1                       TreeSummarizedExperiment_2.14.0
##  [7] Biostrings_2.74.1               XVector_0.46.0                 
##  [9] SingleCellExperiment_1.28.1     SummarizedExperiment_1.36.0    
## [11] Biobase_2.66.0                  GenomicRanges_1.58.0           
## [13] GenomeInfoDb_1.42.1             IRanges_2.40.1                 
## [15] S4Vectors_0.44.0                BiocGenerics_0.52.0            
## [17] MatrixGenerics_1.18.0           matrixStats_1.4.1              
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        farver_2.1.2            dplyr_1.1.4            
##  [4] blob_1.2.4              filelock_1.0.3          R.utils_2.12.3         
##  [7] fastmap_1.2.0           lazyeval_0.2.2          BiocFileCache_2.14.0   
## [10] digest_0.6.37           lifecycle_1.0.4         tidytree_0.4.6         
## [13] RSQLite_2.3.9           magrittr_2.0.3          compiler_4.4.2         
## [16] rlang_1.1.4             sass_0.4.9              tools_4.4.2            
## [19] yaml_2.3.10             data.table_1.16.4       knitr_1.49             
## [22] labeling_0.4.3          S4Arrays_1.6.0          htmlwidgets_1.6.4      
## [25] bit_4.5.0.1             curl_6.0.1              DelayedArray_0.32.0    
## [28] abind_1.4-8             BiocParallel_1.40.0     withr_3.0.2            
## [31] purrr_1.0.2             desc_1.4.3              R.oo_1.27.0            
## [34] grid_4.4.2              colorspace_2.1-1        scales_1.3.0           
## [37] cli_3.6.3               rmarkdown_2.29          crayon_1.5.3           
## [40] ragg_1.3.3              treeio_1.30.0           generics_0.1.3         
## [43] rstudioapi_0.17.1       httr_1.4.7              DBI_1.2.3              
## [46] cachem_1.1.0            zlibbioc_1.52.0         parallel_4.4.2         
## [49] vctrs_0.6.5             yulab.utils_0.1.8       Matrix_1.7-1           
## [52] jsonlite_1.8.9          bit64_4.5.2             systemfonts_1.1.0      
## [55] jquerylib_0.1.4         tidyr_1.3.1             glue_1.8.0             
## [58] pkgdown_2.1.1           codetools_0.2-20        gtable_0.3.6           
## [61] UCSC.utils_1.2.0        munsell_0.5.1           tibble_3.2.1           
## [64] pillar_1.10.0           htmltools_0.5.8.1       GenomeInfoDbData_1.2.13
## [67] R6_2.5.1                dbplyr_2.5.0            textshaping_0.4.1      
## [70] evaluate_1.0.1          lattice_0.22-6          R.methodsS3_1.8.2      
## [73] memoise_2.0.1           bslib_0.8.0             Rcpp_1.0.13-1          
## [76] SparseArray_1.6.0       nlme_3.1-166            xfun_0.49              
## [79] fs_1.6.5                pkgconfig_2.0.3