Gut & Microbiome | Research Paper | Open Access

AI Models Identify Key Bacteria That Define Healthy Oral Microbiome Niches

Random forest models using just five bacterial markers each accurately distinguish supragingival plaque, subgingival plaque, and saliva microbiomes in healthy adults.

Tuesday, June 30, 2026
Published in J Periodontol

Summary

Researchers analyzed 848 oral samples from 491 periodontally healthy adults, comparing the bacterial communities found in supragingival plaque, subgingival plaque, and saliva. Using 16S rRNA gene sequencing and random forest machine learning, they built predictive models that classify samples by oral niche using only five bacterial markers each. The two models distinguishing plaque from saliva achieved over 95% accuracy and AUC values of 0.986 and 0.992, while the model separating the two plaque types performed less well but remained useful (AUC 0.908). Genera such as Fusobacterium, Treponema, and Prevotella were markers of subgingival plaque, while Oribacterium and Solobacterium characterized saliva. These niche-specific microbial signatures could serve as biomarkers for diagnostics and oral health monitoring.

Detailed Summary

Understanding the healthy oral microbiome is a prerequisite for detecting dysbiosis linked to periodontal disease and systemic conditions. Yet most 16S rRNA sequencing studies have focused on individual oral niches with small sample sizes, and none had previously applied supervised machine learning to classify healthy samples across all three major oral compartments simultaneously. This study addressed that gap by aggregating publicly available Illumina V3–V4 sequence data from 22 bioprojects into a unified, multi-batch analysis of 848 samples from 491 periodontally healthy adults.

Samples included 210 supragingival plaque specimens, 155 subgingival plaque specimens, and 483 saliva specimens. The bioinformatics pipeline used mothur for ASV inference (grouping only 100% identical sequences) and an oral-specific, curated expansion of the Human Oral Microbiome Database for taxonomy. Batch effects across studies were corrected prior to differential abundance analysis. After quality filtering, 10,577 ASVs remained for analysis. Centered log-ratio (CLR) transformation was applied to account for the compositional nature of sequencing data, in which abundances are only meaningful relative to one another.
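As an illustration of the CLR step, here is a minimal sketch in plain Python. It is not the authors' pipeline: the function name `clr`, the example counts, and in particular the pseudocount used to handle zero counts are assumptions made for this example, since the summary does not say how zeros were treated.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's ASV counts.

    The pseudocount is an assumption of this sketch so that zero counts
    have a defined logarithm; the source does not describe zero handling.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

# Hypothetical counts for four ASVs in a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

Each CLR value expresses an ASV's abundance relative to the geometric mean of its sample, so the transformed values of a sample always sum to zero and no longer depend on sequencing depth (exactly so when no pseudocount is needed).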

Differential abundance analysis revealed 121 ASVs with significantly different abundances between supragingival and subgingival plaque (p < 0.01), 212 between supragingival plaque and saliva, and 160 between subgingival plaque and saliva. Despite the statistical significance, most plaque-versus-plaque differences involved taxa with small effect sizes, indicating the two plaque niches are more similar to each other than either is to saliva. PCA and PERMANOVA confirmed compositional clustering by niche, with the clearest separation between plaque and saliva.

Random forest models were trained on two-thirds of samples (supragingival n=140, subgingival n=104, saliva n=322) using 3-fold cross-validation, then tested on the remaining third. Each final model required only five ASVs. The supragingival-versus-subgingival model achieved AUC = 0.908, accuracy = 84.30%, sensitivity = 95.71%, and specificity = 68.63% on the test set. Both plaque-versus-saliva models performed substantially better: supragingival vs. saliva yielded AUC = 0.992, accuracy > 95%, sensitivity > 90%, and specificity > 95%; subgingival vs. saliva achieved AUC = 0.986 with similarly strong metrics. Predictive ASVs for subgingival plaque included species from Escherichia, Fusobacterium, Granulicatella, Treponema, Peptostreptococcaceae [XI][G-9], and Prevotella, while Oribacterium and Solobacterium were salivary markers.
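The evaluation protocol described here (a per-niche two-thirds training split, with AUC as the headline metric) can be sketched without any machine learning library. The code below is not the authors' code and does not train a random forest. The helper names `two_thirds_split` and `auc`, the random seed, the rounding-up rule, and the example scores are all assumptions for illustration; rounding up was chosen only because it reproduces the reported training counts.

```python
import random

def two_thirds_split(labels, seed=0):
    """Split sample indices into train/test within each class (stratified)."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Ceiling of 2n/3 in integer arithmetic. Rounding up is an
        # assumption; it reproduces the reported subgingival n=104.
        k = (2 * len(indices) + 2) // 3
        train.extend(indices[:k])
        test.extend(indices[k:])
    return train, test

def auc(scores, is_positive):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ["supragingival"] * 210 + ["subgingival"] * 155 + ["saliva"] * 483
train, test = two_thirds_split(labels)
train_counts = {c: sum(labels[i] == c for i in train) for c in set(labels)}
print(train_counts)  # per-niche training sizes
print(len(test))     # number of held-out samples

# Hypothetical classifier scores for two positives and two negatives.
print(auc([0.9, 0.4, 0.6, 0.1], [True, True, False, False]))
```

Read this way, an AUC of 0.908 means that a randomly chosen pair of test samples from the two plaque niches is ranked in the correct order by the model about 91% of the time.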

The main practical implication is that a minimal set of bacterial markers can reliably identify the oral niche of a sample in healthy individuals, laying groundwork for microbiome-based diagnostic tools. These niche-specific signatures may eventually enable early detection of microbial shifts preceding periodontal disease. A major caveat is that the included studies had heterogeneous metadata quality (59% were rated low quality) and only about 20% used the 2018 periodontal classification system, which may introduce inconsistency in how 'health' was defined across source datasets.

Key Findings

  • Random forest model distinguishing supragingival from subgingival plaque achieved AUC = 0.908, accuracy = 84.30%, sensitivity = 95.71%, and specificity = 68.63% using just 5 ASVs
  • Plaque-versus-saliva models performed even better: AUC = 0.992 (supragingival) and 0.986 (subgingival), each with accuracy > 95% and specificity > 95%
  • 121 ASVs showed differential abundance between supragingival and subgingival plaque (p < 0.01), but most had small effect sizes, indicating high similarity between plaque niches
  • 212 ASVs differed between supragingival plaque and saliva, and 160 between subgingival plaque and saliva (p < 0.01), reflecting greater compositional divergence from saliva
  • Fusobacterium, Treponema, Granulicatella, Prevotella, and Peptostreptococcaceae [XI][G-9] ASVs were identified as niche-specific markers of subgingival plaque in healthy subjects
  • Oribacterium and Solobacterium ASVs were identified as saliva-specific microbial signatures in periodontal health
  • 848 samples from 491 healthy adults across 22 bioprojects were analyzed, making this among the largest multi-batch 16S oral microbiome datasets assembled for periodontal health
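The metrics in the first finding can be cross-checked against the sample counts given in the Detailed Summary. The sketch below searches for the whole-number confusion matrix that reproduces the reported percentages. The individual cell counts it finds are not reported in the source; they are inferred here, under the assumption that supragingival plaque was treated as the positive class.

```python
# Test-set sizes follow from the totals and training counts in the summary.
n_supra_test = 210 - 140  # 70 supragingival test samples
n_sub_test = 155 - 104    # 51 subgingival test samples

# Assumption: supragingival is the positive class. Find the counts of
# correctly classified samples that match the reported percentages.
matches = [
    (tp, tn)
    for tp in range(n_supra_test + 1)
    for tn in range(n_sub_test + 1)
    if round(100 * tp / n_supra_test, 2) == 95.71
    and round(100 * tn / n_sub_test, 2) == 68.63
]
tp, tn = matches[0]
accuracy = round(100 * (tp + tn) / (n_supra_test + n_sub_test), 2)

print(matches)   # candidate (true positive, true negative) counts
print(accuracy)  # accuracy implied by those counts, in percent
```

Under these assumptions exactly one pair of counts fits, and the accuracy it implies agrees with the reported 84.30%, so the three metrics are mutually consistent. The comparatively low specificity corresponds to roughly a third of subgingival test samples being classified as supragingival.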

Methodology

This cross-sectional observational study aggregated publicly available Illumina V3–V4 16S rRNA sequences from 22 bioprojects into a dataset of 848 samples (210 supragingival, 155 subgingival, 483 saliva) from 491 periodontally healthy adults. Sequences were processed with mothur at ASV-level resolution, and taxonomy was assigned via an oral-specific curated database. Batch effects were corrected before differential abundance testing, which used the Mann–Whitney–Wilcoxon test with Benjamini–Hochberg correction and Cohen's d / Hedges' g effect size estimation. Random forest models were built using a genetic algorithm-initialized variable selection (sPLS-DA) on a 2/3 training set with 3-fold cross-validation and evaluated on a held-out 1/3 test set.
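Two of the statistical steps named above, the Benjamini–Hochberg correction and the Hedges' g effect size, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function names and all input values are hypothetical, the Mann–Whitney–Wilcoxon test that would produce the raw p-values is not implemented, and the small-sample correction uses the common approximation 1 - 3 / (4(n1 + n2) - 9).

```python
import math
from statistics import mean, variance

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def hedges_g(a, b):
    """Cohen's d with a pooled SD, scaled by a small-sample correction."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    d = (mean(a) - mean(b)) / math.sqrt(pooled_var)
    return d * (1 - 3 / (4 * (na + nb) - 9))

# Hypothetical raw p-values for four ASVs.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adjusted])

# Hypothetical CLR abundances of one ASV in two niches.
g = hedges_g([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(g, 3))
```

Reporting an effect size next to the adjusted p-value matters in a dataset this large, because even small abundance differences can reach significance; that is the pattern the summary describes for the plaque-versus-plaque comparison.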

Study Limitations

The majority of included bioprojects (59%) had low-quality metadata, and only about 20% used the current 2018 periodontal classification system, introducing heterogeneity in how periodontal health was defined. The cross-sectional design and use of pre-existing public data limit causal inference and control over confounding variables such as age, diet, smoking, and antibiotic use. No conflicts of interest were reported by the authors, and funding was from Instituto de Salud Carlos III (PI24/00222).
