MURAL - Maynooth University Research Archive Library



    uniForest: an unsupervised machine learning technique to detect outliers and restrict variance in microbiome studies


    Leigh, R.J. and Murphy, R.A. and Walsh, F. (2021) uniForest: an unsupervised machine learning technique to detect outliers and restrict variance in microbiome studies. Cold Spring Harbor perspectives in medicine. ISSN 2157-1422



    Abstract

    Isolation Forests is an unsupervised machine learning technique for detecting outliers in continuous datasets; it does not require an underlying equivariant or Gaussian distribution and is suitable for use on small datasets. While this procedure is widely used across quantitative fields, to our knowledge, this is the first attempt to solely assess its use for microbiome datasets. Here we present uniForest, an interactive Python notebook (which can be run from any desktop computer using the Google Colaboratory web service) for the processing of microbiome outliers. We used uniForest to apply Isolation Forests to the Healthy Human Microbiome project dataset, imputed outliers with the mean of the remaining inliers to maintain sample size, and assessed its performance in reducing variance in both community structure and derived ecological statistics (α-diversity). We also assessed its functionality in anatomical site differentiation (pre- and postprocessing) using principal component analysis, dissimilarity matrices, and ANOSIM. We observed a minimum variance reduction of 81.17% across the entire dataset and in alpha diversity at the Phylum level. Application of Isolation Forests also separated the dataset to an extremely high specificity, reducing variance within taxa samples by a minimum of 81.33%. It is evident that Isolation Forests are a potent tool in restricting the effect of variance in microbiome analysis and have potential for broad application in studies where high levels of microbiome variance are expected. This software allows for clean analyses of otherwise noisy datasets.
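
    The two steps named in the abstract (flagging outliers with an isolation forest, then replacing them with the mean of the remaining inliers) can be illustrated with a minimal sketch. This is not the uniForest notebook's code, which is not shown in this record. It is a small pure-Python, one-feature isolation forest written for illustration; the function names, the score threshold of 0.6, the tree count, and the example abundances are all assumptions.

```python
import math
import random
from statistics import mean

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): mean depth of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(sample, rng, depth, max_depth):
    """Recursively split a 1-D sample at uniformly random cut points."""
    if depth >= max_depth or len(sample) <= 1 or min(sample) == max(sample):
        return ("leaf", len(sample))
    cut = rng.uniform(min(sample), max(sample))
    left = [v for v in sample if v < cut]
    right = [v for v in sample if v >= cut]
    return ("node", cut,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(x, tree, depth=0):
    """Depth at which x lands in a leaf, plus c(leaf size) for unsplit points."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, cut, left, right = tree
    return path_length(x, left if x < cut else right, depth + 1)


def anomaly_scores(values, n_trees=100, subsample=256, seed=0):
    """Isolation-forest score per value: near 1 is anomalous, well below 0.5 is normal."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    psi = min(subsample, len(values))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(values, psi), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = average_path_length(psi)
    return [2.0 ** (-mean(path_length(x, t) for t in trees) / norm)
            for x in values]


def impute_outliers(values, threshold=0.6, **forest_kwargs):
    """Flag values scoring above `threshold` (an assumed cut-off) and replace
    them with the mean of the remaining inliers, keeping the sample size."""
    scores = anomaly_scores(values, **forest_kwargs)
    flags = [s > threshold for s in scores]
    inliers = [v for v, f in zip(values, flags) if not f]
    if not inliers:  # nothing left to average; return the data unchanged
        return list(values), flags
    fill = mean(inliers)
    return [fill if f else v for v, f in zip(values, flags)], flags


# Hypothetical abundances of one taxon across eleven samples.
abundances = [9.7, 9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 50.0]
cleaned, flags = impute_outliers(abundances)
print(flags)
print(cleaned)
```

    Because flagged values are replaced rather than dropped, the number of samples is unchanged, which matches the abstract's stated reason for imputing. In practice a library implementation (for example scikit-learn's `IsolationForest`) would normally be used instead of hand-written trees.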

    Item Type: Article
    Additional Information: Cite as: uniForest: an unsupervised machine learning technique to detect outliers and restrict variance in microbiome studies R.J. Leigh, R.A. Murphy, F. Walsh bioRxiv 2021.05.17.444491; doi: https://doi.org/10.1101/2021.05.17.444491
    Keywords: uniForest; outliers
    Academic Unit: Faculty of Science and Engineering > Biology
    Faculty of Science and Engineering > Research Institutes > Human Health Institute
    Item ID: 17327
    Identification Number: 10.1101/2021.05.17.444491
    Depositing User: Dr Robert Leigh
    Date Deposited: 15 Jun 2023 12:44
    Journal or Publication Title: Cold Spring Harbor perspectives in medicine
    Publisher: CSHL press
    Refereed: Yes
    URI:
    Use Licence: This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA).
