Eukaryotic genome discovery: Scalable and automated retrieval of eukaryotic metagenome-assembled genomes (MAGs) from a global-scale dataset

Harriet Alexander, Woods Hole Oceanographic Institution, Biology, Woods Hole, MA, United States and Sarah K Hu, Woods Hole Oceanographic Institution, Marine Chemistry and Geochemistry, Woods Hole, MA, United States
Abstract:
Protists, or eukaryotic microbes, are key players in marine ecosystems, encompassing primary producers, mixotrophs, and heterotrophs. Like their prokaryotic counterparts, many protists have evaded cultivation, which makes direct study of their biology in the lab challenging. Molecular and genomic approaches, particularly those applied to whole, mixed communities (e.g. metagenomics, metatranscriptomics), have shed light on the ecological roles, evolutionary histories, and physiological capabilities of these organisms. We developed a scalable and reproducible pipeline to facilitate the retrieval, taxonomic assignment, and annotation of eukaryotic metagenome-assembled genomes (MAGs) from mixed-community metagenomes. We applied this pipeline to metagenomic data from the Tara expedition protist-size fractions (0.8–2000 μm; encompassing more than 20 Tb of raw sequence data). This analysis yielded over 10,000 candidate MAGs, more than 100 of which were identified as high-quality eukaryotic MAGs (defined as having more than 50% of putative single-copy eukaryotic core genes present within the MAG). Notably, the eukaryotic MAGs recovered in this study were distinct from cultured reference data, with fewer than 10% of eukaryotic MAGs having an average protein identity >70% to available references. This finding highlights the computational and informatic limitations placed upon the study of eukaryotic microbes relative to their prokaryotic counterparts, and the need for better environmentally relevant eukaryotic genetic references (e.g. SAGs, genomes, or transcriptomes). Currently, we are leveraging these eukaryotic MAGs to derive insights into population-level variability and biogeography of keystone marine protists across the global ocean.
Accessible and scalable computational tools, such as the one described here, are likely to accelerate the identification of meaningful genetic signatures from large datasets and may facilitate the integration of these data with biogeochemical modelling efforts.
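
As an illustration of the quality criterion stated above (more than 50% of putative single-copy eukaryotic core genes present), the following minimal Python sketch filters a list of candidate MAGs by that fraction. It is not the pipeline itself: the names `CandidateMAG` and `select_high_quality`, the input layout, and the example counts are all hypothetical and chosen only to make the threshold concrete.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CandidateMAG:
    """Hypothetical record for one candidate bin (not the pipeline's own format)."""
    name: str
    core_genes_found: int   # single-copy core genes detected in the bin
    core_genes_total: int   # size of the core gene set searched for

    @property
    def completeness(self) -> float:
        """Fraction of the core gene set that was detected."""
        if self.core_genes_total <= 0:
            raise ValueError("core_genes_total must be positive")
        return self.core_genes_found / self.core_genes_total


def select_high_quality(
    mags: Iterable[CandidateMAG], min_completeness: float = 0.5
) -> List[CandidateMAG]:
    """Keep bins whose completeness is strictly greater than the threshold."""
    return [m for m in mags if m.completeness > min_completeness]


# Invented example values, for illustration only.
candidates = [
    CandidateMAG("bin_001", core_genes_found=80, core_genes_total=100),
    CandidateMAG("bin_002", core_genes_found=50, core_genes_total=100),
    CandidateMAG("bin_003", core_genes_found=12, core_genes_total=100),
]
high_quality = select_high_quality(candidates)
print([m.name for m in high_quality])
```

The comparison is strict, so a bin with exactly 50% of core genes present is excluded, matching the "more than 50%" wording; a different cutoff can be passed through `min_completeness`.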