Using t-Distributed Stochastic Neighbor Embedding (t-SNE) to Investigate Ecological Physiology of Zooplankton

Petra Lenz (1), Matthew C Cieslak (2), Ann M Castelfranco (1), Vittoria Roncalli (3) and Daniel K Hartline (1); (1) University of Hawaii at Manoa, Pacific Biosciences Research Center, Honolulu, HI, United States, (2) University of Hawaii, Pacific Biosciences Research Center, Honolulu, HI, United States, (3) Stazione zoologica A. Dohrn, Naples, Italy
Abstract:
High-throughput sequencing technologies are transforming the field of plankton ecology. Transcriptomic and metatranscriptomic studies are providing detailed assessments of physiological responses to environmental conditions. However, one current limitation of the technology is the complexity of the short-read sequencing data that it generates. New bioinformatics tools are needed to guide experimentation and sampling more efficiently, to better focus analysis, and to help develop and test hypotheses. One complexity-reducing tool that has been used successfully in other fields is "t-distributed Stochastic Neighbor Embedding" (t-SNE). Here, we test the potential value of this tool to zooplankton research by evaluating its application to RNA-Seq data that had been analyzed by conventional workflows in published studies on the copepods Calanus finmarchicus and Neocalanus flemingeri. By reducing high-dimensional gene expression data to two dimensions, t-SNE successfully clustered samples into groups by developmental stage, experimental treatment, or collection region. These results were validated by reference to the published studies, which used differential gene expression and gene ontology (GO) analyses. The t-SNE results demonstrate how individual samples can be evaluated for differences in global gene expression, as well as differences in expression related to specific biological processes, such as lipid metabolism and responses to stress. As RNA-Seq data from plankton species and communities become more common, t-SNE analysis should provide a powerful tool for determining trends and classifying samples into groups with similar transcriptional physiology, independent of collection site or time.
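
To illustrate the dimensionality reduction described above, the following is a minimal, standard-library Python sketch of t-SNE applied to a small synthetic expression matrix. It is not the workflow used in this study: the data (two hypothetical groups, "A" and "B", of six samples each, with 30 simulated log-expression values per sample), the function names, and all parameter values (perplexity, learning rate, iteration count) are assumptions chosen for illustration. A real analysis would more likely rely on an established implementation, such as scikit-learn's sklearn.manifold.TSNE, applied to normalized RNA-Seq expression values.

```python
import math
import random

def squared_distances(X):
    """Pairwise squared Euclidean distances between samples (rows of X)."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
    return D

def neighbor_probs(dists, i, perplexity, steps=50):
    """Conditional probabilities p(j|i). The Gaussian precision beta is found
    by bisection so that the row's entropy approaches log(perplexity)."""
    others = [j for j in range(len(dists)) if j != i]
    dmin = min(dists[j] for j in others)  # subtracted to avoid underflow
    beta, lo, hi = 1.0, 0.0, math.inf
    target = math.log(perplexity)
    for _ in range(steps):
        w = {j: math.exp(-beta * (dists[j] - dmin)) for j in others}
        total = sum(w.values())
        entropy = math.log(total) + beta * sum(
            w[j] * (dists[j] - dmin) for j in others) / total
        if abs(entropy - target) < 1e-5:
            break
        if entropy > target:   # distribution too flat: sharpen it
            lo = beta
            beta = beta * 2 if hi == math.inf else (beta + hi) / 2
        else:                  # distribution too peaked: flatten it
            hi = beta
            beta = (lo + beta) / 2
    row = [0.0] * len(dists)
    for j in others:
        row[j] = w[j] / total
    return row

def tsne(X, perplexity=4.0, iters=500, lr=1.0, momentum=0.5, seed=0):
    """Embed the rows of X in 2-D by gradient descent on the t-SNE objective."""
    n = len(X)
    D = squared_distances(X)
    cond = [neighbor_probs(D[i], i, perplexity) for i in range(n)]
    # Symmetrized joint probabilities in the high-dimensional space.
    P = [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    Y = [[rng.gauss(0, 1e-4), rng.gauss(0, 1e-4)] for _ in range(n)]
    V = [[0.0, 0.0] for _ in range(n)]
    for _ in range(iters):
        # Student-t similarities (one degree of freedom) in the 2-D map.
        num = [[0.0 if i == j else 1.0 / (1.0 + (Y[i][0] - Y[j][0]) ** 2
                                          + (Y[i][1] - Y[j][1]) ** 2)
                for j in range(n)] for i in range(n)]
        Z = sum(map(sum, num))
        grads = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i != j:
                    c = 4.0 * (P[i][j] - num[i][j] / Z) * num[i][j]
                    gx += c * (Y[i][0] - Y[j][0])
                    gy += c * (Y[i][1] - Y[j][1])
            grads.append((gx, gy))
        for i, (gx, gy) in enumerate(grads):
            V[i][0] = momentum * V[i][0] - lr * gx
            V[i][1] = momentum * V[i][1] - lr * gy
            Y[i][0] += V[i][0]
            Y[i][1] += V[i][1]
    return Y

def make_samples(seed=1):
    """Synthetic data: group B has 10 of 30 'genes' shifted upward (assumed)."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, shift in (("A", 0.0), ("B", 3.0)):
        for _ in range(6):
            samples.append([5.0 + rng.gauss(0, 0.3) + (shift if g < 10 else 0.0)
                            for g in range(30)])
            labels.append(label)
    return samples, labels

def mean_distance(Y, labels, same_group):
    """Mean 2-D distance over sample pairs within (or between) groups."""
    d = [math.hypot(Y[i][0] - Y[j][0], Y[i][1] - Y[j][1])
         for i in range(len(Y)) for j in range(i + 1, len(Y))
         if (labels[i] == labels[j]) == same_group]
    return sum(d) / len(d)

samples, labels = make_samples()
embedding = tsne(samples)
for label, (x, y) in zip(labels, embedding):
    print(label, round(x, 2), round(y, 2))
print("mean within-group distance:", mean_distance(embedding, labels, True))
print("mean between-group distance:", mean_distance(embedding, labels, False))
```

In this sketch, samples with similar simulated expression profiles should land near each other in the two-dimensional map, so the mean between-group distance is expected to exceed the mean within-group distance. The perplexity is kept below the number of samples per group because, with so few samples, a larger value would force each sample to treat members of the other group as neighbors.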