IN51A-1784
Natural Language Processing and Machine Learning (NLP/ML): Applying Advances in Biomedicine to the Earth Sciences
Friday, 18 December 2015
Poster Hall (Moscone South)
Ruth Duerr (1), Skatje Myers (2), Martha Palmer (3), Chris J Jenkins (4), Anne Thessen (1) and James Martin (2); (1) Ronin Institute for Independent Scholarship, Westminster, CO, United States, (2) University of Colorado at Boulder, Computer Science, Boulder, CO, United States, (3) University of Colorado at Boulder, Linguistics, Boulder, CO, United States, (4) Organization Not Listed, Washington, DC, United States
Abstract:
Semantics underlie many of the tools and services available from and on the web. From improving search results to enabling data mashups and other forms of interoperability, semantic technologies have proven themselves. But creating semantic resources, especially reusable ones, is extremely time-consuming and labor-intensive. Why? Because it is not just a matter of technology but also of obtaining rough consensus, if not full agreement, among community members on the meaning and order of things. One way to develop these resources in a more automated way would be to use NLP/ML techniques to extract them from large corpora of subject-specific text, such as peer-reviewed papers, where presumably a rough consensus has been achieved, at least about the basics of the discipline involved. Although these techniques are not generally applied in the Earth sciences, considerable resources have been spent on them in other fields, such as medicine, with some success. The NSF-funded ClearEarth project is applying the techniques developed for biomedicine to the cryosphere, geology, and biology in order to spur faster development of the semantic resources needed in these fields. The first area being addressed by the project is the cryosphere, specifically sea ice nomenclature, where an existing set of sea ice ontologies is being used as the “Gold Standard” against which to test and validate the NLP/ML techniques. The processes being used, lessons learned, and early results will be described.
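
As an illustration of what testing against a gold standard can look like, the sketch below scores a list of automatically extracted terms against a reference term list using precision, recall, and F1. The abstract does not state which metrics or matching rules the project uses, so the set-based exact matching, the `score_against_gold` helper, and the sample sea ice terms are all assumptions made for this example.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so 'Pancake  Ice' matches 'pancake ice'."""
    return " ".join(term.lower().split())


def score_against_gold(extracted, gold):
    """Return (precision, recall, f1) for extracted terms versus a gold-standard set.

    Hypothetical helper: matching is exact after normalization, which is an
    assumption of this sketch, not a description of the project's method.
    """
    extracted_set = {normalize(t) for t in extracted}
    gold_set = {normalize(t) for t in gold}
    true_positives = len(extracted_set & gold_set)

    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: stand-ins for ontology labels and extractor output.
gold_terms = ["pancake ice", "frazil ice", "fast ice", "polynya"]
extracted_terms = ["Pancake Ice", "frazil  ice", "iceberg", "polynya"]

precision, recall, f1 = score_against_gold(extracted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.75 f1=0.75
```

Exact matching keeps the example small; an evaluation against a real ontology would likely also need to handle synonyms, plural forms, and hierarchical relations between terms.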