IN33A-1793
Using an Integrated Naive Bayes Classifier for Crawling Relevant Data on the Web

Wednesday, 16 December 2015
Poster Hall (Moscone South)
Asitang Mishra, NASA Jet Propulsion Laboratory, Pasadena, CA, United States
Abstract:
In our experiments at NASA JPL for the DARPA Memex project, we wanted to crawl a large amount of data for various domains. A big challenge was the relevancy of the crawled data: more than 50% of it was irrelevant to the domain at hand. One immediate solution was to use good seeds (seeds are the initial URLs from which the crawler starts) and to make sure the crawl stays within the original seed hosts. Although very efficient, this technique fails under two conditions: first, when you aim to reach deeper into the web, into new hosts not in the seed list; and second, when a website hosts myriad content types, e.g. a news website.
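As an aside, a minimal sketch of that seed-host restriction, assuming a crawler that consults such a filter before following each URL (the class and method names here are hypothetical illustrations, not Nutch APIs):

    import java.net.URI;
    import java.util.Set;

    public class SeedHostFilter {
        private final Set<String> seedHosts; // hosts of the seed URLs, lowercased

        public SeedHostFilter(Set<String> seedHosts) {
            this.seedHosts = seedHosts;
        }

        // True only if the URL's host is one of the original seed hosts,
        // i.e. the crawl never leaves the hosts of the seed list.
        public boolean withinSeedHosts(String url) {
            try {
                String host = URI.create(url).getHost();
                return host != null && seedHosts.contains(host.toLowerCase());
            } catch (IllegalArgumentException e) {
                return false; // malformed URL: do not follow it
            }
        }
    }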

The relevancy calculation used to be a post-processing step, i.e. once we had finished crawling, we trained a Naive Bayes classifier and used it to estimate a rough relevancy for the web pages we had. Integrating the relevancy calculation into the crawl itself, rather than running it afterwards, was very important because crawling takes resources and time. To save both, we needed an idea of the relevancy of the whole crawl at run time, so that we could steer its course accordingly. We use Apache Nutch as the crawler; Nutch has a plugin system for incorporating new implementations, so we built a plugin for it.
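For illustration, a minimal sketch of the kind of two-class Naive Bayes text classifier described here, trained once from user-supplied positive and negative example documents; this is an assumption-laden sketch, not the plugin's actual code:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NaiveBayesRelevance {
        private final Map<String, Integer> posCounts = new HashMap<>();
        private final Map<String, Integer> negCounts = new HashMap<>();
        private int posTotal, negTotal, posDocs, negDocs;

        // Count token occurrences per class; call once per example document.
        // Assumes at least one positive and one negative example before use.
        public void train(List<String> tokens, boolean relevant) {
            Map<String, Integer> counts = relevant ? posCounts : negCounts;
            for (String t : tokens) counts.merge(t, 1, Integer::sum);
            if (relevant) { posTotal += tokens.size(); posDocs++; }
            else          { negTotal += tokens.size(); negDocs++; }
        }

        // Compare class log-likelihoods of a tokenized page.
        public boolean isRelevant(List<String> tokens) {
            int vocab = posCounts.size() + negCounts.size(); // rough vocabulary size
            double logPos = Math.log((double) posDocs / (posDocs + negDocs));
            double logNeg = Math.log((double) negDocs / (posDocs + negDocs));
            for (String t : tokens) {
                // Laplace smoothing keeps unseen words from zeroing a class out.
                logPos += Math.log((posCounts.getOrDefault(t, 0) + 1.0) / (posTotal + vocab));
                logNeg += Math.log((negCounts.getOrDefault(t, 0) + 1.0) / (negTotal + vocab));
            }
            return logPos >= logNeg;
        }
    }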

The Naive Bayes parse plugin works in the following way. It parses every page and decides, using a trained model (built in situ only once, from positive and negative examples given by the user in a very simple format), whether the page is relevant. If it is, all outlinks from that page pass to the next round of crawling; if not, the URLs get a second chance to prove themselves by being checked for words commonly expected in URLs of that domain. This two-tier system is intuitive and effective at focusing the crawl. In our initial test experiments over 100 seed URLs, the results were remarkably good, with a recall of 98%.
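A hedged sketch of that two-tier decision, reusing the NaiveBayesRelevance sketch above; the keyword list stands in for the "commonly expected words", and all names are illustrative:

    import java.util.List;
    import java.util.stream.Collectors;

    public class TwoTierOutlinkFilter {
        private final NaiveBayesRelevance model;
        private final List<String> domainKeywords; // e.g. "ice", "glacier" for a polar-data crawl

        public TwoTierOutlinkFilter(NaiveBayesRelevance model, List<String> domainKeywords) {
            this.model = model;
            this.domainKeywords = domainKeywords;
        }

        // Tier 1: if the page text is classified relevant, keep all its outlinks.
        // Tier 2: otherwise keep only outlinks whose URL contains a domain keyword.
        public List<String> filterOutlinks(List<String> pageTokens, List<String> outlinks) {
            if (model.isRelevant(pageTokens)) {
                return outlinks;
            }
            return outlinks.stream()
                           .filter(this::urlLooksRelevant)
                           .collect(Collectors.toList());
        }

        private boolean urlLooksRelevant(String url) {
            String lower = url.toLowerCase();
            return domainKeywords.stream().anyMatch(lower::contains);
        }
    }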

The same technique can be applied to geoinformatics, helping scientists gather data relevant to their specific domain. As a proof of concept, we also crawled nsidc.org and some similar websites and very efficiently kept the crawler from wandering into hub websites such as Yahoo, commercial/advertising portals, and irrelevant content pages.

This is a strong start toward focused crawling with Nutch, one of the most scalable and actively evolving crawlers available today.