Hydra - an open source processing framework
With Lucene and Solr, enterprises have found a search engine that serves their information needs. With projects like Apache Nutch for web crawling and Apache ManifoldCF for extracting data from other source systems, just one link is missing in the open source enterprise search chain: cleaning and enriching the data. Hydra is a Findwise initiative aimed at producing a world-class document processing framework that is both lightweight enough for a small search installation and scalable enough to deliver processing for very large ones. Hydra is designed to be scalable, robust, flexible and easy to use. This talk details the architecture and design of Hydra (based around MongoDB) and how it can be used to bridge the gap between source system and search engine. The talk will also discuss some of the possibilities that this new pipeline framework can offer, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and a proposed integration with Hadoop for Map/Reduce tasks such as PageRank calculations.
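To make the idea of a processing pipeline built around a central document store more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Hydra's actual API: the `InMemoryStore` class stands in for a database such as MongoDB, and the names `Document`, `claim`, `run_stage`, `strip_whitespace` and `add_language` are all hypothetical. Each stage repeatedly claims a document it has not yet processed, cleans or enriches it, and leaves it in the store for the next stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Document:
    """A record in the shared store; touched_by lists the stages that have run."""
    fields: Dict[str, str]
    touched_by: List[str] = field(default_factory=list)


class InMemoryStore:
    """Hypothetical stand-in for a central document store (not Hydra's API)."""

    def __init__(self) -> None:
        self.documents: List[Document] = []

    def add(self, fields: Dict[str, str]) -> None:
        self.documents.append(Document(dict(fields)))

    def claim(self, stage_name: str) -> Optional[Document]:
        """Return one document this stage has not processed yet, or None."""
        for doc in self.documents:
            if stage_name not in doc.touched_by:
                doc.touched_by.append(stage_name)
                return doc
        return None


def run_stage(store: InMemoryStore, name: str,
              transform: Callable[[Dict[str, str]], None]) -> int:
    """Apply transform to every document the named stage has not seen yet."""
    processed = 0
    while (doc := store.claim(name)) is not None:
        transform(doc.fields)
        processed += 1
    return processed


def strip_whitespace(fields: Dict[str, str]) -> None:
    """Cleaning step: trim leading and trailing whitespace from every value."""
    for key in list(fields):
        fields[key] = fields[key].strip()


def add_language(fields: Dict[str, str]) -> None:
    """Enrichment step: add a placeholder language field if none is present."""
    fields.setdefault("language", "unknown")
```

Because the stages in this sketch only communicate through the store, another worker running the same stage could in principle be started alongside the first to absorb a peak load. A real deployment would need the claim step to be atomic in the database, which this in-memory version does not attempt.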
Watch the video of Joel Westberg's talk here.