Nutch is open source web-search software. It builds on Lucene and Solr, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Nutch can run on a single machine but gains much of its strength from running on a Hadoop cluster. The system can be extended through its plugin mechanism, for example to parse additional document formats.
Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.
Hadoop is a top-level Apache project, built and used by a community of contributors from all over the world. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism for putting structure onto data, and the capability to query and analyse large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, for querying that data. QL also lets users familiar with the MapReduce framework plug in their own custom mappers and reducers to perform more sophisticated analysis that the built-in capabilities of the language may not support.
Pig is a platform for analysing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
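To make the Map-Reduce model mentioned above more concrete, here is a minimal single-process sketch in Python. It is not Pig or Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes rather than in one loop.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    # Each record could be mapped on a different node; here it is a plain loop.
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig compiles to map reduce", "map reduce runs on hadoop"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across machines, which is the property that lets such programs scale to very large data sets.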
Solr is the popular, blazing fast, open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling (e.g., Word, PDF). Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture for when more advanced customization is required.
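As a rough illustration of the HTTP API, the sketch below builds a query URL for Solr's select handler using only the Python standard library. The host, port, core name ("articles"), and field names are assumptions made up for the example, and build_select_url is a hypothetical helper, not part of any Solr client library. The query parameters themselves (q, rows, wt, hl, hl.fl, facet, facet.field) are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, highlight_field=None, facet_field=None):
    """Build a URL for Solr's /select handler that asks for a JSON response."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if highlight_field:
        # Ask Solr to return highlighted snippets for this field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    if facet_field:
        # Ask Solr to return facet counts for this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_select_url("http://localhost:8983/solr", "articles",
                           "title:hadoop", rows=5, facet_field="category"))
```

The resulting URL can be fetched with any HTTP client, which is why Solr is straightforward to call from almost any language.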
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform applications. In addition, Lucene has been widely recognized for its utility in implementing Internet search engines and local, single-site search.
At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of file format. Text from PDFs, HTML, MS Word, and OpenOffice documents, as well as many other formats, can be indexed as long as the textual information can be extracted.
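The document-with-fields idea can be sketched with a toy inverted index in Python. This is not Lucene's API: TinyIndex, add_document, and search are hypothetical names, tokenization is a naive whitespace split, and the field values stand in for text that some extractor has already pulled out of a PDF or HTML file.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index over documents made of named text fields."""

    def __init__(self):
        # (field, term) -> set of ids of documents containing the term in that field
        self.postings = defaultdict(set)
        self.documents = []

    def add_document(self, fields):
        """Index a document given as {field_name: extracted_text}."""
        doc_id = len(self.documents)
        self.documents.append(fields)
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return sorted ids of documents whose field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


if __name__ == "__main__":
    index = TinyIndex()
    index.add_document({"title": "Annual report", "body": "text extracted from a PDF"})
    index.add_document({"title": "Home page", "body": "text extracted from HTML"})
    print(index.search("body", "pdf"))
    print(index.search("body", "extracted"))
```

The index never sees the original file format, only field names and text, which is the sense in which the approach is format independent.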
We specialize in a range of ETL tools, including Talend Open Studio, Kettle, Pentaho, and Apatar. Talend is an open source data integration software vendor that produces several enterprise software products, including Talend Open Studio.
Talend’s products support a variety of enterprise-wide data integration and data quality needs, such as data warehousing, business intelligence, data migration, data synchronization, data consolidation, operational data integration, data profiling, data cleansing, and master data management.
We have in-house expertise in Talend and have written our own customized Talend plugins for purposes such as parsing large XML files with the XPP pull parser.
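The reason a pull or streaming parser suits large XML files is that it handles one element at a time instead of loading the whole document. The sketch below illustrates that idea with Python's standard-library iterparse as a stand-in. It is not XPP and not our Talend plugin, and the element name "record" and the function count_records are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Stream through an XML source and count <tag> elements as they complete."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            count += 1
            # Drop the element's children, text, and attributes once processed.
            element.clear()
    return count


if __name__ == "__main__":
    sample = io.BytesIO(b"<records><record id='1'/><record id='2'/><record id='3'/></records>")
    print(count_records(sample))
```

In a real job the counting step would be replaced by whatever per-record processing is needed, and the source would be a file path or file object rather than an in-memory sample.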