Pentaho and Hadoop

Visual Development for Hadoop Data Prep and Modeling

Pentaho’s visual development tools reduce the time needed to design, develop and deploy Hadoop analytics solutions by as much as 15x compared to traditional custom coding and ETL approaches.

This is made possible with the following capabilities:

  • A powerful visual user interface for ingesting and manipulating data within Hadoop, which also makes it easy to enrich Hadoop data with reference data from other sources.
  • The option of accessing Hadoop data either directly or through rapid visual extraction into data marts/warehouses optimized for fast response times.
  • A visual tool for defining business metadata models, which helps developers prepare their data for analytics.

Interactive Visualization and Exploration

Pentaho enables your IT department to share interactive data visualizations and exploration capabilities, allowing business users to build dashboards and run reports on their own. The Hadoop data integration and business analytics platform enables users to easily review data through:

  • Rich visualization – Interactive, web-based interfaces for ad hoc reporting, charting and dashboards.
  • Flexible exploration – View data through dimensions such as time, product and geography, and across measures like revenue and quantity.
  • Predictive analytics – Use advanced statistical algorithms such as classification, regression, clustering and association rules to spot trends or challenges ahead of time.
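
As a rough illustration of viewing measures across dimensions, the sketch below rolls up revenue and quantity by time, product and geography in plain Python. It does not use any Pentaho API; the sample records, the field names and the roll_up helper are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
SALES = [
    {"year": 2013, "product": "widget", "region": "EMEA", "revenue": 120.0, "quantity": 3},
    {"year": 2013, "product": "gadget", "region": "EMEA", "revenue": 80.0, "quantity": 2},
    {"year": 2014, "product": "widget", "region": "APAC", "revenue": 200.0, "quantity": 5},
    {"year": 2014, "product": "gadget", "region": "APAC", "revenue": 50.0, "quantity": 1},
]


def roll_up(records, dimension, measure):
    """Sum one measure for each distinct value of one dimension."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dimension]] += record[measure]
    return dict(totals)


# The same records viewed through three different dimensions.
print(roll_up(SALES, "year", "revenue"))
print(roll_up(SALES, "region", "revenue"))
print(roll_up(SALES, "product", "quantity"))
```

Switching the dimension or measure argument changes the view without touching the underlying records, which is the idea behind this kind of exploration.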

Pentaho Visual MapReduce: Scalable in-Hadoop Execution

Pentaho’s Java-based data integration engine works with the Hadoop cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, making use of the massive parallel processing power of Hadoop.

Pentaho can natively connect to Hadoop in the following ways:

  • HDFS – Input and output directly to the Hadoop Distributed File System.
  • MapReduce – Input and output directly to MapReduce programs.
  • HBase – Input and output directly to HBase, a NoSQL database optimized for use with Hadoop.
  • Hive – A JDBC driver that enables interaction with Hadoop via the Hive Query Language (HQL), a SQL-like query and data definition language (DDL).
  • YARN – Data integration jobs can make elastic use of Hadoop resources via YARN.

Multi-Threaded Engine for Faster Execution: The Pentaho Data Integration engine is multi-threaded, with each step in a job executing on one or more threads. Multi-core processors running on each data node of the cluster are fully leveraged, eliminating the need for specialized, multi-threaded programming techniques.

In addition, the Pentaho Data Integration engine executes as a single MapReduce task, instead of the multiple tasks typical of machine-generated or hand-coded programs and Pig scripts. As a result, Pentaho MapReduce jobs typically execute many times faster than other methods.

The table below compares performing common Hadoop tasks using traditional MapReduce programming skills and Pentaho’s visual interface:

MapReduce programming Vs. Pentaho’s visual interfaces for ingesting, manipulating and extracting data
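
To make the single-task point above more concrete, here is a minimal plain-Python analogy, not Pentaho or Hadoop code. It contrasts running each step as its own pass over the data (loosely comparable to chaining several tasks, each writing an intermediate dataset) with running all steps in one pass. The input format and every function name are assumptions made for this sketch.

```python
from collections import defaultdict


def parse(line):
    """Split a 'product, quantity' line into a (product, quantity) pair."""
    product, quantity = line.split(",")
    return product.strip(), int(quantity)


def keep_positive(pair):
    """Keep only records with a positive quantity."""
    return pair[1] > 0


def run_as_separate_passes(lines):
    """Each step is its own pass and materializes an intermediate list."""
    parsed = [parse(line) for line in lines]                     # pass 1
    filtered = [pair for pair in parsed if keep_positive(pair)]  # pass 2
    totals = defaultdict(int)
    for product, quantity in filtered:                           # pass 3
        totals[product] += quantity
    return dict(totals)


def run_as_single_pass(lines):
    """All steps are applied per record in one pass, with no intermediates."""
    totals = defaultdict(int)
    for line in lines:
        pair = parse(line)
        if keep_positive(pair):
            totals[pair[0]] += pair[1]
    return dict(totals)


LINES = ["widget, 3", "gadget, 0", "widget, 5", "gadget, 2"]
print(run_as_separate_passes(LINES) == run_as_single_pass(LINES))
```

Both functions produce the same totals; the difference is only in how many times the data is traversed and stored along the way, which on a cluster translates into fewer task launches and less intermediate I/O.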

Intuitive Data Orchestration Interface: Data orchestration is made easy via Pentaho’s library of graphical job flow components, which include conditional checking, event waiting, execution and notification.

Pentaho's Graphical Job Flow Components

Together, these components can be combined to enable visual assembly of powerful job flow logic across multiple jobs and data sources. Pentaho provides graphical drag-and-drop components for Hadoop ecosystem projects such as Sqoop and Oozie, drastically reducing the amount of time needed to use these powerful bulk-data load and workflow utilities:

  • Sqoop – A tool designed for efficiently transferring bulk data between Hadoop and structured data stores, such as relational databases.
  • Oozie – An open-source workflow/coordination service that manages data processing jobs for Hadoop.
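
The job flow logic described above (conditional checking, execution and notification) can be sketched in a few lines of plain Python. This is only an illustration of the control flow that such components express visually; run_job_flow and its three callback parameters are hypothetical names, not part of Pentaho, Sqoop or Oozie.

```python
def run_job_flow(input_exists, load_step, notify):
    """Check a condition, execute a step, and send a notification.

    input_exists: callable returning True when the input is ready.
    load_step:    callable that performs the work and may raise.
    notify:       callable that receives a status message.
    """
    # Conditional checking: skip the job if the input is missing.
    if not input_exists():
        notify("skipped: input not found")
        return "skipped"

    # Execution: run the step and report any failure.
    try:
        load_step()
    except Exception as error:
        notify(f"failed: {error}")
        return "failed"

    # Notification: report success.
    notify("succeeded")
    return "succeeded"


# Usage with stand-in callbacks; a real flow would check a file or table
# and launch an actual load.
status = run_job_flow(lambda: True, lambda: None, print)
```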

Deep Support for Hadoop: Pentaho supports the leading Hadoop distributions and their native capabilities. Several distributions of Hadoop are available as open source projects and from commercial providers.

Hadoop distributions