Pentaho and Hadoop

In-Hadoop execution, from data prep to predictive analytics

The Pentaho Business Analytics platform provides Hadoop users with visual development tools and big data analytics to easily prepare, model, visualize and explore data sets. From data preparation and configuration to predictive analytics, Pentaho covers the data lifecycle end to end with a complete solution for your business intelligence needs.

Visual Development for Hadoop Data Prep and Modeling

Pentaho’s visual development tools reduce the time to design, develop and deploy Hadoop analytics solutions by as much as 15x compared with traditional custom coding and ETL approaches.

  • A powerful visual user interface for ingesting and manipulating data within Hadoop that also makes it easy to enrich Hadoop data with reference data from other sources.
  • The option of accessing Hadoop data either directly or through rapid visual extraction into data marts/warehouses optimized for fast response times.
  • A visual tool for defining business metadata models that helps developers prepare their data for analytics.

Interactive Visualization and Exploration 

Pentaho enables your IT department to share interactive data visualizations and exploration capabilities, allowing business users to build dashboards and run reports on their own.

The Hadoop data integration and business analytics platform enables users to easily review data through:

  • Rich visualization – Interactive, web-based interfaces for ad hoc reporting, charting and dashboards.
  • Flexible exploration – View data through dimensions such as time, product and geography, and across measures like revenue and quantity.
  • Predictive analytics – Use advanced statistical algorithms such as classification, regression, clustering and association rules to spot trends or challenges ahead of time.
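For a concrete sense of the predictive piece, the minimal sketch below trains a decision-tree classifier with the open-source Weka library, which underlies Pentaho’s data mining capabilities. The training file name and the assumption that the last attribute holds the class label are illustrative only, not part of any Pentaho workflow.

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ChurnClassifierSketch {
        public static void main(String[] args) throws Exception {
            // Load a training set (file name is a placeholder); the last
            // attribute is assumed to be the class label, e.g. churned yes/no.
            Instances data = DataSource.read("customer-history.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Train a decision-tree classifier and print the learned model.
            Classifier tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }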

Instant Analysis

Pentaho Instaview takes data from raw numbers to interactive visualization in only minutes. Preparation of Hadoop data for analysis is greatly simplified and automated, reducing the analytics cycle from days or weeks to minutes or hours.   

Pentaho Visual MapReduce: Scalable in-Hadoop Execution

Pentaho’s Java-based data integration engine works with the Hadoop distributed cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, taking advantage of Hadoop’s massively parallel processing power.
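The sketch below illustrates the underlying Hadoop mechanism rather than Pentaho’s own code: a library JAR already uploaded to HDFS is attached to a MapReduce job through the distributed cache, so it lands on the classpath of every task on every data node. The JAR path, job name and input/output arguments are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EngineDeploymentSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "engine-deployment-sketch");
            job.setJarByClass(EngineDeploymentSketch.class);

            // Ship a library JAR (already uploaded to HDFS, path is hypothetical)
            // with the job; the distributed cache copies it to every node and adds
            // it to the task classpath, so the embedded engine is available
            // wherever map and reduce tasks run.
            job.addFileToClassPath(new Path("/libs/integration-engine.jar"));

            // With no mapper or reducer set, the identity classes run; a real job
            // would configure classes that invoke the shipped engine.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }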

Pentaho can natively connect to Hadoop in the following ways:

  • HDFS – Input and output directly to the Hadoop Distributed File System.
  • MapReduce – Input and output directly to MapReduce programs.
  • HBase – Input and output directly to HBase, a NoSQL database optimized for use with Hadoop.
  • Hive – A JDBC driver that enables interaction with Hadoop via the Hive Query Language (HiveQL), a SQL-like query and data definition language.
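As an example of the last item, the following minimal sketch queries Hive through its JDBC driver (shown here for HiveServer2, org.apache.hive.jdbc.HiveDriver). The hostname, credentials, table and query are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 listens on port 10000 by default; host, user and
            // table below are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(revenue) FROM sales GROUP BY product")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }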

Multi-Threaded Engine for Faster Execution

The Pentaho Data Integration engine is multi-threaded, with each step in a transformation executing on one or more threads. This fully leverages the multi-core processors on each data node of the cluster, eliminating the need for specialized multi-threaded programming techniques.
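The following is a conceptual sketch of that step-per-thread model, not Pentaho’s implementation: two "steps" run on their own threads and hand rows to each other through a bounded queue, so multiple cores stay busy without any explicit threading work by the transformation designer.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class StepPerThreadSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> rows = new ArrayBlockingQueue<>(1024);
            final String EOF = "__END__"; // sentinel marking end of the row stream

            // "Input step": generates rows and pushes them downstream.
            Thread inputStep = new Thread(() -> {
                try {
                    for (int i = 0; i < 10; i++) rows.put("row-" + i);
                    rows.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // "Transform step": consumes rows on its own thread in parallel.
            Thread transformStep = new Thread(() -> {
                try {
                    for (String row = rows.take(); !row.equals(EOF); row = rows.take()) {
                        System.out.println(row.toUpperCase()); // the "transformation"
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            inputStep.start();
            transformStep.start();
            inputStep.join();
            transformStep.join();
        }
    }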

In addition, the Pentaho Data Integration engine executes as a single MapReduce task, instead of the multiple tasks typical of machine-generated or hand-coded programs and Pig scripts. As a result, Pentaho MapReduce jobs typically execute many times faster than these alternatives.

The table below compares how common Hadoop tasks are performed with traditional MapReduce programming and with Pentaho’s visual interface:

MapReduce programming vs. Pentaho’s visual interfaces for ingesting, manipulating and extracting data

Intuitive Data Orchestration Interface

Data orchestration is made easy via Pentaho’s library of graphical job flow components, which include conditional checking, event waiting, execution and notification.

Pentaho's Graphical Job Flow Components

These components can be combined to visually assemble powerful job flow logic across multiple jobs and data sources.

Pentaho provides graphical drag-and-drop components for Hadoop ecosystem projects such as Sqoop and Oozie, drastically reducing the amount of time needed to use these powerful bulk-data load and workflow utilities:

  • Sqoop – A tool designed for efficiently transferring bulk data between Hadoop and structured data stores, such as relational databases.
  • Oozie – An open-source workflow/coordination service that manages data processing jobs for Hadoop.
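Within Pentaho these utilities are driven through drag-and-drop job entries; for contrast, the minimal sketch below shows what submitting a workflow looks like through Oozie’s own Java client API. The Oozie URL, HDFS paths and property values are placeholders.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the Oozie server (URL is a placeholder).
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // Point the client at a workflow definition stored in HDFS.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode:8020/user/etl/workflows/nightly-load");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            // Submit and start the workflow, then report its status.
            String jobId = oozie.run(conf);
            WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
            System.out.println("Workflow " + jobId + " is " + status);
        }
    }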

Deep Support for Hadoop

Pentaho fully supports the leading Hadoop distributions, including native capabilities such as MapR’s high-performance NFS-mountable file system. Several distributions of Hadoop are available as open-source projects and from commercial providers.

Hadoop distributions