Pentaho and Hadoop
In-Hadoop execution, from data prep to predictive analytics
The Pentaho Business Analytics platform provides Hadoop users with visual development tools and big data analytics to easily prepare, model, visualize and explore data sets. From data preparation and configuration to predictive analytics, Pentaho covers the data lifecycle end to end with a complete solution for your business intelligence needs.
Visual Development for Hadoop Data Prep and Modeling
Pentaho’s visual development tools reduce the time to design, develop and deploy Hadoop analytics solutions by as much as 15x compared to traditional custom-coding and ETL approaches.
- A powerful visual user interface for ingesting and manipulating data within Hadoop, which also makes it easy to enrich Hadoop data with reference data from other sources.
- Option of accessing Hadoop data either directly, or through rapid visual extraction into data marts/warehouses optimized for fast response times.
- A visual tool for defining business metadata models helps developers prepare their data for analytics.
Interactive Visualization and Exploration
Pentaho enables your IT department to share interactive data visualizations and exploration capabilities, allowing business users to build dashboards and run reports on their own.
The Hadoop data integration and business analytics platform enables users to easily review data through:
- Rich visualization – Interactive, web-based interfaces for ad hoc reporting, charting and dashboards.
- Flexible exploration – View data through dimensions such as time, product and geography, and across measures like revenue and quantity.
- Predictive analytics – Use advanced statistical algorithms such as classification, regression, clustering and association rules to spot trends or challenges ahead of time.
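To make the regression technique concrete, here is a minimal, self-contained sketch of an ordinary least-squares fit in Java. It illustrates the kind of model a predictive-analytics step applies; it is not Pentaho’s implementation, and the ad-spend/revenue figures are hypothetical.

```java
public class SimpleRegression {
    // Returns {intercept, slope} for the least-squares line through (x[i], y[i]).
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] { intercept, slope };
    }

    public static void main(String[] args) {
        // Hypothetical data: monthly revenue (y) against ad spend (x).
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 2.1, 4.0, 6.2, 7.9, 10.1 };
        double[] coef = fit(x, y);
        System.out.printf("y = %.2f + %.2f*x%n", coef[0], coef[1]);
    }
}
```

The fitted line can then score new data points, which is how a trend is "spotted ahead of time": future values are extrapolated from the model rather than read from history.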
Pentaho Instaview takes data from raw numbers to interactive visualization in only minutes. Preparation of Hadoop data for analysis is greatly simplified and automated, reducing the analytics cycle from days or weeks to minutes or hours.
Pentaho Visual MapReduce: Scalable in-Hadoop Execution
Pentaho’s Java-based data integration engine uses the Hadoop distributed cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, making use of the massive parallel processing power of Hadoop.
Pentaho can natively connect to Hadoop in the following ways:
- HDFS – Input and output directly to the Hadoop Distributed File System.
- MapReduce – Input and output directly to MapReduce programs.
- HBase – Input and output directly to HBase, a NoSQL database optimized for use with Hadoop.
- Hive – A JDBC driver that enables interaction with Hadoop via HiveQL (HQL), a SQL-like query and data definition language (DDL).
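The Hive route uses standard JDBC, so any Java code can query Hadoop the same way it would query a relational database. The sketch below assumes a running HiveServer2 instance; the host name, user, and table are placeholders, and the Hive JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    // Builds a HiveServer2 JDBC URL (jdbc:hive2://host:port/database).
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder host and credentials for illustration only.
        String url = hiveUrl("hadoop-master.example.com", 10000, "default");
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT product, SUM(revenue) FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```

Behind the scenes, HiveServer2 compiles the HiveQL statement into MapReduce work, which is why a familiar SQL-style query can run against HDFS-resident data.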
Multi-Threaded Engine for Faster Execution
The Pentaho Data Integration engine is multi-threaded, with each step in a job executing on one or more threads. Multi-core processors running on each data node of the cluster are fully leveraged, eliminating the need for specialized, multi-threaded programming techniques.
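The step-per-thread idea can be sketched in plain Java: two pipeline steps, each on its own thread, connected by a bounded queue that provides back-pressure. This is an illustrative simplification, not Pentaho’s engine code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StepPipeline {
    private static final Integer DONE = Integer.MIN_VALUE; // end-of-stream marker

    // Runs a two-step pipeline: a generator step and a transform step,
    // each on its own thread, connected by a bounded queue.
    static List<Integer> run(int count) throws InterruptedException {
        BlockingQueue<Integer> hop = new ArrayBlockingQueue<>(4);
        List<Integer> out = new ArrayList<>();

        Thread generate = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) hop.put(i);
                hop.put(DONE);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread transform = new Thread(() -> {
            try {
                for (Integer v = hop.take(); !v.equals(DONE); v = hop.take()) {
                    out.add(v * v); // the downstream step's work
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        generate.start();
        transform.start();
        generate.join();
        transform.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(5)); // prints [1, 4, 9, 16, 25]
    }
}
```

Because each step runs concurrently, a multi-core data node keeps several steps busy at once without the developer writing any thread-management code.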
In addition, the Pentaho Data Integration engine executes as a single MapReduce task, instead of the multiple tasks typical of machine-generated or hand-coded programs and Pig scripts. As a result, Pentaho MapReduce jobs typically execute many times faster than other methods.
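For context on what a visual transformation replaces, here is a plain-Java sketch of the classic map and reduce phases for a word count. Hadoop’s real API adds job configuration, key/value serialization types, and cluster distribution; this simplified version only shows the two-phase logic a developer would otherwise hand-code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\W+"))
                if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Reduce phase: sum the emitted counts for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data", "big analytics");
        System.out.println(reduce(map(lines))); // word -> count totals
    }
}
```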
The table below compares common Hadoop tasks performed with traditional MapReduce programming against the same tasks performed through Pentaho’s visual interface:
Intuitive Data Orchestration Interface
Data orchestration is made easy via Pentaho’s library of graphical job flow components, which include conditional checking, event waiting, execution and notification.
Together, these components can be combined to enable visual assembly of powerful job flow logic across multiple jobs and data sources.
Pentaho provides graphical drag-and-drop components for Hadoop ecosystem projects such as Sqoop and Oozie, drastically reducing the amount of time needed to use these powerful bulk-data load and workflow utilities:
- Sqoop – A tool designed for efficiently transferring bulk data between Hadoop and structured data stores, such as relational databases.
- Oozie – An open-source workflow/coordination service that manages data processing jobs for Hadoop.
Deep Support for Hadoop
Pentaho fully supports the leading Hadoop-based distributions and their native capabilities, such as MapR’s high-performance NFS-mountable file system. Several distributions of Hadoop are available as open source projects and from commercial providers.
Pentaho Customer Success with Cloudera Hadoop