Pentaho Labs

Pentaho Labs drives innovation in big data integration and analytics by incubating breakthrough technologies. Innovations include advanced visualization, big data templates, and real-time and predictive analytics. Visit often to see what is cooking in the labs next.

Big Data Innovations That Bring Greater Flexibility

Pentaho with Spark

Apache Spark™ - Engine for Large-Scale Data Processing

Pentaho leads the way in Big Data by enabling developers to process data at scale. Recent work investigating and prototyping numerous potential use cases where Spark can be leveraged within a customer environment has shown positive results.  The labs team continues to future-proof big data investments with the incubation of new breakthrough technologies, as is the case with Spark.  

Process Data in Memory

With Pentaho and Spark, customers can process data in memory up to 100 times faster than with Hadoop MapReduce. This makes it easier for them to meet SLAs with their own customers by delivering insights faster.

Current Use Cases being Prototyped:

Queries directly on Spark SQL

  • Enable access to Spark SQL in Pentaho Data Integration through JDBC
  • Use Spark as a data source for pixel perfect reports
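As a rough illustration of the JDBC route, the sketch below builds connection settings for a Spark SQL endpoint. It assumes the endpoint is Spark's Thrift JDBC server, which speaks the HiveServer2 protocol and is normally reached with a jdbc:hive2:// URL and the Hive JDBC driver; the page does not say which mechanism the prototype uses. The helper name, host name and settings dictionary are hypothetical.

```python
def spark_sql_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL for a Spark SQL Thrift server.

    Assumption: the server listens on the HiveServer2 default port 10000.
    """
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical settings a generic JDBC connection might be given.
connection = {
    "url": spark_sql_jdbc_url("spark-host.example.com"),
    "driver_class": "org.apache.hive.jdbc.HiveDriver",  # Hive JDBC driver class
}

print(connection["url"])
```

The same URL and driver class could then be entered in any JDBC-capable tool, provided the Hive JDBC driver is on its classpath.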

Orchestrate Spark Parallel Execution

  • Enable the execution of Spark Applications from Pentaho Data Integration
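One plausible way for an orchestration tool to launch a Spark application is to assemble a spark-submit command line and hand it to the operating system. The sketch below only builds the argument list; it is not Pentaho's implementation. The flags shown (--class, --master, --deploy-mode, --conf) are standard spark-submit options, while the function name, jar path, class name and arguments are hypothetical.

```python
import shlex


def spark_submit_command(app_jar, main_class, master="yarn",
                         deploy_mode="cluster", conf=None, app_args=()):
    """Return the argv list for a spark-submit invocation."""
    cmd = ["spark-submit",
           "--class", main_class,
           "--master", master,
           "--deploy-mode", deploy_mode]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)      # the application jar comes after the options
    cmd.extend(app_args)     # anything after the jar goes to the application
    return cmd


cmd = spark_submit_command(
    "/opt/jobs/wordcount.jar",            # hypothetical jar
    "com.example.WordCount",              # hypothetical main class
    conf={"spark.executor.memory": "2g"},
    app_args=["hdfs:///input", "hdfs:///output"],
)
print(shlex.join(cmd))
```

An orchestrating job could pass the resulting list to subprocess.run and treat a non-zero exit code as a failed step.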

Visual Spark ETL

  • Enable Pentaho Data Integration transformations to be executed at scale inside of Spark

Interactive Spark

  • Operationalize Spark Scala Scripts

Spark Streaming

  • Enable real-time feeds for:
    • Complex Event Processing
    • Alerting
    • Monitoring


Pentaho with Storm on YARN

Pentaho with Storm on YARN for Real Time Analytics

Pentaho continues to lead the way in big data, enabling developers to run big data analytics in real time with Pentaho Data Integration (PDI) and Storm on YARN, which speeds critical decisions based on time-sensitive data.

The YARN-based architecture of Hadoop provides a more general processing platform not constrained to MapReduce. Pentaho Labs continues to future-proof big data investments with the incubation of new breakthrough technologies such as YARN and Storm.

Developers can now be productive immediately with one of the most popular distributed stream processing systems available today. Existing Pentaho transformations, including those used in Pentaho MapReduce, can be executed as real-time processes via Storm. This combination brings data to business users without delay and without the overhead of designing additional transformations.

Process Data as it Arrives

With Pentaho, processing begins as soon as data arrives from the source, and valuable data sets are delivered immediately. Up-to-the-second insights are available for key business metrics, delivered as real-time dashboards, reports, or intermediate data sets that existing applications can use.

Leverage Existing Data Transformations

Many customers have long-running batch Pentaho jobs that run within Hadoop via MapReduce. Pentaho for Storm complements these jobs, allowing developers to reuse existing transformations to process data immediately. Both batch and real-time workflows are powered by Pentaho. Existing developers can build upon years of knowledge to learn the most from their data, instantly.

Pentaho with Storm allows developers to reuse their knowledge and components to process data differently. Deliver data when it’s needed - all in a familiar environment.
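The reuse idea can be pictured with a small sketch in plain Python (not the PDI or Storm APIs): a single row-level transformation is defined once and then driven either over a complete batch or one record at a time as records arrive. All names and the sample fields are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def enrich(row: Row) -> Row:
    """Hypothetical row-level transformation: add an order total."""
    out = dict(row)  # copy so the input row is left untouched
    out["total"] = row["quantity"] * row["unit_price"]
    return out


def run_batch(rows: Iterable[Row]) -> List[Row]:
    """Batch style: process the whole data set, then return all results."""
    return [enrich(r) for r in rows]


def run_streaming(source: Iterable[Row]) -> Iterator[Row]:
    """Streaming style: emit each result as soon as its input row arrives."""
    for row in source:
        yield enrich(row)


orders = [
    {"quantity": 2, "unit_price": 5},
    {"quantity": 1, "unit_price": 12},
]

batch_result = run_batch(orders)
stream_result = list(run_streaming(iter(orders)))
print(batch_result == stream_result)
```

Because the transformation logic lives in one place, only the driver around it changes between the batch and the real-time workflow.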

Next Steps

Today, Pentaho with Storm can process many existing transformations, but this is still an innovation in incubation. To learn more visit:


Pentaho Advanced Visualizations

Visualize Blended Big Data for Deeper Insights

As a leader in Big Data, Pentaho helps users visually explore more sophisticated sets of structured and unstructured data, delivering greater insights through visual representations of blended data. Compared to traditional data exploration, Pentaho enables a new level of understanding from data that requires more than two-dimensional visualizations.

From healthcare to shipping logistics to supercomputing, the relationships between data and the meaning behind the data require new types of visualizations to extract business insights. The challenge for many analytics vendors is that rigid, proprietary visualization frameworks prevent integration with the growing set of open frameworks. By embracing the open model, Pentaho is able to innovate quickly and ultimately provide support for new ways of visualizing big data.

Advanced Visualization Gallery

The Pentaho Labs engineers experiment with advanced visualizations using open visualization libraries, unlocking possibilities for new use cases. Below are some examples of visualizations currently being developed within Pentaho Labs, designed to address both broad and very specific analytic needs.

Polygonal Geomap – Multidimensional Visualization Map

Advanced maps represent political, environmental, property, or business-based boundaries and use custom color gradients to aid understanding.
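A custom color gradient of this kind usually comes down to mapping each region's value onto a color scale. The sketch below shows one simple approach, linear interpolation between two RGB colors; it is an illustrative assumption, not the method used by the Labs visualization, and the function name and default colors are hypothetical.

```python
def gradient_color(value, vmin, vmax, low=(255, 255, 204), high=(0, 104, 55)):
    """Map a value in [vmin, vmax] to a hex color between `low` and `high`."""
    if vmax == vmin:
        t = 0.0                      # degenerate range: use the low color
    else:
        t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))        # clamp out-of-range values
    # Interpolate each RGB channel separately.
    r, g, b = (round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
    return f"#{r:02x}{g:02x}{b:02x}"


# Hypothetical measure per region, colored on a light-to-dark scale.
population = {"North": 0, "South": 100}
colors = {name: gradient_color(v, 0, 100) for name, v in population.items()}
print(colors)
```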

Sankey – Relationship Diagram

The Sankey diagram shows how measures or values from one dimension or level of data flow into, and contribute to, another one. 
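The data behind such a diagram is typically a list of weighted links between the members of two dimensions. The sketch below aggregates a measure by (source, target) pair to produce those links; the record fields and the function name are hypothetical and not taken from Pentaho.

```python
from collections import defaultdict


def sankey_links(records, source_key, target_key, measure_key):
    """Sum a measure for every (source, target) pair found in the records."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec[source_key], rec[target_key])] += rec[measure_key]
    # One link per pair; the value determines the width of the flow.
    return [
        {"source": s, "target": t, "value": v}
        for (s, t), v in sorted(totals.items())
    ]


records = [
    {"region": "EMEA", "product": "Widgets", "revenue": 120.0},
    {"region": "EMEA", "product": "Gadgets", "revenue": 80.0},
    {"region": "APAC", "product": "Widgets", "revenue": 50.0},
    {"region": "EMEA", "product": "Widgets", "revenue": 30.0},
]

links = sankey_links(records, "region", "product", "revenue")
for link in links:
    print(link)
```

Here revenue flows from the region dimension into the product dimension, and the two EMEA/Widgets rows are combined into a single link.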

Wafer - Flow of Information Through a Circuit

Industry-specific visualizations are available through Pentaho’s open architecture. This visualization was created to show quality levels of integrated circuits printed on a silicon wafer. New visualizations like this can be created as plugins in Pentaho Business Analytics.

Blended GIS - Pentaho and 3rd Party Integrated Geomap

The above visualization shows an ESRI map displayed with Pentaho’s mapping bubbles to indicate additional values.