Get Streaming with Pentaho 8.0

Today we are excited to announce the general availability of Pentaho 8.0. This major release delivers customers greater enterprise scalability and productivity, while protecting them from many of the challenges posed by the rapidly changing data landscape.

One way we’ve achieved this is through improved connectivity to streaming data sources. The constant generation of data from cloud, IoT devices and social applications presents an enormous opportunity for data-driven organizations to drive greater profitability, accelerate product and service innovation, and deliver exceptional customer experiences. As enterprises look to better manage growing, varied and fast-moving data, it is critical to process that data as it arrives and to react immediately when necessary.

Batch data processing frameworks, such as MapReduce, provide a “rear view mirror” into the company’s past. With data-on-air (a.k.a. real-time processing), users can make time-critical business decisions that were not possible with data-at-rest (a.k.a. batch processing) alone. Consuming stream data and blending it into the data lake is now a necessity, as is processing that data as it arrives for time-critical insight, and Pentaho 8.0 addresses both needs. Doing this well is a big competitive advantage for enterprises.

Stream data processing is going beyond event detection and alerting. Currently, many big data applications require multiple streams of data to be correlated, blended and machine analyzed on the wire in conjunction with non-realtime contextual data. From there, writing to the data lake for historical analysis is still required.

There are many ways of processing streaming data, depending on the use case and type of data: data transfer, batch processing of real-time data, real-time processing, or a combination of these. There are also many architectural and functional design patterns for stream processing, including Lambda and Kappa.

Fig 1: Stream Data Processing Architecture

Many of the stream data processing architecture frameworks are converging on Kafka to input and output stream data to and from the analytic systems. In IoT frameworks, inbound Kafka clusters are sometimes replaced by MQTT brokers or native queuing mechanisms.  

For general-purpose stream data processing, Spark Streaming is commonly used. It blends well with Spark's core batch processing capability.

  • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Kafka is a distributed, scalable, real-time publish/subscribe messaging system. It allows an architecture that decouples data producers from data consumers/processors, which makes it a highly scalable message bus for stream data ingestion and distribution.
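
To make the decoupling idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory stand-in, not the Kafka client API: the class name `TopicLog`, its methods, and the group names are all hypothetical. It only illustrates that a producer appends to a log without knowing its readers, while each consumer group tracks its own position.

```python
from collections import defaultdict
from typing import Dict, List


class TopicLog:
    """Toy in-memory stand-in for a single-partition topic (not a Kafka client)."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        # Hypothetical bookkeeping: consumer group name -> next offset to read.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset; the producer knows nothing about readers."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Return unread messages for one group and advance only that group's offset."""
        start = self._offsets[group]
        chunk = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(chunk)
        return chunk


topic = TopicLog()
for reading in ["t=20.5", "t=21.0", "t=35.2"]:
    topic.publish(reading)

# Two independent consumers read the same data at their own pace.
alerting_first = topic.poll("alerting", max_messages=2)
lake_writer_all = topic.poll("lake-writer")
alerting_second = topic.poll("alerting")
```

Because each group keeps its own offset, adding a new consumer does not require any change on the producer side, which is the property the bullet above describes.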

Pentaho 8.0 introduces new capabilities within the same framework and toolset, including the ability to:

  • Visually ingest and produce data from/to Kafka using two new steps for Kafka input and output
  • Process micro-batch chunks of data using either a time-based or a message size-based window
  • Switch processing engines between Spark (Streaming) and native Kettle
  • Use hardened data processing logic steps to process data

Pentaho co-exists with and leverages these big data technologies to provide a seamless visual data processing experience, thus lowering the barrier for enterprises to innovate on a new set of stream data processing applications.

Fig 2: Combined Data Processing Using Pentaho

These data processing applications will be able to consume stream data from various sources, including Kafka, cleanse it and blend it with contextual data at scale using Spark (streaming and core) or other streaming technologies, and then persist the blended data to the data lake. Alternatively, they can push the blended data to another stream data sink for downstream processing. Either way, this can be done continuously, in real time.
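
As a rough illustration of that consume, cleanse, blend and persist flow, the following Python sketch processes a small in-memory stream. It does not use Pentaho, Kafka or Spark; the function names, the event fields (`device_id`, `reading`, `site`) and the list standing in for the data lake are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional

Event = Dict[str, object]


def cleanse(raw: Event) -> Optional[Event]:
    """Normalize one raw event; return None if it cannot be repaired."""
    device = str(raw.get("device_id", "")).strip()
    try:
        reading = float(raw["reading"])
    except (KeyError, TypeError, ValueError):
        return None
    if not device:
        return None
    return {"device_id": device, "reading": reading}


def blend(event: Event, context: Dict[str, str]) -> Event:
    """Enrich a clean event with non-realtime contextual data (here, a site lookup)."""
    site = context.get(str(event["device_id"]), "unknown")
    return {**event, "site": site}


def run_pipeline(
    stream: Iterable[Event],
    context: Dict[str, str],
    sink: Callable[[Event], None],
) -> int:
    """Cleanse and blend each event, hand it to the sink, and return the count written."""
    written = 0
    for raw in stream:
        event = cleanse(raw)
        if event is None:
            continue  # drop events that fail cleansing
        sink(blend(event, context))
        written += 1
    return written


stream = [
    {"device_id": " pump-1 ", "reading": "20.5"},
    {"device_id": "pump-2", "reading": "not-a-number"},
    {"device_id": "pump-3", "reading": 7},
]
context = {"pump-1": "plant-a"}
data_lake: List[Event] = []  # stand-in for a data lake or a downstream stream sink
count = run_pipeline(stream, context, data_lake.append)
```

Passing the sink as a callable keeps the processing logic unchanged whether the blended events are persisted or forwarded to another stream.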

Fig 3: How to Process Stream Data in Pentaho 8.0

With these new capabilities, Pentaho offers an end-to-end business analytics platform that uses Spark and Kafka, through a visual interface, to capture and deliver millions of events per second with IoT connectivity and high-speed messaging. Users can ingest, process, and deliver insight to real-time applications and fast NoSQL data stores with big data processing.

Pentaho takes advantage of Kafka, Spark and Spark Streaming, but makes these technologies completely transparent through its adaptive execution layer [Blog post link], so users can leverage their full power and scale on-premises or in the cloud.

We're very excited about the release of Pentaho 8.0 and hope you will be too! Also visit Pedro Alves' blog to find out more about how this version improves connectivity to streaming data sources.

You can also get started NOW with Pentaho 8.0 with a free trial download.


Rakesh Saha

Senior Product Manager