Lightning-Fast Analytics

Big data comes with a host of adjacent technologies, such as MQTT, Spark, and Storm. Pentaho Labs is at the forefront of these technologies and works on experimental and customer-driven projects to optimize how data is processed and uncover the best ways to execute real-time analytics with big data.

Meet the Expert

Ken Wood
VP of Pentaho Labs

“The open source foundation of Pentaho and efforts taking place in Pentaho Labs allow us to quickly innovate with emerging technologies, so that we may remain autonomous and provide flexibility to the big data community.”

Twitter: @KenWoodOnTech

As VP of Pentaho Labs, Ken Wood is dedicated to leading the organization to the next level of big data analytics innovation. His expertise comes from working with Hitachi’s advanced technologies when he served as CTO of the Technical Incubation Group at Hitachi Data Systems.


Message Queuing Telemetry Transport (MQTT)

MQTT is a simple and lightweight publish-subscribe messaging protocol. It was created by IBM in 1999 and is used today in a variety of IoT environments such as smart homes, manufacturing, energy, and more. The open standard protocol has an extensive ecosystem of libraries and support and requires minimal packet overhead, features that have made MQTT a leading IoT protocol.

How Pentaho Labs is Innovating with MQTT:

  • Pentaho users can now more easily and quickly ingest, interact with, blend, and react to incoming streams of IoT data.
  • Users do this by publishing messages under a specified topic to a public or private broker service, such as Amazon’s IoT Service. The broker routes each message to the subscribers of that topic.
  • Pentaho users can integrate IoT data with data from business systems and incorporate modern analytics to capitalize on the promise of IoT.

Pentaho Data Integration’s MQTT Subscriber input and MQTT Publisher output steps are available in the marketplace.
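
The topic-based routing described above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not Pentaho's MQTT steps or a real broker: the InMemoryBroker class, the topic_matches helper, and the plant/... topic names are all hypothetical. Only the standard MQTT "+" (single-level) and "#" (multi-level) wildcards are modeled; connections, QoS, retained messages, and "$"-prefixed system topics are left out.

```python
from typing import Callable, List, Tuple

# A handler receives the concrete topic and the message payload.
Handler = Callable[[str, str], None]


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a concrete topic."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent and everything below it.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # Without "#", filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


class InMemoryBroker:
    """Hypothetical stand-in for a broker: routes messages by topic."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[str, Handler]] = []

    def subscribe(self, topic_filter: str, handler: Handler) -> None:
        self._subscriptions.append((topic_filter, handler))

    def publish(self, topic: str, payload: str) -> int:
        """Deliver the payload to every matching subscriber; return the count."""
        delivered = 0
        for topic_filter, handler in self._subscriptions:
            if topic_matches(topic_filter, topic):
                handler(topic, payload)
                delivered += 1
        return delivered


if __name__ == "__main__":
    broker = InMemoryBroker()
    broker.subscribe("plant/+/temperature", lambda t, p: print(t, p))
    broker.publish("plant/line1/temperature", "71.5")  # delivered
    broker.publish("plant/line1/humidity", "40")       # no subscriber matches
```

In this sketch the publisher never addresses a subscriber directly. It only names a topic, which is the property that lets new consumers of IoT data be added without changing the devices that produce it.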


Spark

When Spark is used to manage cluster operations, large amounts of unstructured data can be processed much more quickly and efficiently. In proof-of-concept work by the Labs team, Spark and Pentaho have been shown to process data upwards of 100 times faster than Hadoop.

Spark-Pentaho Use Cases Being Prototyped:

  • Interactive Spark: Operationalize Spark Scala scripts
  • Spark Streaming: Enable real-time feeds for complex event processing, alerting, and monitoring

Storm on YARN

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm adds reliable real-time data processing capabilities to Hadoop and other data storage systems. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations.

How Pentaho is Innovating with Storm:

  • Existing Pentaho transformations, including those used in Pentaho MapReduce, can be executed as real-time processes via Storm
  • Users can reuse existing transformations to process data immediately as it arrives
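
The idea of reusing one transformation for both batch and real-time processing can be illustrated with a small Python sketch. It is a conceptual illustration only and does not use Storm or Pentaho APIs: the to_fahrenheit transformation, the run_batch and run_streaming runners, and the temp_c field are all hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, float]


def to_fahrenheit(row: Row) -> Row:
    """Hypothetical transformation: add a Fahrenheit field to a reading."""
    return {**row, "temp_f": row["temp_c"] * 9 / 5 + 32}


def run_batch(rows: List[Row]) -> List[Row]:
    """Batch mode: transform a complete data set and return all results."""
    return [to_fahrenheit(row) for row in rows]


def run_streaming(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming mode: transform each row as it arrives, one at a time."""
    for row in rows:
        yield to_fahrenheit(row)


if __name__ == "__main__":
    readings = [{"temp_c": 0.0}, {"temp_c": 100.0}]
    print(run_batch(readings))
    for result in run_streaming(iter(readings)):
        print(result)
```

The transformation logic is written once. Only the runner around it changes, which is the reuse the bullets above describe.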