Pentaho + MQTT = IoT

Gartner predicts that more than half of major new business processes and systems will incorporate some element of IoT by 2020. And yet, through 2018, 75 percent of those IoT projects will take up to twice as long as planned.

For this reason, there’s a growing demand from companies to more easily and quickly ingest, blend, interact with, and react to incoming streams of IoT data. To address this demand, today Pentaho announced a Labs integration with MQTT, a popular machine-to-machine IoT transport protocol, to act as the connecting link between physical devices and the data integration process.

Pentaho’s open source foundation and community have enabled my team at Pentaho Labs to innovate early with emerging IoT technologies, such as MQTT, so that enterprise organizations can be prepared for a more connected future.

Taking advantage of Pentaho with MQTT is a unique opportunity for users to advance their big data architectures, integrate with complex data, and incorporate modern analytics to capitalize on the promise of connected devices and machine-generated data. Ultimately, it will enable users to future-proof their IoT deployments without risk.

In this blog, Mark Hall and I go in depth on the what, why, where and how of MQTT and Pentaho.

Who, or more correctly, what uses MQTT for Machine-to-Machine (M2M) communication and other Internet of Things (IoT) applications and solutions? The uses of MQTT are far too numerous to describe in this blog, but today Pentaho Labs announced that Pentaho can now be a direct and integral part of this important IoT architecture and protocol. Two of the more popular MQTT examples are Facebook’s Messenger app and St. Jude Medical’s remote monitoring of patient implants. If you search around, you’ll find MQTT as the transport for smart power meters, pipeline monitoring solutions, transportation systems, factory equipment, retail systems and much, much more. IBM created MQTT in 1999, and it is now an open standard with an extensive ecosystem of libraries and support.

MQTT stands for Message Queuing Telemetry Transport. MQTT is popular for IoT applications for several reasons; we touch on a few here. It is a publish/subscribe architecture with a simple and lightweight messaging protocol designed for constrained devices and for low-bandwidth, high-latency or unreliable networks. The payload data in an MQTT message is not used or processed by the protocol; it is just “moved” as is.

There are three main elements to the MQTT architecture – the Publisher, the Subscriber, and the Broker. The diagram in Figure 1 shows the high-level architecture of MQTT. The Broker is a special part of the MQTT design and is not provided by Pentaho. We like to think of the MQTT Broker as similar to Hadoop; Pentaho connects and integrates with Hadoop, but we don’t provide Hadoop. 

 High Level Diagram of MQTT Architecture

It is important to point out that this initiative is not formally supported by Pentaho, and there are currently no plans on the roadmap to support MQTT. Pentaho Labs was introduced to test out new ideas and explore emerging technologies with the current community and user base as those technologies go through the maturation process. Historically, Pentaho has had forward-thinking users, so we hope you will download this new plugin, and we would appreciate any feedback or comments on it.

 The new MQTT steps shown in the Input and Output design palette.

Users can download the MQTT Plugin from the Pentaho Marketplace or directly from the Marketplace feature in PDI (automatic download to install).

After installation, there will be two new steps in PDI’s Input and Output categories: “MQTT Subscriber” in the Input category and “MQTT Publisher” in the Output category. These new steps connect you to an MQTT Broker and the world of IoT. You can now develop IoT-based applications and solutions by graphically programming your transformations and jobs to create (publish) IoT data streams, or to connect to (subscribe), collect and process IoT data streams. Figure 2 shows the location of these new steps in PDI.

Before you start using these MQTT steps, you will need access to a running MQTT Broker. You can install an MQTT Broker directly on your computer or on your local area network, but here we describe how to use public broker services on the internet. Several public MQTT broker services are available for experimentation and testing. Some are free; some, like Amazon’s IoT service, are free at the free-tier level and then charge by the message; others are paid from the start. The public MQTT brokers we have experimented with include:

  • tcp:// – free!
  • tcp:// – free!
  • tcp:// – free!
  • ssl:// – which is initially free, so be careful

We have also experimented with the most popular MQTT broker, Mosquitto, on a local area network, both as a virtual machine and in a Docker container. MQTT brokers are typically accessed with a URL in the format tcp://. This is the standard URL format, and 1883 is the default port. Over this kind of connection, message payloads are moved in the clear. You can also use ssl:// for a secure connection. To access AWS’s IoT service you need to use SSL and port 8883, along with other information for accessing their broker service. With any ssl:// URL, you will need the associated public and private keys and certificates; for the AWS service these are provided by AWS, and we will discuss this aspect in a later blog.
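To make the URL conventions concrete, here is a minimal Python sketch that splits a broker URL into its parts and fills in a default port. It is only an illustration, not how the PDI steps parse URLs: the function name `parse_broker_url` and the host `broker.example.com` are hypothetical, and using 8883 as the fallback port for ssl:// is an assumption based on the port conventionally used for MQTT over SSL/TLS.

```python
from urllib.parse import urlsplit

# 1883 is the default port for plain TCP. 8883 as the ssl:// fallback is an
# assumption based on the conventional MQTT-over-TLS port.
DEFAULT_PORTS = {"tcp": 1883, "ssl": 8883}


def parse_broker_url(url):
    """Split a broker URL into (scheme, host, port, encrypted)."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("broker URL has no host")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port, parts.scheme == "ssl"


# Hypothetical host name, for illustration only.
print(parse_broker_url("tcp://broker.example.com"))
print(parse_broker_url("ssl://broker.example.com:8883"))
```

The last element of the returned tuple is a reminder that only ssl:// connections encrypt the payload in transit.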

Enough about the broker. The other two major components to the MQTT architecture, the Publisher and Subscriber, are used to communicate with broker(s). Publishers push messages to the broker under a specified topic and the broker “routes” the messages to subscribers of specified topic(s). This is accomplished by publishing a data payload message, along with a topic, and then subscribing to the topic in order to receive the data message payload.
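The routing idea can be sketched in a few lines of Python. This is a toy, in-memory stand-in written for this post, not a real broker and not part of the plugin: the class name `TinyBroker` is hypothetical, it matches topics by exact name only, and it ignores everything else a real broker handles (network connections, QoS, and so on).

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory broker: routes payloads to subscribers by exact topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        # The payload is opaque to the broker: it is handed on untouched.
        delivered = 0
        for callback in self._subscribers.get(topic, []):
            callback(topic, payload)
            delivered += 1
        return delivered


broker = TinyBroker()
received = []
broker.subscribe("pen/number/test", lambda topic, payload: received.append((topic, payload)))

broker.publish("pen/number/test", b"hello")     # reaches the subscriber
broker.publish("some/other/topic", b"ignored")  # nobody subscribed, so nothing is delivered
print(received)
```

Note that the publisher and the subscriber never refer to each other; the topic name is the only thing they share.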

In the larger scheme of things, we like to describe the MQTT Subscriber as the “data source connector” for PDI, similar to the other connectors used when connecting to data sources in PDI. By using the MQTT Subscriber step, PDI can connect to and collect data from the world of IoT apps that use MQTT. Along the same lines, the MQTT Publisher can be described as the “data target connector” step within PDI.

Let’s describe setting up a simple MQTT application that takes some data, packages it up into a stream of messages, publishes the messages to a publicly available broker on the internet under a topic, then subscribes to the topic and receives that data as a stream of messages. This is set up by running a publishing job and transformation with PDI on a Raspberry Pi that is configured to publish messages forever.

This simple example and its configuration are shown in Figure 3.

 The configuration of this demonstration showing a Raspberry Pi running PDI and publishing a data stream to a public broker, and another PDI transformation connecting to the public broker and subscribing to the message stream.

The publishing side of this configuration creates and collects some data, assembles it into a message, publishes the message under a topic, and then repeats the whole process. The step descriptions are:

  1. Using PDI’s “Generate Random Credit Card Number” step, generate a random number.
  2. Generate an MD5 hash of the random number.
  3. Get the device’s system time as a timestamp and the device’s IP address.
  4. Convert all of this data into a single delimited record named “stringMessage”.
  5. Insert the topic “pen/number/test”.
  6. Write the message to the log for viewing, while the MQTT Publisher step pushes the message to the public MQTT broker.
  7. Using a Job, add a 2-second delay between messages and set the Job to repeat indefinitely.
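For readers who think in code, the publishing steps above can be approximated in plain Python. This is a rough sketch under several assumptions, not the PDI transformation itself: the helper names are hypothetical, the random number is simply 16 random digits rather than the output of PDI’s credit card step, the timestamp format is a guess, the MD5 hash is assumed to be taken over the number’s digits as text, and `publish` is a placeholder callback standing in for the MQTT Publisher step. The field order and the tilde separator follow the message format described later in this post.

```python
import hashlib
import random
import socket
import time
from datetime import datetime

TOPIC = "pen/number/test"


def random_16_digit_number(rng=random):
    # Stand-in for PDI's "Generate Random Credit Card Number" step:
    # 16 random digits, with no card-number checksum.
    return "".join(str(rng.randint(0, 9)) for _ in range(16))


def local_ip():
    # Best-effort lookup; falls back to loopback if the host name does not resolve.
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"


def build_message(number, ip, now=None):
    """Join timestamp, number, MD5 hash and IP address with tildes."""
    now = now or datetime.now()
    digest = hashlib.md5(number.encode("ascii")).hexdigest()
    stamp = now.isoformat(sep=" ", timespec="milliseconds")  # assumed format
    return "~".join([stamp, number, digest, ip])


def publish_loop(publish, delay_seconds=2.0, max_messages=None):
    """Call publish(topic, message) repeatedly; max_messages=None repeats indefinitely."""
    sent = 0
    while max_messages is None or sent < max_messages:
        publish(TOPIC, build_message(random_16_digit_number(), local_ip()))
        sent += 1
        if max_messages is None or sent < max_messages:
            time.sleep(delay_seconds)
    return sent


# Demo: print two messages instead of sending them to a broker.
publish_loop(lambda topic, message: print(topic, message), delay_seconds=0, max_messages=2)
```

In the real demonstration the loop and the 2-second delay live in the PDI Job, and the transformation only builds and publishes a single message per run.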

On the subscription side of this configuration, the transformation connects to the public MQTT broker and subscribes to the topic to receive the message sent by the publisher. The step descriptions are:

  1. The MQTT Subscriber step connects and receives the message for the topic “pen/number/test”
  2. Write to the log to view the incoming messages.
  3. Output the messages to a CSV file to store them.
  4. The MQTT Subscriber step has a setting to execute indefinitely.

The Write to Log step is split from the MQTT Publisher step in order to monitor the incoming data stream. In a real application it could be removed for efficiency. A screenshot of the running MQTT Subscriber transformation is shown in Figure 4.

 Screenshot of the subscriber transformation collecting the streaming data and the logs displaying the incoming messages.

The message format is constructed with the device’s timestamp, a 16-digit random number, the MD5 hash of that number and the device’s IP address, all tilde “~” separated. While not necessary, the topic (pen/number/test) is also included in the log output.
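On the receiving side, a message in this format can be split back into its fields. The Python sketch below is illustrative only: the names `Reading`, `parse_message` and `hash_matches` are hypothetical, the sample message is built inside the example rather than copied from the real log, and the check assumes the MD5 hash was computed over the number’s digits as ASCII text and written in hexadecimal.

```python
import hashlib
from typing import NamedTuple


class Reading(NamedTuple):
    timestamp: str
    number: str
    md5_hex: str
    ip_address: str


def parse_message(message):
    """Split a tilde-separated message into its four fields."""
    fields = message.split("~")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tilde-separated fields, got {len(fields)}")
    return Reading(*fields)


def hash_matches(reading):
    # Assumes the hash is the hex MD5 of the number's digits as ASCII text.
    expected = hashlib.md5(reading.number.encode("ascii")).hexdigest()
    return reading.md5_hex.lower() == expected


# Build a sample message for illustration (values are made up).
number = "1234567890123456"
sample = "~".join([
    "2017-01-01 12:00:00",
    number,
    hashlib.md5(number.encode("ascii")).hexdigest(),
    "192.0.2.10",
])
reading = parse_message(sample)
print(reading.ip_address, hash_matches(reading))
```

Carrying the hash alongside the number gives the subscriber a cheap way to spot a corrupted or truncated message.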

For completeness, a photo of the publisher executing on a Raspberry Pi is shown in Figure 5 (please excuse the messy desk).

 Photo of the publisher job running on only one of the Raspberry Pis shown.

For this specific demonstration, let’s look at how the MQTT Publisher and Subscriber steps are set up. The other steps used in this demonstration are fairly straightforward; in the follow-on MQTT blogs, we’ll dive deeper into more robust transformations. So let’s take a look at how the MQTT Publisher step is configured in PDI in a test environment. Figure 6 shows the dialog for the MQTT Publisher step.

  • Open the MQTT Publisher step’s dialog.
  • Enter the broker’s URL (We’re using tcp://
  • For the “Topic name” either pull the topic from an incoming field or enter it directly.
  • In this case, the topic is “pen/number/test”, which is entered directly.
  • For the “Incoming Message field”, the value can again come from an incoming field or be entered directly. In this case “stringMessage” is entered directly.
  • Just about any “Client ID” can be entered, but it must be unique among the clients connected to the same broker.
  • For this demonstration, we can leave the “Connection timeout” and “QoS” at their defaults of 30 seconds and 0, respectively.

 Configuring the MQTT Publisher step through its dialog.

That’s it! This MQTT Publisher is ready to connect to the broker and start pushing out its messages.
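As a way of summarizing the dialog, the same settings can be written down as a small Python structure. This is only a sketch for illustration and is not the plugin’s configuration format: the class and field names are hypothetical, as are the broker host `broker.example.com` and the client ID `pdi_publisher_01`. The defaults of 30 seconds and QoS 0 come from the dialog described above, and the check on QoS reflects the three levels (0, 1 and 2) that MQTT defines.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PublisherSettings:
    broker_url: str
    client_id: str
    topic: str = "pen/number/test"
    message_field: str = "stringMessage"
    connection_timeout_s: int = 30  # dialog default
    qos: int = 0                    # dialog default

    def __post_init__(self):
        if self.qos not in (0, 1, 2):
            raise ValueError("MQTT QoS must be 0, 1, or 2")
        if not self.client_id:
            raise ValueError("client ID must not be empty")


def duplicate_client_ids(settings_list):
    """Return client IDs that are used more than once against the same broker URL."""
    counts = Counter((s.broker_url, s.client_id) for s in settings_list)
    return sorted(client_id for (_, client_id), n in counts.items() if n > 1)


# Hypothetical broker host and client ID, for illustration only.
publisher = PublisherSettings(
    broker_url="tcp://broker.example.com:1883",
    client_id="pdi_publisher_01",
)
print(publisher)
print(duplicate_client_ids([publisher, publisher]))
```

The duplicate check mirrors the one rule in the dialog that is easy to break by accident: two clients sharing a Client ID on the same broker.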

For the MQTT Subscriber step used in this demonstration, we’ll need to configure two tabs of information in the step’s dialog.

  • To set up the MQTT Subscriber step, open the dialog for the step and start in the “General” tab.
  • As with the MQTT Publisher step, enter the broker’s URL – tcp://
  • For this example, use the default settings for the “Connection timeout” and “Keep alive interval” of 30 seconds and 60 seconds respectively.
  • Note that we use a different value (“MQTT_1234”) for the “Client ID”. This value must be unique for each instance connected to the same broker.
  • Continue to use the default “QoS” value of 0.
  • The “Execute for” setting is a timer value in seconds. Entering a value of 0 will make the step run indefinitely.
  • In the “Topics” tab, you can define the data type of the incoming message. Here we have set the data type to string.
  • The “Topic” can be set from the value of an incoming field, or by specifying a static value. In our example it is a static value, and is the same as the one used in the Publishing job, i.e. “pen/number/test”.

 Configuring the MQTT Subscriber step through its dialog.
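The “Execute for” setting is the one option whose behavior is easiest to show in code. The Python sketch below illustrates the described semantics only (0 means run indefinitely, any other value is a time limit in seconds); it is an assumption about the behavior, not the step’s actual implementation. The names `run_subscriber`, `poll` and `handle` are hypothetical, and the demo uses a fake clock so that it finishes instantly.

```python
import itertools
import time


def run_subscriber(poll, handle, execute_for_s=0, clock=time.monotonic, idle_sleep_s=0.05):
    """Pass messages from poll() to handle().

    execute_for_s == 0 runs indefinitely; any other value stops the loop
    once that many seconds have elapsed.
    """
    deadline = None if execute_for_s == 0 else clock() + execute_for_s
    handled = 0
    while deadline is None or clock() < deadline:
        message = poll()
        if message is None:
            time.sleep(idle_sleep_s)  # nothing waiting; avoid a busy loop
            continue
        handle(message)
        handled += 1
    return handled


# Demo with a fake clock that advances half a second per reading.
queue = ["m1", "m2", "m3", "m4", "m5"]
fake_clock = itertools.count(0, 0.5)
count = run_subscriber(
    poll=lambda: queue.pop(0) if queue else None,
    handle=print,
    execute_for_s=2,
    clock=lambda: next(fake_clock),
    idle_sleep_s=0,
)
print(count, "messages handled;", len(queue), "left unread when the timer expired")
```

With the timer set to 2 seconds, the loop stops while messages are still waiting, which is why the demonstration in this post uses 0 to keep the subscriber running.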

At this point, it doesn’t make much difference which task is started first. If the Subscriber task is started first, it won’t do much until the Publisher task is running. Once both tasks are executing, you should observe the data streams pouring into the Subscriber task.

This introduction and simple example of the new MQTT plugin for PDI opens up the world of IoT to the Pentaho user. Use cases for MQTT and Pentaho are already being explored within Hitachi, where a few groups with early access to this plugin are working on some cutting-edge IoT solutions. Stay tuned for more on this topic.


Ken Wood

VP of Pentaho Labs