Turning Your Data Lake into a Streamlined Data Refinery

It’s been over five years since Pentaho’s CTO, James Dixon, coined the now-ubiquitous term “data lake”. His metaphor contrasted bottled water, which is cleansed and packaged for easy consumption, with the natural state of the water source – unstructured, uncleansed, and unadulterated. The data lake represents the entire universe of available data before any transformation has been applied to it.

In a data lake, data isn’t forced into existing structures by giving it undue context, which could reduce its utility to your business. You can store data at low cost and you can process it at scale. PwC, Google, and Facebook agreed with his approach.

I won’t test my readers’ patience by further extending the metaphor; suffice it to say that James did not intend the data lake to be an end in itself, but only a means to an end. The data lake and its concomitant hardware, software, and skills are key elements of any agile business.

The world has changed, but it’s still recognizable

Everything still begins with the synthesis and analysis of data.

So let’s start by asking, “What kind of business challenges are data challenges?” The short and unsurprising answer is that they’re all data challenges. There may also be matters of operational change, business process, compliance, and so on, but it all still begins with knowing where you are today (Enterprise Data Warehouse, Data Mart), knowing the state of the world around you (externally available data such as social media), and sourcing the ever-richer set of data within your organisation (IoT) to make predictions, improve your operations, and design new products for your consumers.

Towards the Streamlined Data Refinery

One technique that we’ve been pioneering at Pentaho is the Streamlined Data Refinery, a way of providing real-time, scalable exploration capabilities on any dataset without requiring end users to write complex code or requiring you to design your data integration processes in advance.

It’s a way of turning your data lake into a structured set of information ready for examination by users – even if they aren’t IT pros, seasoned Java experts, or big data specialists. Think of it as automated cleansing and bottling of the entire lake without the development overhead of your Enterprise Data Warehouse.

The Streamlined Data Refinery also has some distinct advantages over other approaches to data lake analytics:

  • A highly interactive and high-performance user experience for exploration
  • An intuitive, guided interface that can be extended to large production user bases
  • An architected and governed process for on-demand data integration behind the scenes

Is this only for Big Data?

Of course not. This is for any data designed to be explored without the overhead of designing an Enterprise Data Warehouse.

Maybe you need to build Big Data capabilities. Maybe your business isn’t yet mature enough to manage complex data science. That does not mean that you can afford to ignore the realities taking shape around you. You might consider a small, low-risk project like enabling a new Hadoop cluster in the cloud and using it as a processing platform for data exploration.

With the Pentaho Streamlined Data Refinery, your time to value is shortened, costs are dramatically diminished, the skills gained are invaluable, and the capabilities to gain new insights are potentially transformative to your business.

If you’ll excuse the hyperbole, if you’re not looking at all of your data, you’re not looking at all of your business.

Try it for yourself

If you’re not a believer yet, I suggest you try out our online sample Streamlined Data Refinery application, BioMe (screenshot below).

[Screenshot: the BioMe application]

BioMe is a fictitious company that provides health and fitness monitoring devices to consumers to improve well-being. To reach new markets and monetize existing data assets, BioMe decided to aggregate and repackage its extensive biometric sensor data and deliver value-added analytics to research institutions and healthcare companies.

In this application, which is built on the Pentaho platform, you can both view high-level health trends by country and select custom data sets for drill-down analysis and visualization. The ability to dive deeper into biometric data on the fly is fueled by the Streamlined Data Refinery. Try it for yourself here: Pentaho BioMe App.

Wael Elrifai
Director of Enterprise Solutions & Big Data Guru at Pentaho

Dilbert image courtesy of the comic genius Scott Adams

