Predictive Analytics

The field of data science is growing rapidly. With access to more data sources than ever, data scientists can mine this data to predict consumer behavior, fraud, machine failure, and more. Pentaho Labs supports this work through the exploration of R, Weka, and Python.

Meet the Expert

Mark Hall
Predictive Analytics Expert

“As the field of data science continues to grow outside the world of research and statisticians, it is important for our team to arm developers with a wide range of programming languages.”

Mark holds a PhD in Machine Learning and is one of the original core developers of the Weka data mining software. He is an expert in predictive analytics and is responsible for leading Pentaho’s data mining solutions.

The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.

How Pentaho Supports R:

    • R is part of the Data Science Pack. Pentaho offers the R Script Executor for PDI, which allows an R script to be run as part of a Pentaho Data Integration transformation, removing the burden of data preparation.


Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes and is used to build predictive models.

How Pentaho Supports Weka:

    • Weka Scoring for PDI: This tool allows the user to “score” data as part of a PDI transformation by applying classification, clustering, and regression models constructed in Weka.
    • Weka Forecasting for PDI: This tool uses forecasting models created in Weka’s time series analysis and forecasting environment to generate future predictions from incoming data within a PDI transformation.
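
To make the idea of “scoring” concrete, here is a minimal sketch in plain Python. It does not use Weka or PDI, and every name in it (threshold_model, score_rows, the "amount" and "prediction" fields) is hypothetical. It only illustrates the general pattern: a model built ahead of time is applied to each incoming row, and the prediction is appended as a new field.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, float]


def threshold_model(row: Row) -> str:
    """Hypothetical stand-in for a previously trained classifier."""
    return "high_risk" if row["amount"] > 1000.0 else "low_risk"


def score_rows(
    rows: Iterable[Row],
    model: Callable[[Row], str],
    output_field: str = "prediction",
) -> Iterator[dict]:
    """Apply the model to each row and append its prediction as a new field."""
    for row in rows:
        scored = dict(row)  # copy so the incoming row is left unchanged
        scored[output_field] = model(row)
        yield scored


# Illustrative incoming data; the field name "amount" is an assumption.
incoming = [{"amount": 250.0}, {"amount": 4200.0}]
scored = list(score_rows(incoming, threshold_model))
for row in scored:
    print(row)
```

In a real transformation the model would come from Weka and the rows from an upstream step; the sketch only shows the shape of the operation.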

Check out how R and Weka fit into Pentaho Data Integration with the Data Science Pack.


Python is a flexible and powerful open-source programming language that lets developers work quickly, making it more efficient to build predictive models with big data sets and integrated systems. Python emphasizes code readability, allowing programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.

How Pentaho is Innovating with Python:

    • PDI natively supports scripts written in Python.
    • A new plug-in allows models written in Python to be pulled into transformations.
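
As a rough illustration of what a reusable model written in Python can look like, the sketch below fits a one-feature linear model, stores its parameters as JSON, and restores them later to make predictions. This is not the PDI plug-in's API; the function names and the JSON layout are assumptions chosen for the example, and only the Python standard library is used.

```python
import json
from typing import Dict, List


def fit_line(xs: List[float], ys: List[float]) -> Dict[str, float]:
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model: Dict[str, float], x: float) -> float:
    """Apply a fitted model to a single input value."""
    return model["intercept"] + model["slope"] * x


# Train once on illustrative data, then persist the parameters as text.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
serialized = json.dumps(model)

# Later (for example, inside a scripting step), restore the model and use it.
restored = json.loads(serialized)
predictions = [predict(restored, x) for x in [5.0, 6.0]]
print(predictions)
```

Separating training from prediction in this way is what lets a model be built once and then applied wherever new data arrives.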

Read the Pentaho Labs Python Native Integration press release, and check out Mark’s blog post: CPython Scripting in PDI.