What is Big Data?


The concept of big data arrived in executive suites via IT and engineering departments, and those technical roots have led to some disagreement about what the term means. At its core, big data refers to high volumes of information from multiple sources being fed into data stores on a time-sensitive basis. The shorthand: lots of data arriving rapidly and from many places. Big data is an important foundation of quality business intelligence, providing enough information to detect meaningful trends.

Just how “big” is this data? While most people are used to working with megabytes, gigabytes and terabytes, large companies or government agencies may have to manage petabytes (1,000 terabytes) or exabytes (1,000 petabytes). Big data management often requires multiple servers running in a distributed processing environment such as Hadoop. Recent advances in computing power and reduced storage costs have made big data technology affordable for more companies.
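To make those scales concrete, here is a minimal Python sketch of the decimal unit ladder described above; the unit names and the `to_bytes` helper are illustrative, not part of any product.

```python
# Illustrative sketch: how the common byte units scale, using the
# decimal (SI-style) convention from the text: each unit is 1,000
# of the previous one.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]

def to_bytes(value, unit):
    """Convert a value in the given unit to raw bytes."""
    return value * 1000 ** UNITS.index(unit)

# One petabyte is 1,000 terabytes, i.e. 10**15 bytes.
assert to_bytes(1, "petabytes") == to_bytes(1000, "terabytes") == 10**15
```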

The data in question is either structured or semi-structured/unstructured. The former is highly organized (often in rows and columns), as in a database table or spreadsheet. Most traditional databases, data integration tools and analytics technologies are suited to working with this type of information.

On the other hand, most human communication consists of unstructured data. E-mails, documents, photos and social media posts contain valuable information, but it hasn’t been classified, ordered or cleaned up. Data from log files, sensors and other machine sources also tends to be semi-structured or unstructured. These data types generally require more effort, and often different tools, to prepare for analysis.

Big data applications can store, integrate and present structured, semi-structured and unstructured data, providing a comprehensive view of the business environment for easier analysis and consumption.
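The “preparation” step mentioned above can be sketched in a few lines: a semi-structured web-server log line becomes a structured record with named columns. The sample line, pattern and field names below are purely illustrative.

```python
import re

# Hypothetical example: turning one semi-structured web-server log line
# into a structured record (the field names are illustrative).
LOG_LINE = '192.168.0.7 - - [12/Mar/2015:10:14:55] "GET /pricing HTTP/1.1" 200'

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def parse(line):
    """Extract named fields from a log line; returns None on no match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None

record = parse(LOG_LINE)
# record now maps column names to values, e.g. record["path"] == "/pricing",
# ready to load into a table alongside other parsed lines.
```

At scale the same idea is applied to millions of lines across a cluster, but the transformation per record is no more complicated than this.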


Consider the data that flows into companies of all sizes: customer transactions, payroll, invoices and inventory systems are just a few of the sources. The ideal big data setup allows valuable insights to be gleaned from the aggregate – a job that would be overwhelming for a human analyst.

The benefits of big data extend beyond corporate headquarters. Improving your operations also reduces pain points for customers, the people you’re in business to serve.

Big data initiatives can:

  • Provide near-real time updates on your supply chain, sales or customer sentiment
  • Reduce costs by identifying staffing inefficiencies, monitoring machine data for signs of impending failures or tracking compliance in heavily regulated industries
  • Gather large amounts of customer information and transform it into comprehensive persona profiles
  • Improve data security by identifying unusual behavior indicative of fraud or hacking

Allowing this data to simply take up space on your servers may be an expensive mistake – especially if competitors have already optimized their big data environments.

It’s a safe assumption that data volumes will continue to increase as appliances and manufacturing equipment join the mobile web as the “Internet of Things.” Big data isn’t a fad – it’s an essential component in understanding what your business is telling you.

Pentaho’s Big Data Analytics platform allows you to discover important trends in an intuitive dashboard environment. Advanced statistical analysis yields interactive reports and visualizations that are easy to understand and share. Machines can crunch numbers – Pentaho knows that your time is better spent reviewing results and planning for the future.


A simplified cybersecurity analysis solution built on big data orchestration and streamlined data transformations allows end users such as forensic analysts, cybersecurity analysts and data scientists to detect threats faster. Learn more about the Cybersecurity Analytics use case.


Reduce strain on the data warehouse by offloading less frequently used data and corresponding transformation workloads to Hadoop without coding or relying on legacy scripts and ETL product limitations. Learn more about the Optimize the Data Warehouse use case.


Blend, enrich and refine any data source into secure, analytic data sets, on demand with a Streamlined Data Refinery. Using Hadoop as a big data processing hub, Pentaho Data Integration processes and refines specific data sets. With a single click, data sets are automatically modeled, published and delivered to users for immediate visual analytics. Learn more about the Streamlined Data Refinery use case.


Blend operational data sources together with big data sources to create an on-demand analytical view across key customer touch points. Gain powerful insights into buyers, brand, products and services. Learn more about the Customer 360-Degree View use case.


Capitalize on the variety and volume of your big data, allowing easy access, enrichment and de-identification of data sets that can be packaged as a new data service offering for your customers. Learn more about the Monetize My Data use case.


Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. Learn how Pentaho works with Hadoop.
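Hadoop’s processing model, MapReduce, can be illustrated with a toy single-machine sketch: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is only a didactic sketch – a real Hadoop cluster distributes these same steps across many nodes.

```python
from collections import defaultdict

# Toy illustration of the MapReduce model Hadoop popularized,
# using the classic word-count example.
def map_step(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: aggregate each key's values into a single result."""
    return key, sum(values)

documents = ["big data big insight", "big data"]
pairs = [pair for doc in documents for pair in map_step(doc)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 3, 'data': 2, 'insight': 1}
```

Because each map call and each reduce call is independent, the framework can run them in parallel on whichever node already holds the data – the core idea behind Hadoop’s “massive storage plus faster processing.”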


A NoSQL database provides a way to store and retrieve data that is modeled in means other than the tabular relations used in relational databases. The data structures used by NoSQL databases (ex: key-value, graph, or document) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. Learn how Pentaho works with NoSQL Databases.
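The difference between the models can be sketched in a few lines. Here a document-style record nests variable-length fields that a fixed relational row cannot hold, and a key-value lookup retrieves it in one step; all names and data are illustrative.

```python
# Illustrative contrast between relational and NoSQL data models.
# A relational row is a flat tuple with a fixed set of columns:
relational_row = ("cust-042", "Ada Lovelace", "ada@example.com")

# A document-model record (as in a document store) can nest and vary:
document = {
    "_id": "cust-042",
    "name": "Ada Lovelace",
    "emails": ["ada@example.com", "ada@work.example"],  # variable-length
    "preferences": {"newsletter": True},                # nested structure
}

# A key-value store reduces access to a single lookup by key:
store = {}
store[document["_id"]] = document
assert store["cust-042"]["name"] == "Ada Lovelace"
```

This is why some operations (whole-record reads and writes by key) are faster in NoSQL stores, while others (joins, ad hoc queries across rows) favor relational databases.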


An analytic database is a type of database built to store business, market or project data used in business analysis, projections and forecasting processes. It is designed to be used specifically with big data analytics and business intelligence solutions. Learn how Pentaho works with Analytic Databases.


Amazon Redshift is a hosted data warehouse product and part of the larger Amazon Web Services cloud computing platform. Redshift differs from Amazon's other hosted database offering, Amazon RDS, in its ability to handle analytics workloads on large-scale datasets stored using a column-oriented DBMS design. Learn how Pentaho works with Amazon Redshift.
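Column orientation can be illustrated with a toy sketch: an analytic aggregate reads one column’s values contiguously instead of scanning every full row. The table and figures below are made up for illustration.

```python
# Toy contrast between row-oriented and column-oriented layouts.
rows = [  # row store: each record is kept together
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 10.0},
]

columns = {  # column store: each field is kept together
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 10.0],
}

# An analytic query such as SUM(amount) touches one contiguous column
# rather than every full row - the access pattern a column-oriented
# warehouse like Redshift is built to optimize.
total = sum(columns["amount"])
assert total == sum(r["amount"] for r in rows) == 205.5
```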

Helpful Resources

Choosing the best big data solution can be complicated. Check out our library of helpful content, including case studies, whitepapers, webinars and demos.

See all Big Data Resources >

Related Topics

What is Business Analytics?

What is Data Integration?

What is Governed Data Delivery?