Educational Hub

What is Apache Spark?

Supercharged Data Retrieval for Complex Queries

Apache Spark is an open-source cluster computing framework for rapid data processing. It was created to better handle iterative queries over large datasets. Spark can process data in memory or on disk, and it executes both batch and real-time workloads.

Part of the “secret” to Spark’s success lies in how it accesses information. In-memory computing keeps data in RAM, allowing far quicker retrieval than reading from disk. Even so, Spark remains among the fastest engines for disk-based processing, and it holds the world record for large-scale on-disk sorting.

Spark addresses the bottlenecks created by Hadoop’s reliance on single-pass MapReduce jobs. That approach forces developers to chain a series of jobs to complete an iterative analytics query, and because each job must write its results to disk before the next can read them, Hadoop introduces significant lag.

Spark’s ability to repeatedly tap into RAM and cache intermediate results means it can break a query into multiple computations and process them in parallel without rereading the source data. This makes it ideal for big data analytics and machine-learning applications.
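The difference is easy to see in a small sketch. The code below is plain Python, not real Spark code — the names and the simulated "disk read" are purely illustrative — but it shows why a chain of jobs that each re-read their input (the MapReduce pattern) costs more I/O than loading the data once and reusing it from memory (the Spark pattern):

```python
# Conceptual sketch (plain Python, NOT Spark APIs): iterative jobs that
# re-read input from disk vs. jobs that reuse a cached in-memory dataset.

disk_reads = 0

def read_from_disk():
    """Simulated expensive read of the source data from disk."""
    global disk_reads
    disk_reads += 1
    return range(10)

# MapReduce-style chain: each of the 3 jobs re-reads the input.
for _ in range(3):
    total = sum(x * x for x in read_from_disk())

reads_without_cache = disk_reads  # 3 disk reads for 3 passes

# Spark-style: read once, keep the data in RAM, run the same 3 passes.
disk_reads = 0
cached = list(read_from_disk())   # loosely analogous to caching an RDD
for _ in range(3):
    total = sum(x * x for x in cached)

reads_with_cache = disk_reads     # 1 disk read for 3 passes
```

Real Spark workloads show the same shape: once a dataset is cached, each additional iteration of an analytics or machine-learning algorithm touches only memory.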

Information is organized in a Resilient Distributed Dataset (RDD): an immutable collection of elements partitioned across the nodes of a cluster. Operations are performed on RDDs to filter data or compute new results — transformations derive new datasets, while actions return final values.
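The RDD programming model can be sketched with Python’s built-in functional operations standing in for Spark’s distributed ones — a conceptual illustration only, not PySpark code, though a real PySpark pipeline would chain the same-named methods (`rdd.filter(...).map(...).reduce(...)`):

```python
# A minimal sketch of the RDD model using plain Python built-ins in
# place of Spark's distributed operations. The data here is a stand-in
# for an RDD's elements spread across a cluster.
from functools import reduce

values = [1, 2, 3, 4, 5, 6]

evens = filter(lambda x: x % 2 == 0, values)    # transformation: filter
squared = map(lambda x: x * x, evens)           # transformation: map
total = reduce(lambda a, b: a + b, squared)     # action: reduce

# total is 4 + 16 + 36 = 56
```

In Spark, the transformations are lazy — nothing is computed until an action such as `reduce` or `count` requests a result, which lets the engine plan and parallelize the whole pipeline at once.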

Spark ships with four higher-level libraries: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph processing). Together, they provide the power necessary to work with real-time data flows and complex algorithms.

Meeting the Demands of Bigger Data

Apache Spark has grown in popularity because of the simplicity, flexibility and speed it provides. All three of these qualities are strong selling points in a big data ecosystem that demands fast insights.

Developers can write applications in Java, Scala or Python. Spark can run on a standalone cluster or on top of an existing resource manager like Hadoop YARN. It works with a diverse list of storage technologies, including HDFS, Cassandra and HBase.

In addition, Spark handles memory limitations gracefully by spilling data to disk when RAM fills up. In benchmarks, Spark has performed tasks up to 100 times faster than Hadoop MapReduce when working in memory and 10 times faster on disk. Spark is a natural choice for organizations that require high-performance data processing tools.

Pentaho Data Integration with Apache Spark brings the open-source processing engine’s abilities within reach of more users by simplifying the orchestration of Spark jobs. PDI’s native Spark support provides an intuitive interface for running high-speed data processing tasks. Pentaho is committed to improving the ease of use of emerging big data technologies.

Helpful Resources

Choosing the best business analytics solution can be complicated. Check out our library of helpful content, including case studies, white papers, webinars and demos.

See all Resources >



Related Topics

What is Big Data?

What is Data Integration?

What is Governed Data Delivery?