What is Hadoop?

USING DISTRIBUTED COMPUTING TO DIVIDE AND CONQUER BIG DATA

Apache Hadoop is an open source software framework that breaks large data sets into blocks and distributes them across multiple servers for storage and processing. Hadoop’s strength comes from a server network – known as a Hadoop cluster – that can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free open source Hadoop project, and commercial distributions are also widely available.

By its very nature, big data stretches computing power in a way traditional architectures may not be equipped to handle. Hadoop offers a two-pronged information management solution encompassing both storage and processing.

The Hadoop cluster improves storage capacity and file redundancy by breaking up data and distributing it across multiple servers. Faster processing comes courtesy of several technology components, including MapReduce, a programming model that runs the same job on each segment of data in parallel and then combines the partial answers into the final result.
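To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python that counts words. It does not use Hadoop's own API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample segments are illustrative assumptions. On a real cluster, each segment would be mapped on a different node and the framework would handle the grouping step.

```python
from collections import defaultdict


def map_phase(segment):
    """Emit a (word, 1) pair for every word in one segment of text."""
    return [(word.lower(), 1) for word in segment.split()]


def shuffle(mapped_pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's partial values into one final answer."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(segments):
    mapped = []
    # Here the segments are processed one after another; a cluster
    # would run map_phase on many segments at the same time.
    for segment in segments:
        mapped.extend(map_phase(segment))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: three segments of one larger data set.
segments = ["big data big cluster", "data node", "big node"]
counts = word_count(segments)
print(counts)
```

The key design point is that `map_phase` only ever looks at one segment, so segments can be processed independently, and only the small intermediate pairs need to be brought together for the final combination.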

Time-sensitive insights are only gained when business analytics software can quickly return understandable reports and models. In some circumstances, it would take significantly longer to complete advanced statistical analysis and modeling without Hadoop to effectively shoulder the workload.

Hadoop is rarely visible to business users. Instead, it works in the background to store and process the information required by business analytics software to do its job. Even if this is the first you’ve heard of Hadoop and MapReduce, you may have already benefited from both.

REAL PROCESSING POWER FROM AN OFF-THE-SHELF INFRASTRUCTURE

Information is a critical resource that can grow at an exponential rate. Gigabytes become terabytes, terabytes become petabytes, and petabytes can expand into exabytes. All this data needs to be stored somewhere for quick recall and processing. Hadoop is a form of resource management for the data-driven company.

HADOOP LIKES ALL KINDS OF DATA

One of Hadoop’s main selling points is its flexibility. Any data format can be stored in the Hadoop framework – including unstructured information from phone calls, emails or social media comments. No matter what queries you eventually run, all your data lives in one repository.

HIGH PERFORMANCE AT A LOW COST

In addition, Hadoop works well on low-cost commodity servers, eliminating the need for racks of high-end machines to process data. This also makes it easy to increase performance by simply adding another server (known as a node). Distributed processing has an additional advantage. Should a machine unexpectedly go offline, Hadoop automatically assigns its data load to the other servers.

YOUR DATA ISN’T GOING ANYWHERE

Data compartmentalization doesn’t end there. The Hadoop Distributed File System (HDFS) guards against the loss of information by storing duplicate copies of each file’s blocks on other servers. If a server fails or a block is damaged, a clean copy can be read from one of the replicas.
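The replication idea can be sketched in a few lines of Python. This is a simplified model, not HDFS code: the helper names, the 8-byte block size, the four node names and the round-robin placement are all assumptions chosen for illustration. Real HDFS uses far larger blocks (128 MB by default in recent versions), a default replication factor of 3, and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def read_file(blocks, placement, live_nodes):
    """Reassemble the file, requiring one live replica for every block."""
    parts = []
    for block_id, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[block_id]):
            raise IOError(f"block {block_id} has no live replica")
        parts.append(block)
    return b"".join(parts)


# Hypothetical file and cluster.
data = b"hadoop stores every block on several servers"
blocks = split_into_blocks(data, block_size=8)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)

# Simulate one server going offline; every block still has a replica elsewhere.
live_nodes = set(nodes) - {"node2"}
recovered = read_file(blocks, placement, live_nodes)
print(recovered == data)
```

Because every block lives on three different nodes in this sketch, losing any single node (or any two) still leaves at least one readable copy of each block.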
