Big Data is complex. The technologies in Big Data are rapidly maturing, but are still in many ways in an adolescent phase. While Hadoop is dominating the charts for Big Data technologies, in the recent years we have seen a variety of technologies born out of the early starters in this space- such as Google, Yahoo, Facebook and Cloudera. To name a few:
- MapReduce: Programming model in Java for parallel processing of large data sets in Hadoop clusters
- Pig: A high-level scripting language to create data flows from and to Hadoop
- Hive: SQL-like access for data in Hadoop
- Impala: SQL query engine that runs inside Hadoop for faster query response times
It’s clear, the spectrum of interaction and interfacing with Hadoop has matured beyond pure programming in Java into abstraction layers that look and feel like SQL. Much of this is due to the lack of resources and talent in big data – and therefore the mantra of “the more we make Big Data feel like structured data, the better adoption it will gain.”
But wait, not so fast---->you can make Hadoop act like a SQL data store. However, there are consequences, as Chris Deptula from OpenBI explains in his blog, A Cautionary Tale for Becoming too Reliant on Hive. You are forgoing flexibility and speed if you choose Hive for a more complex query as opposed to pure programming or using a visual interface to MapReduce.
This goes to show that there are numerous areas of advancements in Hadoop that have yet to be achieved – in this case better performance optimization in Hive. I come from a relational world – namely DB2 – where we spent a tremendous amount of time making this high-performance transactional database – that was developed in the 70’s - even more powerful in the 2000s, and that journey continues today.
Granted, the rate of innovation is much faster today than it was 10, 20, 30 years ago, but we are not yet at the finish line with Hadoop. We need to understand the realities of what Hadoop can and cannot do today, while we forge ahead with big data innovation.
Here are a few areas of opportunity for innovation in Hadoop and strategies to fill the gap:
- High-Performance Analytics: Hadoop was never built to be a high-performance data interaction platform. Although there are newer technologies that are cracking the nut on real-time access and interactivity with Hadoop, fast analytics still need multi-dimensional cubes, in-memory and caching technology, analytic databases or a combination of them.
- Security: There are security risks within Hadoop. It would not be in your best interest to open the gates for all users to access information within Hadoop. Until this gap is closed further, a data access layer can help you extract just the right data out of Hadoop for interaction.
- APIs: Business applications have lived a long time on relational data sources. However with web, mobile and social applications, there is a need to read, write and update data in NoSQL data stores such as Hadoop. Instead of direct programming, APIs can simplify this effort for millions of developers who are building the next generation of applications.
- Data Integration, Enrichment, Quality Control and Movement: While Hadoop stands strong in storing massive amounts of unstructured / semi-structured data, it is not the only infrastructure in place in today’s data management environments. Therefore, easy integration with other data sources is critical for a long-term success.
The road to success with Hadoop is full of opportunities and obstacles and it is important to understand what is possible today and what to expect next. With all the hype around big data, it is easy to expect Hadoop to do anything and everything. However, successful companies are those that choose combination of technologies that works best for them.
What are your Hadoop expectations?
- Farnaz Erfan, Product Marketing, Pentaho