Implementing Hadoop: 7 Common Mistakes and How to Avoid Them

Introduction

It’s no secret that Hadoop comes with inherent challenges. Business needs, specialized skills, data integration, and budget are just a few things that factor into planning and implementation. Yet, despite all the due diligence, a large percentage of Hadoop implementations fail. We’d like to turn that around.

With the goal of helping organizations achieve business value from Hadoop, we sat down with members of Pentaho’s consulting services and enterprise support teams to discuss their experiences with helping organizations to develop, design and implement complex big data, business analytics or embedded analytics initiatives. 

We identified 7 common mistakes made by executives and IT teams as they go through the planning and implementation process. We’ve grouped the mistakes into two parts: 1) tactical (for developers and engineers), and 2) strategic (for architects and executives).

Common Tactical Mistakes

Mistake #1: “Migrate everything before devising a plan”

Let’s say that you’ve determined that your current architecture is not equipped to process big data effectively, management is open to adopting Hadoop, and you’re excited to get started. Wonderful!

But, don’t just dive in without a plan. Migrating everything without a clear strategy will only create long-term issues resulting in expensive ongoing maintenance. With first-time Hadoop implementations, you can expect a lot of error messages and a steep learning curve. Dysfunction, unfortunately, is a natural byproduct of the Hadoop ecosystem...unless you have expert guidance.

Successful implementation starts by identifying a business use case. Consider every phase of the process – from data ingestion to data transformation to analytics consumption, and even beyond to other applications and systems where analytics must be embedded. It also means clearly determining how Hadoop and big data will create value for your business. Taking an end-to-end, holistic view of the data pipeline, prior to implementation, will help promote project success and enhanced IT collaboration with the business.

Our advice: Maximize your learning in the least amount of time by taking a holistic approach and starting with smaller test cases.  Like craft beer, good things come in small batches.

Check out our eBook: Hadoop and the Analytic Data Pipeline for more on creating a holistic approach.

Mistake #2: “Assume the same skillsets for managing a traditional relational database are transferable to Hadoop”

Believing you can do everything with Hadoop the way you do things with relational databases is a common mistake made by business people who are implementing Hadoop for the first time.  Like taking the “red pill” in the movie The Matrix, once you enter the new world, you can’t do things the same way.

Hadoop is built around a distributed file system; it is not a traditional relational database management system (RDBMS). Hadoop allows IT teams to effectively store, process and distribute structured and unstructured data in sizes and types that relational databases can’t typically handle.  It offers massive scale in processing power and storage by leveraging multiple nodes of commodity hardware to crunch data in parallel.  Because Hadoop doesn’t function in the same way as a relational database, you cannot expect to simply migrate all your data and manage it in the same way, nor can you expect skillsets to be easily transferable between the two.

On the other hand, if your current team happens to lack certain skills or familiarity with Hadoop, it doesn’t mean that hiring a new group of people is inevitable. Every situation is different, and there are several options to consider.  For example, training existing employees in addition to augmenting staff might be a good option.  Filling skills gaps with point solutions may suffice in some instances, but for growing organizations looking to scale, leveraging an end-to-end data platform that is accessible to a broad base of relevant users may be the best solution in the long run.

Our advice: While Hadoop does present IT organizations with skills and integration challenges, it’s important to look for a solution with the right combination of people, agility, and power to make you successful.

Check out our Hadoop Starter Kit for additional resources.

Mistake #3: “Treating a data lake on Hadoop like a regular database”

A major misconception is that you can treat a data lake on Hadoop just like a regular database.  While Hadoop is powerful, it’s not structured the same way as an Oracle, HP Vertica, or a Teradata database, for example.  Similarly, it was not designed to store anything you’d normally put on Dropbox or Google Drive.  A good rule of thumb for this scenario is: if it can fit on your desktop or laptop, it probably doesn’t belong on Hadoop.

In a data lake, the data is there, but because it has not been partitioned, optimized, and so on, you can’t just use it out of the box.  A data lake is like a box of Legos: you have what you need to build a Star Wars figurine, but it’s not a figurine out of the box.  People think that a data lake is clear and easy to navigate, like a pristine mountain lake where you can see everything in it.  But in reality, many people end up building a lake that’s 3 miles wide, 2 inches deep, and full of mud; you realize it’s not what you thought you were building.

As your organization scales up data onboarding from just a few sources going into Hadoop to hundreds or more, IT time and resources can be monopolized, creating hundreds of hard-coded data movement procedures – and the process is often highly manual and error-prone.  A properly developed data lake will:

  • Reduce IT time and cost spent building and maintaining repetitive big data ingestion jobs, allowing valuable staff to dedicate time to more strategic projects
  • Minimize the risk of manual errors by decreasing dependence on hard-coded data ingestion
  • Automate business processes for efficiency and speed, while maintaining data governance
  • Enable more sophisticated analysis by business users with new and emerging data sources
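
To make the contrast with hard-coded ingestion concrete, here is a minimal Python sketch of the template-driven idea: one shared template is expanded from a list of source descriptions, so onboarding a new source means adding metadata rather than writing another procedure. Everything in it (the `SourceSpec` fields, the `build_ingestion_plan` helper, the paths and the `load_date` partition column) is hypothetical and for illustration only; it is not the API of any particular tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SourceSpec:
    """Metadata describing one source to onboard (all fields are hypothetical)."""
    name: str
    delimiter: str
    columns: Tuple[str, ...]
    target_dir: str


def build_ingestion_plan(spec: SourceSpec) -> dict:
    """Expand the single shared template into a concrete job description."""
    return {
        "job": f"ingest_{spec.name}",
        "read": {"delimiter": spec.delimiter, "columns": list(spec.columns)},
        # Assumption: every source lands under its own directory, partitioned by load date.
        "write": {"path": f"{spec.target_dir}/{spec.name}", "partition_by": "load_date"},
    }


# Adding a source is a one-line metadata change, not a new hand-written job.
SOURCES = [
    SourceSpec("orders", ",", ("order_id", "customer_id", "amount"), "/data/lake/raw"),
    SourceSpec("clicks", "\t", ("session_id", "url", "ts"), "/data/lake/raw"),
]

plans = [build_ingestion_plan(s) for s in SOURCES]
for plan in plans:
    print(plan["job"], "->", plan["write"]["path"])
```

Because every job comes from the same template, a fix to the template reaches all sources at once, which is where the reduction in manual errors comes from.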

Our advice:  Take the proper steps up front to understand how best to ingest data so that you end up with a working data lake. Otherwise, you’ll end up with a data lake that’s more of a data swamp. Everything will be there, but you won’t be able to derive any value from it.  

Learn how Pentaho’s Metadata Injection helps organizations accelerate productivity and reduce risk in complex data onboarding projects by dynamically scaling out from one template to hundreds of actual transformations.

Mistake #4: “I can figure out security later”

For most enterprises, protecting sensitive data is top-of-mind, especially after recent headlines about high profile data breaches. And if you’re considering using any sort of big data solution in your enterprise, keep in mind that you’ll be processing data that’s sensitive to your business, your customers and your partners.  You know security is important in the long run, but is it important to consider it before you deploy? Absolutely!

You should never, ever, expose the credit card and bank account information, social security numbers, proprietary corporate information and personally identifiable information of your clients, customers and employees. Protection starts with planning ahead, not after deployment.

Our advice: Address each of the following security solutions before you deploy a big data project:

  • Authentication: Verify the identity of every user and service before granting access to a cluster
  • Authorization: Control what actions users can take once they’re in a cluster
  • Audit and tracking: Track and log all actions by each user as a matter of record
  • Compliant data protection: Utilize industry standard data encryption methods in compliance with applicable regulations

  • Automation: Prepare, blend, report and send alerts based on a variety of data in Hadoop
  • Predictive analytics: Integrate predictive analytics for near real-time behavioral analytics
  • Best practices: Blend data from applications, networks and servers as well as mobile, cloud, and IoT data
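
As a rough illustration of how the authentication, authorization, and audit items above fit together, the following Python sketch checks a user’s role before an action and records every attempt. It is a toy model under stated assumptions: the role table, user names, and `authorize` function are hypothetical, and a real deployment would rely on the cluster’s own security services rather than in-memory dictionaries.

```python
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}

# Stand-in for authentication: only users listed here are known at all.
USER_ROLES = {"alice": "engineer", "bob": "analyst"}

# In-memory audit trail; a real system would write to durable, tamper-resistant storage.
audit_log = []


def authorize(user: str, action: str, dataset: str) -> bool:
    """Return True if the user may perform the action, and log every attempt."""
    role = USER_ROLES.get(user)  # None means the user could not be identified
    allowed = role is not None and action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


print(authorize("alice", "write", "payments"))   # engineer role includes write
print(authorize("bob", "write", "payments"))     # analyst role is read-only
print(authorize("mallory", "read", "payments"))  # unknown user is rejected
```

Note that denied attempts are logged as well as successful ones; an audit trail that only records successes hides exactly the events you most want to see.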

Stay ahead of the curve. Watch this cyber security video to learn more.

Common Strategic Mistakes

Mistake #1: “The HiPPO knows best. No strategic inquiry necessary”

HiPPO is an acronym for the "highest paid person's opinion" or the "highest paid person in the office." The idea is that HiPPOs are so self-assured that they tend to dismiss any data or the input of lower-paid employees that disagree with the correctness of their intuitions. Trusting one’s gut rather than data may work occasionally, but Hadoop is complex and requires strategic inquiry to fully understand the nuances of when, where, and why to use it.

To start, it’s important to understand what you’re trying to achieve from a business perspective, who will benefit from this investment, and how the spend will be justified. In fact, most big data projects fail because the business value is not being achieved.

The true business value of Hadoop is determined by the nature of your data problem. Do you have a current or future need for big data? A desire for self-service data preparation? Or a need to embed analytics into your portals or applications? Are you spending most of your time preparing data, as opposed to visualizing it?

Once a data problem has been established, the next step is to determine whether or not your current architecture will help you achieve your big data goals.  If exposure to open source or unsupported code is a concern, it may be time to explore commercial options with the support and security you need. The same can be said if you plan to embed software within your company for company-wide analytics that enable users to get what they need without having to learn and switch to another application or ask IT for help all the time.  

Our advice: As Teddy Roosevelt once said, "The best executive is the one who has sense enough to pick good men to do what he wants done, and self-restraint enough to keep from meddling with them while they do it." You hired talented people for a reason; listen to them.  Once a business need for big data has been established, determine who will benefit from the investment, how it will impact your infrastructure, and how the spend will be justified.  Also, try to avoid science projects; they tend to become technical exercises with limited business value. 

Learn more about getting started with big data.

Mistake #2: “Bridge the skills gap with traditional ETL processes” 

Assessing the depth and potential challenges associated with the dreaded “skills gap” is a common stumbling block for many organizations considering how to tackle the ETL challenges associated with big data.  Implementing Hadoop requires highly specific expertise, but there just aren’t enough IT pros with Hadoop skills to go around. On the other hand, some programmers proficient in Java, Python, and HiveQL, for example, may lack the experience to optimize performance on relational databases. This scenario may be problematic when Hadoop and MapReduce are used for large scale traditional data management workloads such as ETL.

Some emerging point solutions are designed to address the skills gap, but they tend to support experienced users rather than elevate the skills of those who need it most.  If you’re dealing with smaller data sets, you may consider hiring people who’ve had the proper training on either end, or work with experts to train and guide your staff through implementation. But if you’re dealing with extremely large amounts of data – hundreds of terabytes of data, for instance – then you’ll likely need an enterprise-class ETL tool as part of a comprehensive business analytics platform, like Pentaho. Pentaho took a general-purpose ETL framework and made it so that it runs natively in the context of a Hadoop grid or cluster from prominent distributors such as Cloudera, MapR, Hortonworks, Amazon and others.

Our advice: Technology only gets you so far.  People, experience, and best practices are the most important drivers for project success with Hadoop.  When considering an expert or a team of experts as permanent hires or consultants, you’ll want to consider their experience with “traditional” as well as big data integration, the size and complexity of the projects they’ve worked on, the companies they’ve worked with, and the number of successful implementations they’ve done.  When dealing with very large volumes of data, it may be time to evaluate a comprehensive business analytics platform that’s designed to operationalize and simplify Hadoop implementations.

At Pentaho, we have over 150 certified consultants and the combined expertise of Pentaho as well as Hitachi big data professionals to provide comprehensive support, from figuring out what you need to ensuring a smooth implementation and ongoing success.  

Learn more about Pentaho consulting services

Mistake #3: “I can have a small budget and get enterprise-level value”

The low-cost scalability of Hadoop is one reason why organizations decide to use it.  But many organizations fail to factor in data replication and compression (storage space), skilled resources, and the overall management of big data integration with their existing ecosystem.

Remember, Hadoop was built to process a wide variety of enormous data files that continue to grow quickly. And once data is ingested, it gets replicated!  For example, if you have 3TB you want to bring in, that will immediately require 9TB of storage space, because Hadoop keeps three copies of the data by default. (This built-in replication supports the fault tolerance and parallel processing that make Hadoop so powerful.)  

So, it’s absolutely essential to do proper sizing up front.  This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels.  While you can compress data, it’s important to note that compression affects performance, so it needs to be balanced against performance expectations for reading and writing data.  Also, storing the data may cost 3x what you initially planned.  

Big data does offer big business advantages, but unexpected costs and complexity will present challenges for organizations that don’t properly plan and prepare. You must figure out how you’re going to balance data growth rates with the cost of scale prior to implementation.
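
The sizing arithmetic above is easy to capture in a few lines. The Python sketch below reproduces the 3TB-to-9TB example using a replication factor of three (the HDFS default) and adds two optional inputs, a compression ratio and a monthly growth rate. The function names and the sample values for compression and growth are assumptions chosen for illustration, not measured figures.

```python
def raw_storage_tb(data_tb: float, replication: int = 3,
                   compression_ratio: float = 1.0) -> float:
    """Cluster storage needed for data_tb of source data.

    compression_ratio is original size divided by compressed size
    (1.0 means no compression); the value you achieve depends on your data.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return data_tb * replication / compression_ratio


def projected_storage_tb(data_tb: float, monthly_growth: float, months: int,
                         replication: int = 3,
                         compression_ratio: float = 1.0) -> float:
    """Storage needed after compound monthly growth (growth rate is an assumption)."""
    grown_tb = data_tb * (1 + monthly_growth) ** months
    return raw_storage_tb(grown_tb, replication, compression_ratio)


# The example from the text: 3TB of data with three copies.
print(raw_storage_tb(3))  # 9.0

# Hypothetical what-ifs: 2:1 compression, and 5% monthly growth over a year.
print(raw_storage_tb(3, compression_ratio=2.0))
print(round(projected_storage_tb(3, monthly_growth=0.05, months=12), 2))
```

Running the projection for a few growth rates before implementation gives a quick sense of when the cluster will need to scale, which is the balance between data growth and cost described above.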

Our advice: Understand how storage, resources, growth rates, and management of big data will factor into your existing ecosystem before you implement.

Learn more about how to operationalize Hadoop.

About Pentaho:

Pentaho is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse (big) data deployments. Pentaho’s unified data integration and analytics platform is comprehensive, completely embeddable and delivers governed data to power any analytics in any environment.
