Did You Jump on the Hadoop Bandwagon? Did You Do it for the Right Reasons?
I recently saw a statistic that only 15% of Hadoop projects are in production. So that would mean that 85% of the Hadoop projects are not yet in production, right? Hmmm…could that be because Hadoop is not the right choice for all data problems?
When the only tool you have is a hammer, everything looks like a nail.
As a builder of data warehouses I have been confounded by the proliferation of Hadoop and its ancillary “big data” products, into the Data Warehouse ecosystem. Don’t get me wrong, Hadoop does have its place for certain use cases. And there are environments with such high talent density and capacity that they can handle the complexity without blinking an eye, for example Workday or Netflix.
Unfortunately, many people don’t get that Hadoop comes with a price. A fairly steep price. Yes, the software may be all or mostly open source, so your licensing costs will be low, and it can run on commodity machines, but the complexity and the number of tools and skills your people will have to learn and integrate into your existing environment, or you will have to pay to get (and integrate), can be EXPENSIVE. Especially if your businesses core focus does not require a lot of Java engineers.
In 20 years of database and data warehouse experience I have never needed to know Java. Not once. In 2006, I did a tutorial and learned the basics of Java. Then last year, I took a Hadoop class. Guess what, the primary programming language used with Hadoop is Java. So I did another whole set of online Java courses. Nothing wrong with Java tons of systems and applications are written in Java. Just not that many data warehouse professionals have had to learn it to accomplish their jobs.
Hadoop Responsiveness and Complexity:
But wait, you might say, there are constructs and indeed a whole ecosystem built on top of Hadoop to let you use SQL language. Sure, but at what cost and what complexity? And what about the time lag? The fundamental architecture of Hadoop and the Hadoop Distributed File System (HDFS) is divide and conquer. IF I have petabytes of data that I need to analyze, frequently this can be a good strategy. The user should not expect instant results. They submit their query to Hive which then translates the query to Java and submits it to the master machine which sends the query out to all of the slave machines, each to process a portion of the data, the process results are aggregated and then sent back to the master to send back to Hive to send back to the user. Hadoop was created for the problem of big data for less money, it is not trying to be responsive nor is it designed for responsiveness with small data sets.
So how is it that so many projects are trying to move to Hadoop?
I have two theories:
1. The Bandwagon effect or simply “peer pressure”:
Back when Hadoop and HDFS first justifiably started getting positive publicity. Someone senior, perhaps the CIO, reads an article extolling the virtues of Hadoop. He was curious so he asked about it with his direct reports, say at the director level. The directors either had or had not yet heard about Hadoop but not wanting to look bad, and perhaps interpreting curiosity as interest and intention, start researching and discussing with their direct reports. Before long the scuttlebutt around the water cooler was that the CIO wants to do Hadoop and if you want to look good, get promoted and get a bonus, you should be doing Hadoop. Or at least putting it in your budget and planning for it.
Then word got out, that ABC company was doing Hadoop projects and company XYZ’s CIO didn’t want to fall behind the competition, so he started talking to his direct reports and the whole thing repeated itself again.
At both companies there may have been people who questioned this wisdom, who knew or suspected that Hadoop was not the right tool, but like the story of the emperor’s cloths, we don’t want to be the only one who doesn’t see it. We don’t want to look stupid.
And there is a corollary concerning the story of the emperor’s clothes, it says that the kid who pointed out that the emperor was naked, did not win the Emperor’s royal scholarship that year, or any year thereafter.
2. Hot new tool (buzz word) effect:
I did hear one other explanation for why it is that some companies might be doing Hadoop projects even though their use case might not fit the “big data” profile… they do Hadoop to keep their good people. The theory is that if you don’t let your people work with the hot new tools and add the buzz words to their resumes, they will go somewhere else that will let them do just that. I’ve been ruminating on that one for a while. At first I accepted it as perhaps a necessary inefficiency for keeping good people. After thinking about it for a while and discussing it with a variety of friends and acquaintances in the industry, I’ve come to the conclusion that there needs to be some other way to engage your people to design, architect and implement an appropriate efficient solution with a minimal amount of waste.
An example from the field: working with a client recently I was the data architect on an outsourced project converting an end-user-created manual legacy process to an engineered Informatica and Oracle implementation that could be supported by the IT department. About half way through the project, one of the corporate architects came by and asked, “Why aren’t you doing this in Hadoop?” The senior ETL architect and I looked at each other, then looked at him, a little dumbfounded… Um, because we are only processing 20 gigabytes of data once a month?
An Alternative to Hadoop (for many use cases):
Until recently, I did not have a good alternative for this complexity inflating, budget killing, risky tendency to try to put everything in Hadoop. Then I attended the Denver Agile Analytics meetup and the presenter that night was Kent Graziano, the senior evangelist for Snowflake Computing. His presentation was about his experience and some techniques used to do agile data warehouse development.
After his agile BI presentation he did a separate presentation on Snowflake Computing.
It rocked my world!
Snowflake is a new elastic data warehouse database engine built from the ground up for doing data warehouses on Amazon Web Services (AWS).
Kent referenced his blog post: 10 Cool Things I like About Snowflake and went though some of his top 10 things he liked about Snowflake.
In my next blog post titled “Snowflake – Will it give Hadoop a run for the money?” I will tell you why I am so excited about this product and its many useful features. In a nutshell it reduces the complexity by at least an order of magnitude and allows for the delivery of data warehouses at a whole new pace. At a recent Cloud Analytics City Tour, a Snowflake customer did a presentation and had deployed a fully functional data warehouse from scratch in 90 days. With traditional data warehouse tools and vendors it can take more time than that just to negotiate the vendor contracts.