SQL-on-Hadoop tools help users navigate enterprise Hadoop course
by Jack Vaughan
Hadoop may be a technology in waiting, unless SQL-on-Hadoop tools turn it into an enterprise mainstay.
Hadoop can handle large amounts, and all kinds, of data, but its core Java APIs are often complex to use – same goes for the ones in MapReduce, its original programming environment and compute engine. As a result, the open source distributed processing framework runs the risk of remaining the king of big data proof-of-concept projects. But SQL-on-Hadoop technology opens up a possible path to wider adoption.
In a June 2014 blog post, Ovum analyst Tony Baer wrote that SQL could serve as the "gateway drug" to Hadoop use in enterprise applications. The reason: It enables organizations to tap the armies of workers adept in SQL, the standard programming language for relational databases, to build and run analytical queries against Hadoop data.
Assorted new tools provide ways to bring SQL into Hadoop applications -- and they're starting to find a place in IT environments.
"The workloads I now see are 80% to 90% SQL on top of Hadoop," said Mac Noland, a solutions architect at phData, which offers Hadoop consulting services to organizations in the Minneapolis-St. Paul, Minn., area. "That's because there are a lot of skilled SQL people out there. It's a very natural fit."
What the data doctor ordered
For ZirMed Inc., a vendor of software as a service applications for healthcare administration uses based in Louisville, Ky., SQL querying support was a crucial factor in making the deployment of a Hadoop cluster feasible. "If we didn't have a SQL level on top of Hadoop, we would not have implemented Hadoop," said Chris Schremser, ZirMed's CTO.
Schremser said that programming Java in jobs in MapReduce wouldn't have been "conducive to developer productivity" in his organization. Instead, ZirMed is using Hortonworks' Hadoop distribution along with Apache Hive, open source software that lets SQL-savvy developers and business users at the company query data stored in the Hadoop Distributed File System (HDFS).
ZirMed installed the Hadoop cluster in the fall of 2014, with 29 compute nodes and 1.2 PB of raw storage capacity. The system is being used to collect data on medical payments and insurance claims from the company's customers for analysis. The information is straightforward transaction data, not the unstructured or semi-structured forms of data often found in HDFS, and the jobs that Schremser's team runs against it with Hive aren't highly interactive. But he called them "big queries" -- for example, answering questions on insurance eligibility that require churning through more than two years of patient records and transactions.
"We don't have an unstructured [data] problem today," Schremser said. "Our issue is quantity. We had so much data that querying it could be a problem."
Previously, ZirMed ran some queries on a data warehouse appliance and others directly on its transaction processing system because of storage limitations in the data warehouse. In the latter case, query jobs that took hours -- or multiple days -- to complete against the transaction database have been reduced to a handful of minutes on the Hadoop cluster, according to Schremser.
ZirMed also put Hadoop and Hive into production with an eye toward reduced costs going forward, he continued. According to a case study write-up posted on the Hortonworks website, the hardware for the Hadoop cluster cost $235,000, about 30% of what Schremser said the company had spent on the data warehouse system. And for that price, the cluster contains five times more usable storage space than the warehouse provided.
Drill-ing into Hadoop data
Michael Fabacher is another SQL-on-Hadoop user. Fabacher is vice president of data architecture and database development at Cardlytics, an Atlanta-based company that analyzes credit and debit card purchase data to help retailers and restaurants send targeted promotional offers to consumers through card-issuing banks. Cardlytics began working with the Hadoop distribution from MapR Technologies in September 2014, and it has also deployed Apache Drill, an open source SQL-on-Hadoop tool whose development is being spearheaded by MapR.
As at ZirMed, using the Hadoop cluster to process and analyze pressing volumes of structured transaction data is the first order of business for Cardlytics. Fabacher's team did a bake-off between four SQL-on-Hadoop tools before selecting Drill to aid with the analytics. "None of them could support everything we wanted," he said, pointing to the relative newness of the technologies. But he gave credit to MapR for working with Cardlytics to add specific SQL functionality that the company required, including window functions for gathering values from a set of rows in a data store, automatic partitioning of data to boost query performance and new types of joins.
"Drill is not fully matured, but we think it will be," Fabacher said. The tool just became available in a 1.0.0 version in May, followed by a 1.1.0 release earlier this month that incorporated the window functions and auto partitioning sought by Cardlytics along with other new features.
While the large ranks of SQL-skilled workers should give some comfort to organizations embarking on Hadoop journeys, the growing ranks of SQL-on-Hadoop options could be unsettling. Since the first days of Apache Hive, the field has become increasingly crowded. In a presentation at the 2015 Pacific Northwest BI Summit in Grants Pass, Ore., this month, Gartner analyst Merv Adrian listed 14 different tools -- and that's not a full count of what's available. In addition to Hive and Drill, other prominent examples include Impala, developed by Hadoop vendor Cloudera; IBM's Big SQL; Pivotal Software's Hawq; Spark SQL; and Presto, an open source tool that like Hive originated at Facebook.
One gateway, many paths
According to Matt Aslett, an analyst at 451 Research, the sources of commercial SQL-on-Hadoop tools can be broadly broken down into three categories: independent Hadoop distribution providers, incumbent relational database and data warehouse providers that are "looking to get into the act," and independent startups specifically focused on the technology.
Aslett said Hive has matured significantly, but its use is still oriented largely toward long-running MapReduce batch jobs. He added that he expects interest to continue to grow in newer tools like Drill, Impala and Presto that take inspiration from Google's Dremel distributed query tool and focus on applications supporting higher degrees of interactivity than is common on Hive queries.
Overall, though, it's still early going for SQL on Hadoop, and Aslett said some patience is required when looking to put the tools into production use. "They are definitely a work in progress," he noted.
Executive editor Craig Stedman contributed to this story.