Independent Benchmark Open-Source SQL-on-Hadoop Performance Q1 2014
Table of Contents
Executive Summary
Takeaways and Insights
Benchmark Results
Closing

About Radiant Advisors
Radiant Advisors is a leading strategic research and advisory firm that delivers innovative, cutting-edge research and thought leadership to transform today's organizations into tomorrow's data-driven industry leaders. To learn more, visit www.radiantadvisors.com.
Executive Summary
Following several years of hype and headlines, 2014 is the year that mainstream companies are starting to adopt Hadoop to unlock business value -- with or without big data. Yet, while data scientists are the new rock stars of the data world, users who leverage decades of Structured Query Language (SQL) knowledge, technologies, and experience will continue to generate the majority of business value, thanks to SQL's ease of accessibility.
SQL access to data stored in Hadoop was one of the earliest demands from business analysts, and it led to the creation of Hive. HiveQL is a SQL interface to Hadoop MapReduce with limited syntax that enabled access for many early users. Since the creation of Hive, several alternatives have come to market. Cloudera released Impala 1.0 in May 2013 to provide interactive, higher-performance SQL access to data stored in the Hadoop Distributed File System (HDFS) without using MapReduce. The open-source Presto engine also provides improved SQL capability and performance while accessing HDFS directly. And open-source InfiniDB for Hadoop comes from a massively parallel processing (MPP) relational database background to offer complex, analytic SQL capabilities with the high performance that users have come to expect. This independent performance benchmark focused on the performance dimensions of speed and SQL capability by comparing SQL-on-Hadoop options that are open source and free to download for existing Apache, Hortonworks, and Cloudera distributions. We also illustrated the relationship to the available data file encodings -- such as Optimized Row Columnar (ORC), SequenceFile, Parquet, and InfiniDB -- for compression, performance, and openness.
Takeaways and Insights
The Value of Open-Source SQL-on-Hadoop Performance
The affordability and scalability of open-source Hadoop has democratized big data and analytic capabilities for all companies to exploit. While big data still means many things to many people today, solving big problems and creating new, innovative opportunities remain at the heart of this data platform. One of the most proven ways to generate business value is to give as many people as possible within the company broad access to data for discovery and analysis. Traditionally, business intelligence (BI) tools have enabled this access via SQL connectors to relational databases and data stores, such as data warehouses and data marts. Whether through reports or interactive analysis with business metrics and dashboards, SQL has become the most widely learned and accepted form of data access, and a significant portion of data -- including big data -- contains some form of structure that can be accessed via the SQL paradigm. Unlocking business value with SQL access requires more than basic access to data within HDFS: three critical factors of leveraging SQL-on-Hadoop -- Speed, Scalability, and SQL Capability -- deliver the optimum value.
Speed is the database response time from when a user submits the SQL statement to the time the returned data set begins. Without interactive speed, performing data analysis is a slow process for the user and the system can be deemed unusable or unavailable; reports and extractions, however, can be scheduled.

Scalability is the number of nodes within a Hadoop cluster that the SQL statement can be executed on via the SQL engine. Node storage sizes will continue to evolve, but if you need SQL access to a big data set, you must verify that your SQL engine can be executed on all nodes within the Hadoop cluster.

SQL Capability is especially important to move beyond simple data access and realize more value from analytics within Hadoop. Basic access begins with the ANSI SQL-92 standard and connecting with tools, but more advanced SQL engines will support the much-needed analytic-oriented functions from later standards.
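As a simple illustration of the gap between basic and analytic SQL, consider the two statements below; the table and columns are borrowed loosely from the Piwik schema used later in this report, and the queries themselves are illustrative rather than benchmark queries.

-- Basic ANSI SQL-92 aggregation: total visits per site
select idsite, count(distinct idvisit) as total_visits
from log_visit
group by idsite;

-- Analytic extension (window function from later SQL standards): rank sites by visits
select idsite, count(distinct idvisit) as total_visits,
       rank() over (order by count(distinct idvisit) desc) as visit_rank
from log_visit
group by idsite;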
Key Takeaways
There are proven, faster-performing options to Hive 0.11 and 0.12 available for Hadoop. Presto 0.57 performs much faster than Hive 0.11 and 0.12, but look to Impala and InfiniDB for enabling interactivity. InfiniDB leverages the MySQL connector found in nearly all tools; Hive, Presto, and Impala are gaining compatibility.

Hadoop Clusters with Presto: Adding Presto can deliver 10x - 14x performance gains for Hadoop clusters that have data in ORC data files and are currently using Hive 0.11 or 0.12, with no changes to those data files. Scheduled Hive jobs for reports or data extractions will benefit from using Presto with little change to SQL statements (a rough sketch follows this list).

Hadoop Clusters with InfiniDB: The InfiniDB engine and file format delivered up to 40x performance gains over Hive for reporting queries and an additional 20x - 40x over Presto, while adding significant additional analytic SQL capabilities. When both high performance and analytic SQL capabilities are required, migrating data into InfiniDB files is a good option. While IDB files can be accessed by MapReduce programs, they lack broad community standard support.

Hadoop Clusters with Impala: The Impala engine showed significant performance gains that enable interactivity compared to Hive and Presto. Some queries required minor edits, and it was not able to execute analytic queries 8 - 10. Using Parquet partitioned files for Impala was based on the growing acceptance of Parquet files in HDFS for other interfaces within Hadoop. The Impala roadmap shows future support for analytic SQL functions.

ORC-partitioned files were roughly twice as fast as SequenceFile-partitioned files with Presto, while SequenceFile-partitioned files were roughly twice as fast as ORC with Hive.
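As a rough sketch of the "little change" involved, the same Hive-managed ORC table can typically be queried through Presto's Hive connector with the existing SQL largely intact; the catalog and schema names below ("hive", "default") are assumptions for illustration, not the benchmark configuration.

-- Submitted through the Presto CLI against the Hive connector (assumed catalog "hive", schema "default")
select idsite, count(*) as page_hits
from hive.default.log_link_visit_action
where part_date between '2011-01-01' and '2011-01-31'
group by idsite
order by page_hits desc;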
Insights
File Encoding Standards
Data file encodings for HDFS can be optimized for compression, openness, and even SQL performance from a given SQL engine's perspective. Four encodings were tested in the benchmark -- Optimized RC, SequenceFile, Parquet, and IDB -- with both date-partitioned and non-partitioned keys. While performance by file encoding type is noted, it is the SQL engines themselves that provide the orders-of-magnitude differences in gains and capability for evaluation.

Hadoop Architecture
As part of the Hadoop 2 architecture, YARN needs to be a consideration for your own Hadoop architecture strategy. Adopting the YARN pluggable architecture for resource management, versus allowing the direct HDFS access approach, can have implications for effectively managing your overall Hadoop cluster and ecosystem for application and user stability.
SQL Engines
The rapid maturing of SQL-on-Hadoop engines will undoubtedly continue to benefit end users and meet more and more of their demands. These engines will be relied upon to deliver the optimal combination of response times, scalability, and SQL language capability, both today and later in their roadmaps. While it is possible to have all SQL engines co-existing in the Hadoop environment, understanding default and optimized engine roles will help in aligning requirements with engines. Balancing which of the dimensions of response time, scalability, or SQL capability is most critical for a given process (such as discovery and analysis versus advanced analytics on big or small data sets) will help in selecting which SQL engine should be used. Don't expect any one SQL engine to meet all your Hadoop data access needs today: it's okay to have more than one open-source and cost-effective option.
Benchmark Results
The Approach
No performance benchmark is perfect, as every user's specific applications and workloads vary between companies and are often unique. Though the TPC-DS benchmark strives to be exhaustive, this performance benchmark chose to limit variables and focus on a straightforward comparison: SQL engine performance on Hadoop. We compared open-source, free-to-download SQL engines with data loaded into various file formats in Hadoop, while striving to keep many other factors constant.
Benchmark Data Collection
After loading the data files, all queries were manually run independently at least 10 times for each cluster, SQL engine, and file encoding format. Performance statistics were based upon the average of 10 runs, with analysis of minimum, maximum, standard deviation, average deviation from the mean, and the percent deviation-to-average ratio for observed anomalies. No third-party access tools were used; only the interactive command line, scripts, and log files.
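The per-query statistics described above can be reproduced with standard SQL aggregates over the raw timings; the run_times table and its columns below are hypothetical, used only to show the calculation.

-- Hypothetical table of raw timings: run_times(cluster_id, engine, encoding, query_id, run_seconds)
select cluster_id, engine, encoding, query_id,
       avg(run_seconds)        as avg_seconds,
       min(run_seconds)        as min_seconds,
       max(run_seconds)        as max_seconds,
       stddev_pop(run_seconds) as stddev_seconds,
       stddev_pop(run_seconds) / avg(run_seconds) as deviation_to_avg_ratio
from run_times
group by cluster_id, engine, encoding, query_id;

(The average deviation from the mean requires a second pass or a windowed subquery and is omitted here for brevity.)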
Benchmark Application - Piwik
The open-source web analytics application Piwik was selected for the performance benchmark for its schema, seed data, and application queries. The Piwik queries are used to drive the web analytics dashboard application, and the Piwik schema included three tables representing user visits and actions. Three years of initial data was provided and then duplicated to represent 100x user activity over the same three years. In total, this schema represented approximately one terabyte of user data. The SQL statements executed were based on requiring minimal editing from the Piwik application.

Fixed Variables & Out of Scope
- Scalability: each cluster was identical Amazon servers with one head node and four data nodes
- Concurrency: single-user query concurrency with no other activities
- Shuffling data: manually flushing cache and not running the same query multiple times in a row
- No load performance or compression factors measured
Configurations
Amazon EC2 extra-large servers and EBS storage were used to configure three identical AWS clusters of five servers (one name node and four data nodes).

Piwik schema tables (~1.1TB uncompressed):
- log_visit (visits to a site with many actions): 1,412,062,500 rows
- log_link_visit_action (many-to-many relationship): 6,484,811,700 rows
- log_action (individual URLs visited on sites): 2,539,682 rows

Query types:
- Queries 1 - 4 represent Reporting
- Queries 5 - 7 represent Adhoc
- Queries 8 - 10 represent Analytic
Cluster 1 - Hortonworks 1.3: Hive 0.11, Presto 0.57, InfiniDB Enterprise 4.0; file formats: ORC partitioned files, SequenceFile partitioned files, IDB columnar files
Cluster 2 - Hortonworks 2.0: Hive 0.12, Presto 0.57, InfiniDB Enterprise 4.0; file formats: ORC partitioned files, SequenceFile partitioned files, IDB columnar files
Cluster 3 - Cloudera CDH5 beta: Hive 0.11, Impala 1.2, InfiniDB Enterprise 4.0; file formats: Parquet partitioned files, IDB columnar files
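For reference, a date-partitioned ORC table of the kind used on clusters 1 and 2 would typically be declared along the following lines in HiveQL; the column list is abbreviated and assumed for illustration, since the exact benchmark DDL is not reproduced in this report.

-- Assumed, abbreviated HiveQL DDL for a date-partitioned ORC table (not the exact benchmark schema)
create table log_link_visit_action (
  idsite                int,
  idvisit               bigint,
  idaction_url          int,
  idaction_url_ref      int,
  idaction_name_ref     int,
  time_spent_ref_action int,
  server_time           string
)
partitioned by (part_date string)
stored as orc;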
Using Amazon EC2 and EBS for the Hadoop clusters further demonstrates the accessibility and quick deployment of Hadoop in the cloud.
Benchmark Results
Hive completed only 4 of the 10 benchmark queries, those where date partitions were leveraged. Hive could likely complete the remaining queries with more nodes, but it would require a much longer time or significant SQL rewriting. Impala was the fastest on the reporting queries using Parquet-partitioned files, but it was still 4x - 12x slower than InfiniDB on adhoc queries and was not able to execute analytic queries 8 - 10. InfiniDB completed all queries in the benchmark with exceptional times using the IDB file format, while slightly slower than Impala on the reporting queries. Presto completed 9 of the 10 benchmark queries significantly faster than Hive; however, it was still multiple times slower than Impala or InfiniDB.
[Chart: average query response times in seconds (0 - 800) for Hive 0.12, Presto 0.57, InfiniDB 4.0, and Impala 1.2 across queries Q1 - Q10.]

Note: Clusters 1 and 2 implementations of Hortonworks 1.3 and 2.0 demonstrated only a slight difference in results; therefore, the analysis only discusses Hadoop 2. There was a consistent and inverted performance gain between ORC-partitioned and SEQ-partitioned files for both Hive and Presto in clusters 1 and 2, so only ORC-partitioned files were included in the analysis.
Query 1
Reporting Query: Select statement with subselect containing count-distinct, filtering, ordering, group by, order by, and limit, leveraging date partition.
Hive syntax example (10,000 rows returned):

select name, type, subq.*
from (select idaction_url idaction, count(distinct idvisit) dis_visit, count(*) subquery_count
      from log_link_visit_action
      where (part_date between '2011-01-01' and '2011-01-31' and idsite = 5 and idaction_url > 0)
      group by idaction_url
      order by subquery_count desc
      limit 10000) subq
join log_action_tbl on (subq.idaction = log_action_tbl.idaction)
order by subquery_count desc;
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 721.21 s; Presto 0.57 45.24 s; InfiniDB 4.0 15.34 s (47x faster than Hive).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 12.48 s; InfiniDB 4.0 17.52 s (Impala 1.4x faster than InfiniDB).
Comments:
- Impala is fastest, with InfiniDB within a few seconds, on clusters 2 and 3
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- Presto and Impala added "limit 10000" to the end of the query
- InfiniDB retained the implicit numbered column alias in GROUP BY from Piwik (see the short example below)
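For readers unfamiliar with Piwik's positional aliases, the two generic statements below are equivalent; some engines require the explicit column form.

-- Positional column reference retained from the Piwik application
select idaction_url, count(*) as hits
from log_link_visit_action
group by 1;

-- Equivalent explicit form
select idaction_url, count(*) as hits
from log_link_visit_action
group by idaction_url;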
Query 2
Reporting Query: Select statement with count-distinct, filtering, ordering, group by, order by, and limit, leveraging date partition.
Impala syntax example (10,000 rows returned):

select visit_entry_idaction_url idaction, count(distinct visitor_idcookie) INDEX_NB_UNIQ_VISITORS, count(*) INDEX_PAGE_NB_HITS
from log_visit
where (part_date >= '2011-01-01' and part_date <= '2011-01-31' and idsite = 5 and visit_entry_idaction_url > 0)
group by visit_entry_idaction_url
order by INDEX_PAGE_NB_HITS desc
limit 10000;
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 129.29 s; Presto 0.57 18.89 s; InfiniDB 4.0 3.40 s (38x faster than Hive).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 3.10 s; InfiniDB 4.0 3.98 s (Impala 1.2x faster than InfiniDB).
Comments:
- Impala is fastest, with InfiniDB within one second, on clusters 2 and 3
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- Presto and Impala added "limit 10000" to the end of the query
- InfiniDB retained the implicit numbered column alias in GROUP BY from Piwik
- InfiniDB specifies another schema for INDEX
Query 3
Reporting Query: Select statement with sum, group by, filtering, and ordering, leveraging date partition.
Hive syntax example (21,552 rows returned):

select idaction_url_ref, sum(time_spent_ref_action) INDEX_PAGE_SUM_TIME_SPENT
from log_link_visit_action
where (part_date between '2011-01-01' and '2011-01-31' and idsite = 5 and time_spent_ref_action > 0 and idaction_url_ref > 0)
group by idaction_url_ref
order by idaction_url_ref desc;
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 364.59 s; Presto 0.57 24.53 s; InfiniDB 4.0 7.69 s (47x faster than Hive).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 6.30 s; InfiniDB 4.0 8.60 s (Impala 1.3x faster than InfiniDB).
Comments:
- Impala is fastest, with InfiniDB within one second, on clusters 2 and 3
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- Presto and Impala added "limit 10000" to the end of the query
- InfiniDB retained the implicit numbered column alias in GROUP BY from Piwik
- InfiniDB specifies another schema for INDEX
Query 4
Reporting Query: Select statement with sum, filtering, ordering, and limit, leveraging date partition.
Impala syntax example (7,177 rows returned):

select idaction_name_ref, sum(time_spent_ref_action)
from log_link_visit_action
where (part_date >= '2011-01-01' and part_date <= '2011-01-31' and idsite = 5 and time_spent_ref_action > 0 and idaction_name_ref > 0)
group by idaction_name_ref
order by idaction_name_ref desc
limit 10000;
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 358.06 s; Presto 0.57 23.20 s; InfiniDB 4.0 7.75 s (46x faster than Hive).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 6.10 s; InfiniDB 4.0 8.62 s (Impala 1.4x faster than InfiniDB).
Comments:
- Impala is fastest, with InfiniDB within one second, on clusters 2 and 3
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- Presto and Impala added "limit 10000" to the end of the query
- InfiniDB retained the implicit numbered column alias in GROUP BY from Piwik
- InfiniDB specifies another schema for INDEX
Query 5
Adhoc Query: Select statement with date functions, no joins, filtering, and grouping by date functions, without leveraging the date partition in the where clause.
InfiniDB syntax example (21,552 rows returned):

select year(visit_server_date), month(visit_server_date), count(distinct idvisit)
from log_visit
where visit_entry_idaction_url = 2301411
group by year(visit_server_date), month(visit_server_date);
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not able to complete; Presto 0.57 48.16 s; InfiniDB 4.0 1.28 s (38x faster than Presto).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 18.66 s; InfiniDB 4.0 1.60 s (InfiniDB 12x faster than Impala).
Comments:
- InfiniDB is fastest, with Impala 12x slower but still twice as fast as Presto
- Hive 0.12 was not able to complete, with resource issues noted in the log files
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- Presto required explicit parsing for the implicit string-to-date conversions (an assumed sketch follows)
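As a hedged illustration of the conversion Presto needed, a string column can be cast to a date before the date functions are applied; the exact rewrite used in the benchmark is not shown, so treat this as an assumed sketch.

-- Assumed sketch: explicit string-to-date conversion for Presto
select year(cast(visit_server_date as date)),
       month(cast(visit_server_date as date)),
       count(distinct idvisit)
from log_visit
where visit_entry_idaction_url = 2301411
group by 1, 2;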
Query 6
Adhoc Query: Select statement with date functions, count-distinct, no joins, filtering, and grouping by date functions, without leveraging the date partition in the where clause.
Impala syntax example (726 rows returned):

select year(server_time), month(server_time), idaction_url, count(distinct idvisit)
from log_link_visit_action
where idaction_url between 2301400 and 2302400
group by year(server_time), month(server_time), idaction_url;
Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not able to complete; Presto 0.57 142.46 s; InfiniDB 4.0 6.18 s (23x faster than Presto).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 69.38 s; InfiniDB 4.0 6.99 s (InfiniDB 9.9x faster than Impala).
Comments:
- InfiniDB is fastest, with Impala 10x slower but still twice as fast as Presto
- Hive 0.12 was not able to complete, with resource issues noted in the log files
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
Query 7
Adhoc Query: Select statement with subselect containing count, join, filtering, ordering, and group by, without leveraging the date partition in the where clause.
InfiniDB syntax example (2 rows returned):

select sub.visit_entry_idaction_url, sub.name, lv.referer_url, count(sub.idvisit)
from log_visit lv,
     (select visit_entry_idaction_url, name, idvisit
      from log_visit
      join log_action on (visit_entry_idaction_url = log_action.idaction)
      where visit_entry_idaction_url between 2301400 and 2302400) sub
where lv.idvisit = sub.idvisit
group by 1, 2, 3;

Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not able to complete; Presto 0.57 77.41 s; InfiniDB 4.0 4.66 s (16x faster than Presto).
- Cluster 3 (Cloudera CDH5 beta, Parquet partitioned / IDB files): Hive 0.11 not tested; Impala 1.2 24.18 s; InfiniDB 4.0 5.85 s (InfiniDB 4x faster than Impala).
Comments:
- InfiniDB is fastest, with Impala 4x slower but still three times faster than Presto
- Hive 0.12 was not able to complete, with resource issues noted in the log files
- Hive 0.11 on Cloudera intentionally not tested with Parquet files
- InfiniDB retained the implicit numbered column alias in GROUP BY from Piwik
Query 8
Analytic Query: Select statement with subselect containing count-distinct, filtering, ordering, group by, order by, and limit.
InfiniDB syntax example (274 rows returned):

select sub.visit_entry_idaction_url, sub.name, lv.referer_url,
       sum(visit_total_time) total_time, count(sub.idvisit),
       RANK() OVER (PARTITION BY sub.visit_entry_idaction_url ORDER BY count(sub.idvisit)) rank_by_visits,
       DENSE_RANK() OVER (PARTITION BY sub.visit_entry_idaction_url ORDER BY count(visit_total_time)) rank_by_time_spent
from log_visit lv,
     (select visit_entry_idaction_url, name, idvisit
      from log_visit
      join log_action on (visit_entry_idaction_url = log_action.idaction)
      where visit_entry_idaction_url between 2301400 and 2302400) sub
where lv.idvisit = sub.idvisit
group by 1, 2, 3
order by 1, 6, 7;

Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not executable; Presto 0.57 506.84 s; InfiniDB 4.0 6.37 s (80x faster than Presto); Impala 1.2 not executable.
Comments:
- Rank and Dense Rank analytic functions in the SQL statement
- InfiniDB is fastest compared with Presto
- Hive 0.12 and Impala were not able to execute the SQL statement
- InfiniDB retained the implicit numbered column alias in GROUP BY and ORDER BY
Query 9
Analytic Query: Select statement with sum, count, and cumulative distribution, with a sub-select containing a join, filtering, group by, and order by.
InfiniDB syntax example (274 rows returned):

select sub.visit_entry_idaction_url, sub.name, lv.referer_url,
       sum(visit_total_time) total_time, count(sub.idvisit) total_visits,
       CUME_DIST() OVER (PARTITION BY sub.visit_entry_idaction_url ORDER BY sum(visit_total_time) desc) cumulative_distribution
from log_visit lv,
     (select visit_entry_idaction_url, name, idvisit
      from log_visit
      join log_action on (visit_entry_idaction_url = log_action.idaction)
      where visit_entry_idaction_url between 2301400 and 2302400) sub
where lv.idvisit = sub.idvisit
group by 1, 2, 3
order by 1, 5, 6;

Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not executable; Presto 0.57 471.43 s; InfiniDB 4.0 6.24 s (75x faster than Presto); Impala 1.2 not executable.
Comments:
- Cumulative Distribution analytic function in the SQL statement
- InfiniDB is fastest compared with Presto
- Hive 0.12 and Impala were not able to execute the SQL statement
- InfiniDB retained the implicit numbered column alias in GROUP BY and ORDER BY
Query 10
Analytic Query: Select statement with standard deviation and population variance, with a sub-select containing date functions, count-distinct, sum, filtering, ordering, group by, and order by.
InfiniDB syntax example (1,052 rows returned):

select action_year, action_month, idaction_url, sum_visit_time, visit_count,
       stddev_pop(sum_visit_time) OVER (PARTITION BY idaction_url ORDER BY action_year, action_month) visit_time_dev,
       var_pop(sum_visit_time) OVER (PARTITION BY idaction_url ORDER BY action_year, action_month) visit_time_variance
from (select year(server_time) action_year, month(server_time) action_month, idaction_url,
             count(distinct idvisit) visit_count,
             sum(time_spent_ref_action) sum_visit_time
      from log_link_visit_action
      where idaction_url between 2301400 and 2302400
      group by year(server_time), month(server_time), idaction_url) a
order by 3, 1, 2;

Results:
- Cluster 2 (Hortonworks 2.0, ORC partitioned / IDB files): Hive 0.12 not executable; Presto 0.57 not executable; InfiniDB 4.0 7.01 s; Impala 1.2 not executable.
Comments:
- Standard Deviation and Variance analytic functions in the SQL statement
- InfiniDB was quick and the only SQL engine able to execute the query
- Hive 0.12, Presto, and Impala were not able to execute the SQL statement
- InfiniDB retained the implicit numbered column alias in GROUP BY and ORDER BY
Summary of Benchmark Test Results
All values are in seconds of SQL response time and represent the average of 10 independent runs. "x" denotes SQL that could not be executed; "n/t" denotes not tested, as not required; "n/c" denotes not able to complete.

Cluster 1 - Hortonworks 1.3
                   Hive 0.11            Presto 0.57           InfiniDB 4.0
                   ORC-PART  SEQ-PART   ORC-PART  SEQ-PART    IDB
Reporting Q-01     516.12    277.18     52.28     96.35       15.36
Reporting Q-02     130.09    102.59     26.25     64.40       3.71
Reporting Q-03     290.35    140.05     27.95     73.18       7.74
Reporting Q-04     289.35    137.47     27.14     66.88       7.81
Adhoc Q-05         n/c       n/c        74.32     371.38      1.70
Adhoc Q-06         n/c       n/c        161.94    807.75      7.23
Adhoc Q-07         n/c       n/c        86.74     388.15      6.91
Analytic Q-08      x         x          n/t       n/t         n/t
Analytic Q-09      x         x          n/t       n/t         n/t
Analytic Q-10      x         x          x         x           n/t

Cluster 2 - Hortonworks 2.0
                   Hive 0.12            Presto 0.57           InfiniDB 4.0
                   ORC-PART  SEQ-PART   ORC-PART  SEQ-PART    IDB
Reporting Q-01     721.21    325.68     45.24     116.85      15.34
Reporting Q-02     129.29    91.39      18.89     71.84       3.40
Reporting Q-03     364.59    135.67     24.53     86.22       7.69
Reporting Q-04     358.06    131.72     23.20     92.66       7.75
Adhoc Q-05         n/t       n/t        48.16     374.44      1.28
Adhoc Q-06         n/t       n/t        142.46    866.70      6.18
Adhoc Q-07         n/t       n/t        77.41     462.09      4.66
Analytic Q-08      x         x          506.84    n/t         6.37
Analytic Q-09      x         x          471.43    n/t         6.24
Analytic Q-10      x         x          x         n/t         7.01

Cluster 3 - Cloudera CDH5 beta
                   Impala 1.2     InfiniDB 4.0
                   PARQUET-PART   IDB
Reporting Q-01     12.48          17.52
Reporting Q-02     3.10           3.98
Reporting Q-03     6.30           8.60
Reporting Q-04     6.10           8.62
Adhoc Q-05         18.66          1.60
Adhoc Q-06         69.38          6.99
Adhoc Q-07         24.18          5.85
Analytic Q-08      x              6.17
Analytic Q-09      x              5.53
Analytic Q-10      x              7.37
Closing
Closing Comments
This year, we will continue to see the evolution and maturing of these SQL engines, along with the possible addition of new open-source options. This independent benchmark serves as a snapshot of the Hadoop ecosystem in Q1 2014. The subset of vendors and software components included in this benchmark was chosen for architectural commonality and availability to mainstream companies looking to benefit from SQL-on-Hadoop processes and tools. The Hadoop ecosystem continues to evolve with the recent GA release of Hadoop 2 and YARN. The exciting and fast-changing Hadoop ecosystem can be a challenge to keep up with: there are encouraging statistics on Hive 0.13 and Tez, Presto 0.60 is now available, and Parquet adoption continues to gain traction. We look forward to benchmarking these in the future.
Join the Conversation
This independent performance benchmark is intended to be part of an open, ongoing conversation to share and learn from each other's experiences and perspectives. Please share your comments online with us at RadiantAdvisors.com or on Twitter @RadiantAdvisors.
We also recognize that other vendors have valuable solutions to offer companies with different packaging and business models. Our upcoming independent performance benchmarks will continue to strive for proper product categorizations and focused "apples-to-apples" comparisons.
Acknowledgements
InfiniDB engaged Radiant Advisors to perform an independent performance benchmark and analysis of preliminary findings conducted at InfiniDB. While InfiniDB is the sole sponsor of this report, Radiant Advisors independently set up the cluster configurations, testing approach, benchmark execution, all data collection, and the analysis. All findings and takeaways are solely those of Radiant Advisors. Special thanks to Dr. Nathan Halko, resident data scientist, for providing an independent review of the test approach, data, and analysis. Special thanks also to Dundas Data Visualization for providing data visualizations and visual engineering support to Lindy Ryan, Research Director, Data Discovery and Visualization at Radiant Advisors.
References
Piwik.org - open-source web analytics platform
Prestodb.io - distributed SQL query engine for big data
Hortonworks.com - HDP distribution of Apache Hadoop
Cloudera.com - CDH distribution and Impala
InfiniDB.co - InfiniDB for Apache Hadoop
Parquet.io - columnar file format for Hadoop
Radiant Advisors Independent Benchmark Report, April 2014
Copyright © 2014 Radiant Advisors. All Rights Reserved.
InfiniDB and the InfiniDB logo, and any other product or service names or slogans contained in this document, are trademarks of InfiniDB and its suppliers or licensors, and may not be copied, imitated, or used, in whole or in part, without the prior written consent of InfiniDB or the applicable trademark holder. Hadoop, Sqoop, and the Hadoop elephant logo are trademarks of the Apache Software Foundation. The Dundas logo is a trademark of Dundas Data Visualization, Inc. All other trademarks, registered trademarks, product names, and company names or logos mentioned in this document, including Radiant Advisors and its affiliated brands and logos, are the property of their respective owners. Reference to any products, services, processes, or other information by trade name, trademark, manufacturer, supplier, or otherwise does not constitute or imply endorsement, sponsorship, or recommendation thereof by Radiant Advisors. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means, or for any purpose, without the express written permission of Radiant Advisors. Radiant Advisors may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this report. Except as expressly provided in any written license agreement from Radiant Advisors, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. The information in this document is subject to change without notice. Radiant Advisors has made every effort to ensure that all statements and information contained in this document are accurate, but accepts no liability for any error or omission in the same.