北京新科院 | 北京尖峰新锐信息科技研究院

Overview

The BigData100 is an open source project for benchmarking and ranking big data systems. The benchmarks we use are selected from BigDataBench . Currently, the big data systems include Hadoop, Spark, Flink, Hive on Hadoop, Impala, and the workloads cover offline batch processing, iterative machine learning, and interactive query processing from different domains. We will include more systems and workloads in near future.

Benchmarks

The benchmarks come from BigDataBench . The batch processing and iterative machine learning part support Hadoop, Spark and Flink, while the interactive query processing benchmarks cover systems like Impala, Spark SQL, Hive on Hadoop, Hive on Tez and Hive on Spark. In the first part, there are 4 data sets and 6 workloads in BigData100. Table 1 summarizes the real-world data sets and scalable data generation tools included into BigData100; Table 2 presents the workloads of BigData100 for batch processing and machine learning.

Table 1: The Summary of Data Sets

Data sets	Raw data size	Scalable data set
Wikipedia Entries	1.600,000,000 English words (unstructured text)	Text Generator of BDGS of BigDataBench
Amazon Movie Reviews	5,700,000 reviews (semi-structured text)	Text Generator of BDGS of BigDataBench
Google Web Graph	16777216 nodes, 99184770 edges (unstructured graph)	Graph Generator of BDGS of BigDataBench
Facebook Social Network	460,000,000 vectors	Graph Generator of BDGS of BigDataBench

Table 2. The summary of the workloads in BigData100

TeraSort	IO-Intensive	Text	From Hadoop
WordCount	CPU-Intensive	Text	From BigDataBench
Grep	CPU-Intensive	Text	From BigDataBench
PageRank	Hybrid	Graph	From BigDataBench
K-means	CPU-Intensive	Graph	From BigDataBench
NaiveBayes	CPU-Intensive	Text	From BigDataBench

Contributors

Jingwei Li, BAFST, lijingwei@mail.bafst.com

Xinhui Tian, ICT, CAS, tianxinhui@ict.ac.cn

Jianfeng Zhan, ICT, CAS, zhanjianfeng@ict.ac.cn

TestBed

Now we use a 16-node cluster as the testbed. Table 3 summarizes the configurations of systems.

Table 3: The configurations of cluster

Computation nodes	16 nodes
CPU per node	2*Intel Xeon E5645
Memory per node	32GB
Disk per node	1TB x 2 SATA disks
Network	Broadcom NetXtreme II Gigabit Ethernet

Results | 2016.01

We released some preliminary results of BigData100.

Part 1: Batch Processing and iterative Machine Learning

The first BigData100 ranking report covers Spark, Hadoop, and Flink. Table 4 shows the performance numbers of different systems running different benchmarks.

Table 4: the performance metrics of different systems (seconds)

Workload	Hadoop v. 2.7.1	Spark v 1.5.1	Flink 0.9.1
WordCount	680	1620	1105
Grep	672	1860	1676
NaiveBayes	490	780	563
PageRank	7753	780	598 (469 in delta PageRank)
Kmeans	6722	672	805

WordCount

Grep

NaiveBayes

PageRank

Kmeans

Part 2: Interactive Query Processing

In this part, we have tested five systems with three queries included in BigDataBench. The queries adopt a schema of e-commerce includeing three tables of items, customers, and orders. These three queries cover common operations such as select, aggregation, and join. The data size is described in Table 5, and the results are shown as below. Currently, we only consider the text format for these three tables.

Table 3: The configurations of cluster

Itme	2.19 GB
Customer	14.14 GB
Order	101.99 GB

Select

Aggregation

Join

Runtime System Configurations

Table 6 presents the details of different big data run-time systems configurations

Table 6: The configurations of big data systems

Hadoop

Version	2.7.1
yarn.nodemanager.resource.cpu-vcores	12
yarn.scheduler.maximum-allocation-vcores	12
yarn.nodemanager.resource.memory-mb	22528
mapreduce.map.memory.mb	1024
mapreduce.reduce.memory.mb	1024

Spark

Version	1.5.1
SPARK_EXECUTOR_CORES	12
SPARK_EXECUTOR_MEMORY	22G
Spark.storage.memoryFraction	0.2 (0.5 for iteration)
Spark.shuffle.manager	Sort
Spark.default.parallelism	600

Flink

Version	0.9.1
Taskmanager.heap.mb	22528
Taskmanager.numberOfTaskSlots	12
Parallelism.default	150
Taskmanager.network.numberOfBuffers	9000