ON TRAFFIC-AWARE PARTITION AND AGGREGATION IN MAPREDUCE FOR BIG DATA APPLICATIONS
Keywords:
MapReduce, Hadoop, Data aggregation, Dynamic manner, Online algorithm, Distributed algorithm
Abstract:
Within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce
tasks. Although Hadoop already provides a similar function, called combine, it operates immediately after a
map task and only on that task's own output, so it fails to exploit aggregation opportunities among multiple tasks on
different machines. We jointly consider data partition and aggregation for a MapReduce job with the objective of
minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications that
decomposes the original large-scale problem into several subproblems that can be solved in parallel. Moreover, we design an
online algorithm to adjust data partition and aggregation dynamically. Finally, extensive
simulation results demonstrate that our proposals can significantly reduce network traffic cost in both the offline and online
cases.
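
The difference between a per-task combine and aggregation across tasks can be illustrated with a small sketch. It is not the algorithm proposed here. It assumes a word-count style job in which values with the same key are summed, and it measures traffic simply as the number of key-value pairs forwarded toward the reduce tasks, ignoring network topology and the cost of moving data from map tasks to the aggregator. The helper names `combine` and `merge` and the sample data are hypothetical.

```python
from collections import Counter

def combine(records):
    """Per-task combine: sum the values of records that share a key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)

def merge(outputs):
    """Aggregator: merge already-combined outputs from several map tasks."""
    totals = Counter()
    for output in outputs:
        totals.update(output)  # Counter.update adds counts key by key
    return dict(totals)

# Hypothetical output of three map tasks on different machines.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# Combine runs once per map task, on that task's records only.
combined = [combine(records) for records in map_outputs]

# Simplified traffic measure (an assumption): key-value pairs forwarded.
pairs_raw = sum(len(records) for records in map_outputs)
pairs_with_combine = sum(len(output) for output in combined)
aggregated = merge(combined)
pairs_with_aggregator = len(aggregated)

print(pairs_raw, pairs_with_combine, pairs_with_aggregator)
```

In this toy input the key "a" is produced by all three map tasks, so a per-task combine still forwards it three times, whereas an aggregator that sees all three outputs forwards it once. Where such an aggregator should be placed, and how keys should be partitioned among reduce tasks, is the optimization problem this work addresses.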