IMPROVEMENT OF SPARK STREAMING THROUGH BATCH SIZING WITH PERFORMANCE ANALYSIS
Keywords:
Streaming; Spark; Batch Sizing; Spark Streaming; Dynamic Batch

Abstract
The need for real-time processing of "big data" has led to the development of frameworks for distributed
stream processing in clusters. It is important for such frameworks to be robust against variable operating conditions such
as server failures, changes in data ingestion rates, and workload characteristics.
To provide fault tolerance and efficient stream processing at scale, recent stream processing frameworks have proposed
treating streaming workloads as a series of batch jobs on small batches of streaming data. The robustness of such
frameworks against variable operating conditions has not been explored.
We explore the effects of batch size on the performance of streaming workloads. The throughput and end-to-end
latency of the system can depend in complicated ways on the batch size, the data ingestion rate, variations in available
resources, and workload characteristics. We propose a simple yet robust control algorithm that automatically adapts the
batch size as conditions require.
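To make the adaptation idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: a feedback controller that grows the batch interval when the system falls behind (processing time approaching the batch interval) and shrinks it when there is headroom, keeping end-to-end latency low. The function name, the `safety` margin, and the interval bounds are all assumptions introduced for illustration.

```python
def adapt_batch_interval(batch_interval, processing_time,
                         safety=0.9, min_interval=0.1, max_interval=10.0):
    """Return the next batch interval in seconds (illustrative sketch).

    Stability requires that each batch is processed faster than data
    arrives, i.e. processing_time < batch_interval; the controller
    steers toward processing_time ~= safety * batch_interval.
    """
    if processing_time >= safety * batch_interval:
        # Falling behind: enlarge the interval so processing fits
        # within it with the chosen safety margin.
        next_interval = processing_time / safety
    else:
        # Headroom available: move partway toward the operating
        # point to reduce queueing and end-to-end latency.
        target = processing_time / safety
        next_interval = 0.5 * (batch_interval + target)
    # Clamp to configured bounds.
    return max(min_interval, min(next_interval, max_interval))
```

For example, a batch that took 1.2 s to process under a 1 s interval would trigger an increase of the interval, while a 0.5 s processing time under a 2 s interval would let the controller shrink the interval toward lower latency.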