Please use this identifier to cite or link to this item: http://repository.hneu.edu.ua/handle/123456789/23444
Title: Experimental research of optimizing the Apache Spark tuning: RDD vs Data Frames
Authors: Minukhin S. V.
Novikov M.
Brynza N. O.
Sitnikov D. E.
Keywords: Apache Spark
resilient distributed dataset
Data Frames
HDFS
shuffling
level of parallelism
data processing
data set
application
execution time
Issue Date: 2020
Citation: Minukhin S. Experimental research of optimizing the Apache Spark tuning: RDD vs Data Frames / S. Minukhin, M. Novikov, N. Brynza, D. Sitnikov // Proceedings of The Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), April 27-May 1. - Zaporizhzhia, 2020. - PP. 409-425.
Abstract: This paper presents the results and analysis of experimental research on the effectiveness of changing Apache Spark tuning parameters (compared to their default values) to minimize application execution time. A test dataset structure has been developed using RDD and Data Frames that makes it possible to quickly generate text files larger than 4 GB with properties (characteristics) configured for testing. A notable feature of the test data is that they reflect basic properties of real-world problems. The investigation consists of two stages: in the first stage, a comparative analysis of RDD and Data Frames is carried out with the default Apache Spark settings; in the second stage, experiments with different sizes of the input test dataset assess the influence of the level of parallelism, the HDFS block size, and the spark.sql.shuffle.partitions parameter in Spark Data Frames. The results substantiate the influence of the spark.sql.shuffle.partitions value on test task execution performance; value ranges and change trends have been identified for this parameter. The levels of parallelism that most strongly influence execution time have also been determined. It has been shown that, for certain input test file sizes, the HDFS block size can be left at its default value. The results of the computational experiments are presented in tables and graphs and confirm the effectiveness of the suggested Apache Spark settings, compared with the defaults, for different tested file sizes.
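The tuning parameters examined in the paper can be set when submitting a Spark application. The following spark-submit sketch is illustrative only: the values shown (and the application name app.py) are placeholders, not the settings evaluated in the experiments.

```shell
# Illustrative spark-submit invocation; values are placeholders rather than
# the settings evaluated in the paper.
#   spark.default.parallelism    - level of parallelism for RDD operations
#   spark.sql.shuffle.partitions - number of partitions used when shuffling Data Frames
#   spark.hadoop.dfs.blocksize   - HDFS block size in bytes (134217728 = 128 MB, the HDFS default)
spark-submit \
  --master yarn \
  --conf spark.default.parallelism=64 \
  --conf spark.sql.shuffle.partitions=64 \
  --conf spark.hadoop.dfs.blocksize=134217728 \
  app.py
```

Note that spark.sql.shuffle.partitions (default 200) applies only to Data Frame/SQL shuffles, while spark.default.parallelism governs RDD operations; this distinction mirrors the RDD vs Data Frames comparison in the paper.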
URI: http://repository.hneu.edu.ua/handle/123456789/23444
Appears in Collections: Articles (ICT)

Files in This Item:
File: paper31.pdf
Size: 503,34 kB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.