
RDD in Apache Spark

Author: Amadeusz Kosik, Big Data Engineer


Do you want to see the number of partitions, or the partition sizes in rows, from within a job? Or maybe you just like some low-level hacking? The RDD API is still there, accessible via the .rdd method.

Why is the RDD API still around?

Despite the deprecation of the RDD API, the engine of Apache Spark (at least its open-source part – see our article on Photon) still uses RDDs in its internals. The API is also available via the .rdd method of Datasets and, therefore, of DataFrames as well. Keep in mind, though, that the number of genuinely useful operations unavailable from the Dataset API is very small; excluding some low-level or Spark-internals hacking, it currently boils down to checking the partition count and the partition sizes – using the getNumPartitions method or the glom operator, respectively.
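
For a concrete picture, here is a minimal, self-contained sketch; the session setup and the inputData name are illustrative assumptions, not something fixed by Spark:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("rdd-peek")   // hypothetical app name
  .master("local[*]")    // local run, for illustration only
  .getOrCreate()

import spark.implicits._

// A small example DataFrame standing in for the inputData used below.
val inputData = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// .rdd exposes the underlying RDD; for a DataFrame it is an RDD[Row].
val asRdd: RDD[Row] = inputData.rdd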

Check number of partitions

The RDD API still provides the getNumPartitions method for this use case:

println(inputData.rdd.getNumPartitions)
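
Continuing the sketch above (same hypothetical inputData), the reported count tracks repartitioning as expected:

// Number of partitions of the small local DataFrame.
println(inputData.rdd.getNumPartitions)

// After an explicit repartition, getNumPartitions reports the new count.
val repartitioned = inputData.repartition(4)
println(repartitioned.rdd.getNumPartitions)   // prints 4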

Check number of rows per partition

The glom function gathers all rows within each partition into a single array, which makes it easy to check the number of rows per partition. For wide rows, consider using select to limit the number of columns first – RDDs are not covered by most of Spark's optimization mechanisms.

inputData.rdd.glom().map(_.length).collect().foreach(println)
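
Applying the column-limiting advice to the same hypothetical inputData, a narrow projection keeps the per-partition arrays cheap before the data leaves the optimized Dataset path:

inputData
  .select("key")   // project down to one cheap column first
  .rdd
  .glom()          // one Array[Row] per partition
  .map(_.length)   // number of rows in each partition
  .collect()
  .foreach(println)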
