Databricks – Photon
The Databricks platform offers its customers two execution engines: standard Apache Spark (the open-source engine) and one enhanced with Photon, which brings a performance improvement (as well as a higher price). Have you ever wondered where this speedup comes from and how it affects the design of Apache Spark jobs?
This article is based on the Photon paper written by Databricks engineers (see Sources below), as that publication is the closest we can get to the source: the Photon engine is not an open-source project, and its source code is not available to the public.
The general idea
In a nutshell, Photon replaces the standard (bundled) query engine of Apache Spark while keeping the same API. For some operations, mostly CPU-heavy ones, the Catalyst optimizer may decide to hand the work over to Photon instead of the default execution path, purely for performance reasons. In other words, Photon is an alternative way to execute Spark DAG tasks; it does not skip, reorder, or otherwise reorganise the work the engine performs.
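The transparency of the API can be illustrated with a short, hypothetical PySpark snippet: the query below is plain Spark code, and the only place where Photon becomes visible is the physical plan. On a Photon-enabled Databricks cluster the plan typically contains Photon-prefixed operators, though the exact node names depend on the runtime version; the input path here is made up.

```python
# Minimal sketch: the query is plain Spark code; whether Photon runs it is
# decided by the engine, not by the API you call.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")  # hypothetical input path
daily_revenue = (
    orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Inspect the physical plan; with Photon enabled, expect Photon-prefixed
# nodes (e.g. a Photon aggregate) instead of the usual HashAggregate.
daily_revenue.explain(mode="formatted")
```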
SIMD
SIMD stands for Single Instruction, Multiple Data and is one of the core ideas behind Photon's optimisations. On modern CPU architectures, running the same operation on multiple values (rows, cells, etc.) can be vectorised, which speeds up the work even within a single thread. Photon is said to make heavy use of these optimisations.
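Photon's kernels are written in C++, so the following is only a rough Python/NumPy analogy of what vectorised, column-at-a-time execution buys: the same multiplication expressed as a per-row loop versus a single operation over contiguous arrays, which NumPy (like a vectorised engine) can execute with SIMD instructions.

```python
# Rough analogy (not Photon code): processing a column as one contiguous
# array lets the runtime apply the same instruction to many values at once,
# whereas a per-row loop handles one value at a time.
import numpy as np

prices = np.random.rand(1_000_000)
quantities = np.random.randint(1, 10, size=1_000_000)

# Row-at-a-time: one multiplication per iteration, lots of per-row overhead.
revenue_rows = [p * q for p, q in zip(prices, quantities)]

# Column-at-a-time: a single vectorised multiply over contiguous arrays,
# which the underlying loop (and the CPU's SIMD units) can batch.
revenue_cols = prices * quantities
```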
C++ and code generation
The Photon engine is implemented in C++ instead of JVM languages (Scala, Java or others). The source paper points to performance reasons, including 'hitting performance ceilings' of the JVM. Communication with the rest of Apache Spark happens via JNI. Databricks' internal benchmarks indicate that the overhead of moving data in and out of the JVM is negligible.
Internal data format
The Photon engine uses a columnar data representation (like, for example, the Parquet file format) instead of a row-based one (like the rest of Apache Spark). This choice follows from the SIMD optimisations: the execution kernels work best on columnar data. Memory management (allocation, freeing, etc.) is still handled by Apache Spark's memory manager. The data is kept off-heap, so transferring it between Photon and Spark does not require copying.
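As a toy illustration (not Photon's actual in-memory format), here is the same small dataset laid out row-wise and column-wise; the columnar layout gives each kernel one contiguous array to scan, which is exactly what SIMD-friendly execution needs.

```python
# Toy illustration: the same three records laid out row-wise vs column-wise.
rows = [
    ("2024-01-01", "EUR", 10.0),
    ("2024-01-01", "USD", 12.5),
    ("2024-01-02", "EUR", 7.3),
]

columns = {
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "currency":   ["EUR", "USD", "EUR"],
    "amount":     [10.0, 12.5, 7.3],
}

# Summing "amount" touches one tight array in the columnar layout...
total_from_columns = sum(columns["amount"])
# ...but has to walk every record and pick the field out in the row layout.
total_from_rows = sum(r[2] for r in rows)
```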
When a shuffle operation is necessary, Photon writes a shuffle file and uses the Spark shuffle API to execute the exchange. However, the data format is not compatible with vanilla Spark, so a Photon shuffle write must be followed by a Photon shuffle read.
When does it help?
Photon is meant to address CPU-heavy workloads. This includes joins (especially hash joins) and aggregations. On the other hand, being a non-JVM implementation, Photon does not support UDFs or the RDD API, so those parts of a query stay on the regular Spark execution path, as in the sketch below. Exact benchmarks and precise speedups are given in the source paper.
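A hedged sketch of the difference (paths and column names are hypothetical): the join-plus-aggregation query is the kind of CPU-heavy work Photon targets, while the Python UDF in the second query cannot run inside the native engine, so that part of the plan falls back to the row-based JVM execution.

```python
# Sketch with hypothetical inputs: Photon-friendly vs not-Photon-friendly work.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")        # hypothetical path
customers = spark.read.parquet("/data/customers")  # hypothetical path

# Hash join + aggregation: a good fit for Photon's vectorised operators.
photon_friendly = (
    orders.join(customers, "customer_id")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)

# A Python UDF cannot run inside the native engine, so this projection
# is executed by the regular row-based Spark path instead.
normalise = F.udf(lambda s: s.strip().upper() if s else None, StringType())
not_photon_friendly = orders.withColumn("currency", normalise("currency"))
```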
Sources
- Source paper: Photon: A Fast Query Engine for Lakehouse Systems
I hope this helps. And if you know any other good sources, do let us know on social media so that everyone can benefit from them.