RDD in Apache Spark
Learn how to use the RDD API in Apache Spark to inspect partition details or perform low-level operations. Although it has largely been superseded by Datasets and DataFrames, the RDD API remains accessible via the .rdd method on both. Discover how to check the number of partitions with the getNumPartitions method and determine partition sizes with the glom function, then explore the other operations the RDD API still offers for low-level hacking and internal Spark tasks.
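To illustrate both inspections, here is a minimal sketch assuming a local SparkSession; the object name and the example DataFrame are illustrative, not part of any real pipeline:

```scala
import org.apache.spark.sql.SparkSession

object RddPartitionInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-partition-inspection")
      .master("local[4]") // assumption: a local 4-core session for the demo
      .getOrCreate()

    // A small illustrative DataFrame.
    val df = spark.range(0, 1000).toDF("id")

    // Drop down to the RDD underlying the DataFrame.
    val rdd = df.rdd

    // Number of partitions the data is split into.
    println(s"partitions: ${rdd.getNumPartitions}")

    // glom() gathers each partition into a single Array, so the length
    // of each array is the number of rows in that partition.
    val partitionSizes = rdd.glom().map(_.length).collect()
    partitionSizes.zipWithIndex.foreach { case (size, idx) =>
      println(s"partition $idx holds $size rows")
    }

    spark.stop()
  }
}
```

Note that glom materializes each partition as an in-memory array on the executors, so this way of measuring partition sizes is only practical on data small enough to hold one partition in memory at a time.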