<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Amadeusz Kosik, Author at TantusData</title>
	<atom:link href="https://tantusdata.com/author/amadeusz-kosik/feed/" rel="self" type="application/rss+xml" />
	<link>https://tantusdata.com</link>
	<description>That uncovers wisdom.</description>
	<lastBuildDate>Tue, 25 Mar 2025 08:51:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.7.1</generator>

<image>
	<url>https://tantusdata.com/app/uploads/2023/01/cropped-Favicon-32x32.png</url>
	<title>Amadeusz Kosik, Author at TantusData</title>
	<link>https://tantusdata.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>RDD in Apache Spark</title>
		<link>https://tantusdata.com/insights/rdd-in-apache-spark/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Fri, 28 Jun 2024 07:33:40 +0000</pubDate>
				<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[data pipelines]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=2128</guid>

					<description><![CDATA[<p>Learn how to utilize the RDD API in Apache Spark to check partition details or perform low-level operations. Despite being deprecated, the RDD API is accessible via the .rdd method on Datasets and DataFrames. Discover how to check the number of partitions with the getNumPartitions method and determine partition sizes using the glom function. Explore the remaining useful operations that RDD API offers for low-level hacking and internal Spark tasks.</p>
<p>The post <a href="https://tantusdata.com/insights/rdd-in-apache-spark/">RDD in Apache Spark</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="512" src="https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-1024x512.jpg" alt="" class="wp-image-2185" srcset="https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-1024x512.jpg 1024w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-300x150.jpg 300w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-768x384.jpg 768w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-1536x768.jpg 1536w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2419641841-2048x1024.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Do you want to see the number of partitions? Or the partition size in rows from within the job? Or maybe you just like some low-level hacking? The RDD API is still there and accessible via the <em>.rdd</em> method.&nbsp;</p>



<h2 class="wp-block-heading">Why is RDD API still around?</h2>



<p>Although the RDD API is deprecated for everyday use, RDDs still power the internals of the Apache Spark engine (at least its open-source part &#8211; see our article on Photon). The API also remains available via the <em>.rdd</em> method of <em>Datasets</em> and, therefore, <em>DataFrames</em>. Keep in mind, though, that very few useful operations are missing from the Dataset API: apart from low-level work and hacking on Spark internals, it boils down to checking the partition count with the <em>getNumPartitions</em> method and the partition sizes with the <em>glom</em> operator.</p>



<h2 class="wp-block-heading">Check number of partitions</h2>



<p>The RDD API still offers the <a href="https://spark.apache.org/docs/3.2.0/api/scala/org/apache/spark/rdd/RDD.html#getNumPartitions:Int"><em>getNumPartitions</em></a> method for that use case:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="println(inputData.rdd.getNumPartitions)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #88C0D0">println</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">inputData</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">rdd</span><span style="color: #ECEFF4">.</span><span style="color: 
#D8DEE9">getNumPartitions</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
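
<p>For a version that can be run end to end, here is a minimal sketch. The local session settings, the <em>PartitionCountSketch</em> object, the <em>partitionCount</em> helper and the generated <em>inputData</em> are assumptions made for illustration; they are not part of the original job.</p>

<pre class="wp-block-code"><code>import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionCountSketch {
  // Number of partitions backing a DataFrame, read through the RDD API.
  def partitionCount(df: DataFrame): Int = df.rdd.getNumPartitions

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the article).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-count-sketch")
      .getOrCreate()

    // Hypothetical stand-in for inputData.
    val inputData = spark.range(0, 1000).toDF("id")

    // The initial count depends on the source and the configuration.
    println(partitionCount(inputData))
    // An explicit repartition(8) asks Spark for exactly 8 partitions.
    println(partitionCount(inputData.repartition(8)))

    spark.stop()
  }
}</code></pre>

<p>Comparing the two printed values is a quick way to confirm that a <em>repartition</em> call actually took effect before a heavy stage runs.</p>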






<h2 class="wp-block-heading">Check number of rows per partition</h2>



<p>The <a href="https://spark.apache.org/docs/3.2.0/api/scala/org/apache/spark/rdd/RDD.html#glom():org.apache.spark.rdd.RDD[Array[T]]"><em>glom</em></a> function coalesces all rows in each partition into an array, so it can be used to check the number of rows per partition. For wide rows, consider using <em>select</em> first to limit the number of columns, because RDDs do not benefit from most of Spark’s optimizations.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="inputData.rdd.glom().map(_.length).collect().foreach(println _)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">inputData</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">rdd</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">glom</span><span style="color: #D8DEE9FF">()</span><span style="color: 
#ECEFF4">.</span><span style="color: #88C0D0">map</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">_</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">length)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">collect</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">foreach</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">println</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">_</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
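
<p>The advice about wide rows can be sketched as follows. The <em>rowsPerPartition</em> helper, the column names and the generated data are hypothetical; the only point is that <em>select</em> runs before the switch to the RDD level, so <em>glom</em> builds arrays of narrow rows.</p>

<pre class="wp-block-code"><code>import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object RowsPerPartitionSketch {
  // Keep one column before dropping to the RDD level, then count rows in each partition.
  def rowsPerPartition(df: DataFrame, column: String): Array[Int] =
    df.select(column).rdd.glom().map(_.length).collect()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the article).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("rows-per-partition-sketch")
      .getOrCreate()

    // Hypothetical input: an id plus a payload column that is not needed for counting.
    val inputData = spark.range(0, 1000).toDF("id")
      .withColumn("payload", lit("some wide value"))
      .repartition(4)

    rowsPerPartition(inputData, "id").zipWithIndex.foreach { case (rows, partition) =&gt;
      println(s"partition $partition: $rows rows")
    }

    spark.stop()
  }
}</code></pre>

<p>The returned array has one entry per partition, so a heavily skewed partition shows up as a single outlier in the printed list.</p>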
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Datasets and DataFrames</title>
		<link>https://tantusdata.com/insights/dtasets-and-dataframes/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Tue, 11 Jun 2024 06:41:52 +0000</pubDate>
				<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[data pipelines]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=2105</guid>

					<description><![CDATA[<p>Understanding Spark's .as[T] Method: Best Practices and Defensive Programming</p>
<p>The post <a href="https://tantusdata.com/insights/dtasets-and-dataframes/">Datasets and DataFrames</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-1024x576.jpg" alt="" class="wp-image-2106" srcset="https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-300x169.jpg 300w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-768x432.jpg 768w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-1536x864.jpg 1536w, https://tantusdata.com/app/uploads/2024/06/shutterstock_2301467093-2048x1152.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>With the good old RDD API deprecated for public use, Spark users are left with two options: typed <em>Datasets</em> and untyped <em>DataFrames</em> (which are actually a special case of Datasets). The API also lets users freely convert one into the other &#8211; e.g. using the <em>.as[T]</em> method to cast an untyped <em>DataFrame</em> to a <em>Dataset[T]</em>. The cast does not change the underlying data, though, which can lead to surprising results if one is not aware of it.</p>



<h2 class="wp-block-heading">What does <em>.as[T]</em> do?</h2>



<p>Let’s start by looking at the documentation in the source code:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
Since:
1.6.0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">Returns</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">new</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Dataset</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">where</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">record</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">has</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">been</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">mapped</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">on</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">type</span><span style="color: #ECEFF4">.</span><span style="color: 
#D8DEE9FF"> </span><span style="color: #D8DEE9">The</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">method</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">used</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">map</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">depend</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">on</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">type</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">U</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).</span></span>
<span class="line"><span style="color: #D8DEE9FF">When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).</span></span>
<span class="line"><span style="color: #D8DEE9FF">When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.</span></span>
<span class="line"><span style="color: #8FBCBB">If</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">schema</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">does</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">match</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">desired</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">U</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">, </span><span style="color: #8FBCBB">you</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">can</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">use</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">along</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">alias</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">or</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">rearrange</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">or</span><span style="color: #D8DEE9FF"> </span><span 
style="color: #8FBCBB">rename</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">required</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #8FBCBB">Note</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">that</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF">[] </span><span style="color: #8FBCBB">only</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">changes</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">view</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">that</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">passed</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">typed</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">, </span><span style="color: #8FBCBB">such</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">map</span><span style="color: #D8DEE9FF">(), </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">does</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">eagerly</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">project</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">away</span><span style="color: #D8DEE9FF"> 
</span><span style="color: #8FBCBB">any</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">that</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">are</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">present</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">class</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #8FBCBB">Since</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">1.6.0</span></span></code></pre></div>






<p>The last note is crucial: casting to a <em>Dataset</em> does not change the underlying data. Any columns not present in <em>T</em> (e.g., columns without a corresponding field in the case class) are kept rather than discarded.</p>
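
<p>A small experiment makes this visible. The sketch below is an illustration under assumptions: the local session, the sample row, the extra <em>label</em> column and the <em>castToArtists</em> helper are hypothetical, and the <em>Artist</em> case class mirrors the one used later in this post.</p>

<pre class="wp-block-code"><code>import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Defined at the top level so that Spark can derive an encoder for it.
case class Artist(id: String, name: String, location: String)

object AsKeepsColumnsSketch {
  // Plain cast, without any column trimming.
  def castToArtists(spark: SparkSession, input: DataFrame): Dataset[Artist] = {
    import spark.implicits._
    input.as[Artist]
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the article).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("as-keeps-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input with one column ("label") that Artist does not declare.
    val input = Seq(("1", "Artist A", "Warsaw", "Label X"))
      .toDF("id", "name", "location", "label")

    val artists = castToArtists(spark, input)

    // The static type is Dataset[Artist], but the schema still carries the extra column.
    println(artists.columns.mkString(", ")) // prints: id, name, location, label

    // A typed operation goes through the Artist encoder, so its result has only Artist's fields.
    println(artists.map((a: Artist) =&gt; a).columns.mkString(", ")) // prints: id, name, location

    spark.stop()
  }
}</code></pre>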



<h2 class="wp-block-heading">Why bother?</h2>



<p>There are a few situations where having extra columns may be surprising and create problems with a job run (or even worse &#8211; silently introduce data quality issues):</p>



<ul class="wp-block-list">
<li>running <em>union</em> or <em>unionAll</em> transformations on non-aligned data,</li>



<li>calling <em>distinct</em> (the hidden columns take part in the comparison as well),</li>



<li>saving data (the output will include the extra columns).</li>
</ul>
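
<p>The <em>distinct</em> case from the list above can be reproduced with a few lines. This is a sketch under assumptions: the local session, the sample rows, the extra <em>label</em> column and the <em>sparkDistinctCount</em> helper are hypothetical, and <em>Artist</em> mirrors the case class used later in this post.</p>

<pre class="wp-block-code"><code>import org.apache.spark.sql.{DataFrame, SparkSession}

// Defined at the top level so that Spark can derive an encoder for it.
case class Artist(id: String, name: String, location: String)

object HiddenColumnsSketch {
  // distinct() compares every column of the underlying data, including hidden ones.
  def sparkDistinctCount(spark: SparkSession, input: DataFrame): Long = {
    import spark.implicits._
    input.as[Artist].distinct().count()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the article).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("hidden-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two rows that are the same Artist but differ in the hidden "label" column.
    val input = Seq(
      ("1", "Artist A", "Warsaw", "Label X"),
      ("1", "Artist A", "Warsaw", "Label Y")
    ).toDF("id", "name", "location", "label")

    // Spark keeps both rows because "label" differs.
    println(sparkDistinctCount(spark, input)) // prints: 2

    // On the driver, the collected Artist objects are equal, so only one remains.
    println(input.as[Artist].collect().toSet.size) // prints: 1

    spark.stop()
  }
}</code></pre>

<p>The mismatch between the two numbers is exactly the kind of silent data quality issue mentioned above: the typed view suggests duplicates, while Spark sees two different rows.</p>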



<h2 class="wp-block-heading">A defensive version of <em>.as[T]</em>&nbsp;</h2>



<p>The simplest defensive version of the cast (defensive meaning: adjusting the schema to the provided domain class) adds a <em>.select()</em> transformation call:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="case class Artist(id: String, name: String, location: String)

def toArtistsDefensive(input: DataFrame): Dataset[Artist] = { input
  .select(&quot;id&quot;, &quot;name&quot;, &quot;location&quot;)
  .as[Artist]
}" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">case</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">class</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Artist</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">id</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">String</span><span style="color: #D8DEE9FF">, </span><span style="color: #8FBCBB">name</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">String</span><span style="color: #D8DEE9FF">, </span><span style="color: #8FBCBB">location</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">String</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">toArtistsDefensive</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">input</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">): </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">[</span><span style="color: #8FBCBB">Artist</span><span style="color: #D8DEE9FF">] = </span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF"> input</span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">select</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">id</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">location</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">as</span><span style="color: #D8DEE9FF">[</span><span style="color: #D8DEE9">Artist</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #ECEFF4">}</span></span></code></pre></div>






<p>This implementation is not DRY-friendly: each modification of the <em>Artist</em> class requires finding and updating every related <em>select</em> call. Fortunately, with a bit of reflection, it can be refactored into a generic solution that trims the Dataset to only the columns the target case class expects.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="import scala.reflect.runtime.universe._

def toTDefensive[T &lt;: Product: TypeTag](input: DataFrame): Dataset[T] = { 
  val caseClassFields = typeOf[T].members
    .collect { case m: MethodSymbol if m.isCaseAccessor =&gt; m.name.toString }
    .toSeq
  
  val columns = caseClassFields
    .map(F.col _)
    .reverse

  input
    .select(columns: _*)
    .as[T]
}" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">scala</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">reflect</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">runtime</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">universe</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">_</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">toTDefensive</span><span style="color: #D8DEE9FF">[</span><span style="color: #8FBCBB">T</span><span style="color: #D8DEE9FF"> &lt;: </span><span style="color: #8FBCBB">Product</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">TypeTag</span><span style="color: #D8DEE9FF">](</span><span style="color: #8FBCBB">input</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">): </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">[</span><span style="color: #8FBCBB">T</span><span style="color: #D8DEE9FF">] = </span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #8FBCBB">val</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">caseClassFields</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">typeOf</span><span style="color: #D8DEE9FF">[</span><span style="color: #8FBCBB">T</span><span style="color: #D8DEE9FF">].</span><span style="color: #8FBCBB">members</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">collect</span><span style="color: #D8DEE9FF"> { </span><span style="color: #8FBCBB">case</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">m</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">MethodSymbol</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">m</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">isCaseAccessor</span><span style="color: #D8DEE9FF"> =&gt; </span><span style="color: #8FBCBB">m</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">name</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">toString</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">toSeq</span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #8FBCBB">val</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">caseClassFields</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">map</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">F</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">_</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">reverse</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #8FBCBB">input</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">select</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">_</span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    .</span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF">[</span><span style="color: #8FBCBB">T</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">}</span></span></code></pre></div>



<p>The post <a href="https://tantusdata.com/insights/dtasets-and-dataframes/">Datasets and DataFrames</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Monitoring Airflow jobs with TIG 2: data quality metrics</title>
		<link>https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-2-data-quality-metrics/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Tue, 30 Apr 2024 13:31:53 +0000</pubDate>
				<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[monitoring]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1914</guid>

					<description><![CDATA[<p>In the first article on Monitoring Airflow jobs with TIG, &#8220;System Metrics&#8221;, we have seen an example of Airflow installation with a TIG stack set up to monitor it. To fully utilize this stack, we should enrich the raw system metrics with statistics on the processed data. Without this, the metrics would tell if the [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-2-data-quality-metrics/">Monitoring Airflow jobs with TIG 2: data quality metrics</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="585" src="https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1-1024x585.jpg" alt="" class="wp-image-1919" srcset="https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1-1024x585.jpg 1024w, https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1-300x171.jpg 300w, https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1-768x439.jpg 768w, https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1-1536x878.jpg 1536w, https://tantusdata.com/app/uploads/2024/02/airflow_jobs-1.jpg 1792w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the first article on Monitoring Airflow jobs with TIG, &#8220;System Metrics&#8221;, we saw an example Airflow installation with a TIG stack set up to monitor it. To fully utilize this stack, we should enrich the raw system metrics with statistics on the processed data. Without them, the metrics tell us <em>whether</em> the data pipelines are doing anything, but not whether they are working on the <em>correct data</em>.</p>



<h2 class="wp-block-heading">What to look for?</h2>



<p>What can realistically be monitored is a deep topic without a one-size-fits-all answer. A safe starting point is to look at the size of the data, duplicates (or unique rows), null or missing column values, and basic aggregates (counts per enumerated type, or min/max values). The nice part is that you are not limited by any particular software: any numeric value can be reported to an InfluxDB database.</p>



<p>Equally important is not limiting the monitoring to the output of the whole pipeline. Checking the data volume and basic traits on the input and at intermediate steps is crucial, as it lets you identify and react to problems early (and avoid painful backtracking and recomputing of the whole pipeline).</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="515" src="https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2-1024x515.png" alt="" class="wp-image-1917" srcset="https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2-1024x515.png 1024w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2-300x151.png 300w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2-768x386.png 768w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2-1536x772.png 1536w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-2.png 1999w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">An example data metrics dashboard showing row count, unique row count, and null rows for three steps in an imaginary data pipeline: load, process, and export. A local demo is available on our GitHub.</figcaption></figure>



<h2 class="wp-block-heading">Computing the metrics</h2>



<p>Technically speaking, such monitoring requires two things in the pipeline: code (or a job) to compute the metric and a wrapper to send it to the metrics database. We do not cover the former in detail here &#8211; it may vary from a simple SQL query run via Hive / Impala to a side output of a Spark job.</p>
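
<p>As a rough illustration of the SQL route, the sketch below computes a row count, a unique row count and a null count in a single query. It uses Python's built-in <em>sqlite3</em> module as a stand-in for Hive / Impala; the <em>events</em> table, its columns and the sample rows are hypothetical and exist only for this example.</p>

```python
import sqlite3

# Hypothetical table and column names, used only for this illustration.
METRICS_QUERY = """
SELECT
    COUNT(*) AS row_count,
    (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM events)) AS unique_row_count,
    SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) AS null_value_count
FROM events
"""


def build_demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a few sample rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE events (id TEXT, kind TEXT, value INTEGER)")
    connection.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("1", "load", 10),
            ("2", "load", None),
            ("2", "load", None),  # exact duplicate of the previous row
        ],
    )
    return connection


def compute_metrics(connection: sqlite3.Connection) -> dict:
    """Run the metrics query and return a metric name -> value mapping."""
    cursor = connection.execute(METRICS_QUERY)
    names = [column[0] for column in cursor.description]
    return dict(zip(names, cursor.fetchone()))
```

<p>Each entry of the returned mapping is a plain number, so it can be reported to the metrics database as-is.</p>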



<h2 class="wp-block-heading">Storing the data for graphs</h2>



<p>The second part, done in Airflow, is sending the data to the database. At the time of writing, the built-in InfluxDB connector only allows querying the database. Please see our demo (especially the <em>plugins</em> directory) for an example implementation of an InfluxDB write. Alternatively, you can use the REST API, or call the <em>influx</em> command via <em>BashOperator</em>.</p>
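
<p>For the REST API route, a minimal sketch could look as follows. It is not the implementation from our demo: it assumes an InfluxDB 2.x server (the <em>/api/v2/write</em> endpoint with token authentication), supports integer fields only, and all measurement, tag, organization and bucket names are made up for the example.</p>

```python
import urllib.parse
import urllib.request


def _escape(text: str) -> str:
    """Escape commas, equals signs and spaces, which are special in line protocol keys and tag values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point in InfluxDB line protocol (integer fields only)."""
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    # The trailing "i" marks the value as an integer.
    field_part = ",".join(f"{_escape(k)}={int(v)}i" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def build_write_request(base_url: str, org: str, bucket: str, token: str, line: str) -> urllib.request.Request:
    """Build (but do not send) a write request for the InfluxDB 2.x HTTP API."""
    query = urllib.parse.urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    return urllib.request.Request(
        url=f"{base_url}/api/v2/write?{query}",
        data=line.encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )
```

<p>Passing the returned request to <em>urllib.request.urlopen</em> would perform the actual write. In a real DAG, the token should come from an Airflow connection or a secrets backend rather than from the code.</p>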



<h2 class="wp-block-heading">Merging both steps or not?</h2>



<p>The compute and send steps may be squashed into a single bash step instead of being scheduled separately and stitched together via XComs. However, the more complicated or time-consuming the calculation is, the more attractive the separated approach becomes. This is a decision for you to make; we provide an example of the former (single-step) approach.</p>



<h2 class="wp-block-heading">Summary</h2>



<p>After the first part, the example stack monitors the system, so an operator can see whether it is working and whether it is overloaded or has some kind of bottleneck. This part adds basic monitoring of data quality. Adding such metrics <em>at multiple points</em> in the data pipeline also enables verification <em>during the processing</em> &#8211; in a centralized place (in this case, a web UI). Once again, a development/demo environment is available on our GitHub.</p>
<p>The post <a href="https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-2-data-quality-metrics/">Monitoring Airflow jobs with TIG 2: data quality metrics</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Monitoring Airflow jobs with TIG 1: system metrics</title>
		<link>https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-1-system-metrics/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Tue, 16 Apr 2024 10:25:34 +0000</pubDate>
				<category><![CDATA[Apache Airflow]]></category>
		<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[sysops]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1907</guid>

					<description><![CDATA[<p>Like many server applications, Airflow can – and should – be monitored for metrics and logs. In this article, we will look into the former and the integration with the TIG stack. The goal is to pull example metrics into a time series database and visualize it in a web application. This article will focus [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-1-system-metrics/">Monitoring Airflow jobs with TIG 1: system metrics</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1-1024x585.jpg" alt="" class="wp-image-2176" srcset="https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1-1024x585.jpg 1024w, https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1-300x171.jpg 300w, https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1-768x439.jpg 768w, https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1-1536x878.jpg 1536w, https://tantusdata.com/app/uploads/2024/04/Airflow-jobs-with-the-TIG-stack_1.jpg 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Like many server applications, Airflow can – and should – be monitored for metrics and logs. In this article, we will look into the former and the integration with the TIG stack (Telegraf, InfluxDB, Grafana). The goal is to pull example metrics into a time series database and visualize them in a web application. This article focuses on Airflow (AF) system metrics, such as performance, load or health. We cover reporting the <em>data</em> metrics into the same database in the second article, &#8220;Monitoring Airflow jobs with TIG 2: data quality metrics&#8221;.</p>



<h2 class="wp-block-heading">Why?</h2>



<p>Airflow itself finally got a built-in <em>Cluster status</em> dashboard, so it is valid to question the need for a separate dashboard that requires additional effort and maintenance. Why bother, then? There are several possible reasons to introduce a separate monitoring stack:</p>



<ul class="wp-block-list">
<li>Aggregated view: looking into one app is easier than browsing through ten of them. TIG (or any other monitoring stack) can monitor multiple instances and apps, and is easily integrated with custom ones.</li>



<li>Security: there is no need to grant access to AF itself (or any other application) to see the metrics and pinpoint errors. This aligns with data mesh and other data democratization approaches.</li>



<li>Support friendly: your first-line support has a starting point for checking the status of the data processing and can call the right people in case of a problem.</li>
</ul>



<h2 class="wp-block-heading">The toolkit</h2>



<p>Out of the box (given the right pip packages), Airflow supports sending its internal metrics to a StatsD server. We can leverage that by letting <em>Telegraf</em> act as such a server and proxy the metrics into a time series database, InfluxDB. Grafana serves as a convenient UI on top.</p>
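
<p>To make the first hop more tangible: StatsD metrics are short plain-text datagrams, typically sent over UDP. The sketch below is not Airflow's own client, only an illustration of the wire format that a StatsD listener receives. The host and port are assumptions (8125 is the conventional StatsD port), and the metric names merely follow the naming scheme from the Airflow documentation.</p>

```python
import socket


def format_statsd_metric(name: str, value: int, metric_type: str) -> bytes:
    """Encode one metric as <name>:<value>|<type>, e.g. "c" (counter), "g" (gauge), "ms" (timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd_metric(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget: send the datagram over UDP without waiting for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

<p>For example, <em>send_statsd_metric(format_statsd_metric("airflow.scheduler_heartbeat", 1, "c"))</em> would emit a single counter increment. Because UDP is connectionless, the sender gets no delivery confirmation, so it is worth verifying on the receiving side that metrics actually arrive.</p>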



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1001" height="446" src="https://tantusdata.com/app/uploads/2024/02/example-architecture.png" alt="" class="wp-image-1910" srcset="https://tantusdata.com/app/uploads/2024/02/example-architecture.png 1001w, https://tantusdata.com/app/uploads/2024/02/example-architecture-300x134.png 300w, https://tantusdata.com/app/uploads/2024/02/example-architecture-768x342.png 768w" sizes="auto, (max-width: 1001px) 100vw, 1001px" /></figure>



<h2 class="wp-block-heading">TIG stack configuration</h2>



<p>Firstly, let’s configure the TIG stack to accept the metrics:</p>



<ol class="wp-block-list">
<li>Install an InfluxDB instance. Nothing fancy here.</li>



<li>Install Telegraf. In its configuration, enable the <em>statsd</em> input and configure the <em>InfluxDB</em> output. The latter requires at least authentication. Note that Telegraf supports many more inputs and can monitor, e.g., system resources (CPU, memory, disk space, etc.), which is worth configuring in a production environment.</li>



<li>Install Grafana and configure a <em>data source</em> pointing to InfluxDB.</li>
</ol>






<p>We have prepared a Docker Compose-based demo in the GitHub repository with a pre-configured environment. There you can find example configuration for Telegraf (<em>telegraf.conf</em>), InfluxDB (<em>influxdb.env</em>) and Grafana (<em>grafana-provisioning/datasources</em>). Please note that this is for dev/demo purposes only; real production environments need to be set up more securely (including the use of HTTPS and secure password handling).</p>



<h2 class="wp-block-heading">Airflow</h2>



<p>On the Airflow side, there are two significant points to be addressed.</p>



<ol class="wp-block-list">
<li>Airflow requires the <em>apache-airflow[statsd]</em> package to have the <em>statsd</em> client available – you can install it via pip.</li>



<li>In the Airflow configuration file (usually <em>airflow.cfg</em>), the <em>[metrics]</em> section must be configured to enable <em>statsd</em>, point it to the Telegraf instance and prefix the metrics (very useful with multiple Airflow instances).</li>
</ol>
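
<p>The second point can be sketched as follows. The embedded fragment shows what the <em>[metrics]</em> section might look like; the host, port and prefix values are examples only and need to match your Telegraf setup. The small helper around it is hypothetical and simply parses the fragment with Python's <em>configparser</em> to sanity-check the values.</p>

```python
import configparser

# Example values only: adjust host, port and prefix to your environment.
EXAMPLE_AIRFLOW_CFG = """
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow_instance_1
"""


def read_statsd_settings(cfg_text: str) -> dict:
    """Parse the [metrics] section and return the StatsD-related settings."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    metrics = parser["metrics"]
    return {
        "enabled": metrics.getboolean("statsd_on"),
        "host": metrics.get("statsd_host"),
        "port": metrics.getint("statsd_port"),
        "prefix": metrics.get("statsd_prefix"),
    }
```

<p>Using a distinct prefix per Airflow instance keeps the metrics of different instances apart in the database.</p>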



<p>Airflow can send a number of metrics. You can see the complete list in the Airflow documentation <a href="https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html">here</a>, which also describes how to limit which ones are actually reported to Telegraf. Be aware that even though most of the metrics are documented as being reported in seconds, you need to validate the units yourself (<a href="https://github.com/apache/airflow/issues/20804">as in this example</a>).</p>



<h2 class="wp-block-heading">Dashboards</h2>



<p>The last step is to configure Grafana and set up some dashboards. The UI offers a WYSIWYG editor that you can use to tailor them to <em>your</em> needs. The example available on our GitHub might serve as a starting point, as it shows:</p>



<ul class="wp-block-list">
<li>state of the Airflow executors (queues and tasks being run at the moment) to see whether any processing is going on,</li>



<li>state of the pools (default and the custom ones) to check potential bottlenecks if you use multiple pools,</li>



<li>task times to identify unexpected stragglers (and compare instances’ run times and find any performance challenges early on).</li>
</ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="532" src="https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-1024x532.png" alt="" class="wp-image-1912" srcset="https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-1024x532.png 1024w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-300x156.png 300w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-768x399.png 768w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard-1536x798.png 1536w, https://tantusdata.com/app/uploads/2024/02/grafana-dashboard.png 1999w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">An example dashboard, available on our GitHub. It contains metrics useful to monitor the load on the Airflow instance and identify potential overload or bottlenecks in the data pipeline processing.</figcaption></figure>



<h2 class="wp-block-heading">Summary</h2>



<p>In conclusion, for Airflow monitoring, you can use a specialized metrics stack like TIG and aggregate the metrics from multiple AF instances. The stack can accommodate AF system metrics as well as data from other applications, including your custom ones. Sending data quality metrics is an example we look into in the second part. There is a demo environment on our GitHub to see whether this works for you.</p>
<p>The post <a href="https://tantusdata.com/insights/monitoring-airflow-jobs-with-tig-1-system-metrics/">Monitoring Airflow jobs with TIG 1: system metrics</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Databricks &#8211; Photon</title>
		<link>https://tantusdata.com/insights/databricks-photon/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Tue, 02 Apr 2024 12:52:56 +0000</pubDate>
				<category><![CDATA[Apache Airflow]]></category>
		<category><![CDATA[DAG dependencies]]></category>
		<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[job orchestration]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1903</guid>

					<description><![CDATA[<p>The Databricks platform offers two execution engines for the clients: the standard Apache Spark (available as an open-source application) and one with Photon enhancement that brings a performance improvement (as well as extra pricing). Have you ever wondered where this speedup comes from and how it affects designing Apache Spark jobs? This article is based [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/databricks-photon/">Databricks &#8211; Photon</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://tantusdata.com/app/uploads/2024/04/improving_performance_1-1024x585.jpg" alt="" class="wp-image-2178" srcset="https://tantusdata.com/app/uploads/2024/04/improving_performance_1-1024x585.jpg 1024w, https://tantusdata.com/app/uploads/2024/04/improving_performance_1-300x171.jpg 300w, https://tantusdata.com/app/uploads/2024/04/improving_performance_1-768x439.jpg 768w, https://tantusdata.com/app/uploads/2024/04/improving_performance_1-1536x878.jpg 1536w, https://tantusdata.com/app/uploads/2024/04/improving_performance_1.jpg 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The Databricks platform offers two execution engines for the clients: the standard Apache Spark (available as an open-source application) and one with Photon enhancement that brings a performance improvement (as well as extra pricing). Have you ever wondered where this speedup comes from and how it affects designing Apache Spark jobs?</p>



<p>This article is based on the paper on Photon written by Databricks engineers (linked in the Sources section). That publication is the closest we can get to the source, as the Photon engine is not an open-source project and its source code is not available to the public.</p>



<h2 class="wp-block-heading">The general idea</h2>



<p>In a nutshell, Photon replaces the standard (bundled) query engine available in Apache Spark while keeping the same API. For some operations, mostly CPU-heavy ones, the Catalyst optimizer may decide to send the work to Photon instead of the default execution path, purely for performance reasons. In other words, it is an alternative way to execute the tasks of a Spark DAG; it does not skip, reorder, or reorganize them.</p>



<h2 class="wp-block-heading">SIMD</h2>



<p>SIMD stands for <em>Single Instruction, Multiple Data</em> and is one of the foundational ideas behind job optimisations in Photon. On current CPU architectures, running the same operation on multiple data items (values, rows, etc.) can be optimised via <em>vectorisation</em>, even within a single thread. Photon is said to utilise those optimisations.</p>



<h2 class="wp-block-heading">C++ and Code generation</h2>



<p>The Photon engine is implemented in C++ instead of a JVM language (Scala, Java or others). The source paper cites performance reasons, including ‘hitting performance ceilings’. Communication with the rest of Apache Spark is implemented via JNI. Databricks’ internal benchmarks indicate that the performance hit of moving data in and out of the JVM is not noticeable.</p>



<h2 class="wp-block-heading">Internal data format</h2>



<p>The Photon engine uses a columnar data representation (as does, e.g., the Parquet data format) instead of row-oriented data (like the rest of Apache Spark). This is due to the SIMD optimisations: the kernel implementations work best on columnar data. Memory management (allocating, freeing, etc.) is still done via Apache Spark’s memory manager. The data is kept off-heap, so transferring it from Photon to Spark does not require copying.</p>
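
<p>The difference between the two layouts can be illustrated with a toy Python sketch. It has nothing to do with Photon's actual C++ code and a Python loop is of course not SIMD; it only shows how a columnar layout keeps the values of one column together, so that a simple "kernel" can apply the same operation to all of them. The sample rows and function names are made up.</p>

```python
from array import array

# Row-oriented data: each tuple is one hypothetical (id, price) record.
ROWS = [(1, 10.0), (2, 12.5), (3, 7.5)]


def to_columns(rows):
    """Convert rows into two typed, contiguous columns."""
    ids = array("q", (row[0] for row in rows))      # signed 64-bit integers
    prices = array("d", (row[1] for row in rows))   # 64-bit floats
    return ids, prices


def add_constant(column: array, delta: float) -> array:
    """A toy kernel: apply the same operation to every value of a single column."""
    return array(column.typecode, (value + delta for value in column))
```

<p>With rows, the same computation would have to step over the unrelated <em>id</em> values of every record; with columns, the kernel touches only the data it needs, which is the access pattern that vectorised CPU instructions favour.</p>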



<p>When a shuffle operation is necessary, Photon writes a shuffle file and uses the Spark API to execute the exchange. However, the data format is not compatible with vanilla Spark, so a Photon shuffle write must be followed by a Photon shuffle read.</p>



<h2 class="wp-block-heading">When does it help?</h2>



<p>Photon is meant to address CPU-heavy loads. This includes joins (especially hash joins) and aggregations. On the other hand, being a non-JVM implementation, Photon does not support UDFs or the RDD API. Exact benchmarks and precise speedups are given in the source paper.</p>



<h2 class="wp-block-heading">Sources</h2>



<ul class="wp-block-list">
<li>Source paper: <a href="https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf">Photon: A Fast Query Engine for Lakehouse Systems</a></li>
</ul>



<p>I hope this helps. Moreover, if you know any other good sources, do let us know on social media so that everyone can see them.</p>
<p>The post <a href="https://tantusdata.com/insights/databricks-photon/">Databricks &#8211; Photon</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Airflow — pools and mutexes.</title>
		<link>https://tantusdata.com/insights/airflow-pools-and-mutexes/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Tue, 19 Mar 2024 13:47:32 +0000</pubDate>
				<category><![CDATA[Apache Airflow]]></category>
		<category><![CDATA[DAG dependencies]]></category>
		<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[job orchestration]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1898</guid>

					<description><![CDATA[<p>Although the ideal data pipeline is made of idempotent and independent tasks, there are some cases when setting up a mutex (a.k.a. part of the job that cannot be run concurrently) is necessary. Fortunately, Airflow supports such cases and offers a few tools varying by complexity to implement such a pipeline. In this article, we [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/airflow-pools-and-mutexes/">Airflow — pools and mutexes.</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://tantusdata.com/app/uploads/2024/03/airflow-1024x585.png" alt="" class="wp-image-2180" srcset="https://tantusdata.com/app/uploads/2024/03/airflow-1024x585.png 1024w, https://tantusdata.com/app/uploads/2024/03/airflow-300x171.png 300w, https://tantusdata.com/app/uploads/2024/03/airflow-768x439.png 768w, https://tantusdata.com/app/uploads/2024/03/airflow-1536x878.png 1536w, https://tantusdata.com/app/uploads/2024/03/airflow.png 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Although the ideal data pipeline is made of idempotent and independent tasks, there are cases when setting up a mutex (i.e. a part of the job that cannot be run concurrently) is necessary. Fortunately, Airflow supports such cases and offers a few tools of varying complexity to implement such a pipeline.</p>



<p>In this article, we will look at the following DAG in AF. The graph itself is relatively simple; the catch is that the <em>load_1</em> and <em>load_2</em> operators cannot have concurrently running task instances. We will consider both treating the loads separately and treating them as a group.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="408" src="https://tantusdata.com/app/uploads/2024/02/Airflow-DAG-1024x408.png" alt="" class="wp-image-1899" srcset="https://tantusdata.com/app/uploads/2024/02/Airflow-DAG-1024x408.png 1024w, https://tantusdata.com/app/uploads/2024/02/Airflow-DAG-300x120.png 300w, https://tantusdata.com/app/uploads/2024/02/Airflow-DAG-768x306.png 768w, https://tantusdata.com/app/uploads/2024/02/Airflow-DAG-1536x612.png 1536w, https://tantusdata.com/app/uploads/2024/02/Airflow-DAG.png 1586w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Therefore, there are three scenarios we will look into:</p>



<ol class="wp-block-list">
<li>Only one instance of the operator can be running at a time.</li>



<li>An instance of the operator cannot run before the older ones have succeeded.</li>



<li>Only one instance from a group of operators can be running at a time.</li>
</ol>



<p>The examples use the decorator syntax for Airflow tasks. Still, the concept stays the same for the 1.0-compatible approach: instead of passing the parameters to the decorator, pass them to the constructor of any <em>Operator</em> class.</p>



<h2 class="wp-block-heading">Mutex on an operator</h2>



<p>The first scenario is that only one instance of the operator can be running at a time. If there are multiple runs ready to be scheduled, it does not matter which one goes first. One solution is to set the <em>max_active_tis_per_dag</em> option to 1.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="@task(
   max_active_tis_per_dag=1
)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D08770">@task</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D08770">max_active_tis_per_dag</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p></p>



<h2 class="wp-block-heading">Dependency on past runs</h2>



<p>The second example assumes that the <em>n</em>th batch cannot start before the (<em>n</em>-1)th one has completed successfully (or has at least been marked so in Airflow). For this use case, Airflow offers the <em>depends_on_past</em> flag. Here you have to pay attention to the state of the latest runs: a single failed, upstream-failed, or waiting task can halt all future DAG runs.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="@task(
   depends_on_past=True
)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D08770">@task</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D08770">depends_on_past</span><span style="color: #81A1C1">=</span><span style="color: #D08770">True</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
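<p>To make the halting behaviour concrete, here is a minimal plain-Python sketch of the rule, not Airflow code. It assumes that a task instance may start only when the same task in the previous run ended as success or skipped; the function name <em>first_blocked_run</em> and the state list are hypothetical and exist only for illustration.</p>

```python
from typing import List, Optional

# States of the previous run's task instance that let the next one start
# (assumption of this sketch: success or skipped).
PASSING_STATES = {"success", "skipped"}


def first_blocked_run(states: List[Optional[str]]) -> Optional[int]:
    """Return the index of the first run that cannot start, or None.

    states[i] is the state of the task instance in run i (None = not finished).
    Run 0 has no predecessor, so it is never blocked.
    """
    for i in range(1, len(states)):
        if states[i - 1] not in PASSING_STATES:
            return i
    return None


# Runs 0 and 1 succeeded, run 2 failed: run 3 and every later run must wait.
history = ["success", "success", "failed", None, None]
blocked_from = first_blocked_run(history)
```

<p>In this model, a single failed run blocks every later one until its state is fixed, which is why the state of the latest runs deserves attention.</p>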



<p></p>



<h2 class="wp-block-heading">Pools &#8211; mutex across multiple operators</h2>



<p>The most complex approach is required if we need to group multiple operators so that they share a mutex. One way to do it is to put them into one custom pool and limit it to accommodate only one task at a time, either by creating the pool with 1 slot or by assigning each operator a number of required slots high enough that no two of them fit in the pool together.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="load_pool = Pool.create_or_update_pool(
   name=&quot;load_pool&quot;,
   slots=1,
   description=&quot;Pool for data load tasks.&quot;
)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">load_pool</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Pool</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">create_or_update_pool</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">name</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">load_pool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">slots</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">description</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Pool for data load tasks.</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p></p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="@task(
   pool=load_pool.pool
)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D08770">@task</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D08770">pool</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">load_pool</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">pool</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
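<p>The two ways of limiting the pool can be compared with a small plain-Python model. This is a sketch of the slot accounting only, not of the Airflow scheduler: the <em>PoolModel</em> class and its methods are hypothetical, and the <em>pool_slots</em> argument merely mirrors the operator parameter of the same name.</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoolModel:
    """Toy model of a pool: a fixed number of slots shared by tasks."""
    slots: int
    used: int = 0
    running: List[str] = field(default_factory=list)

    def try_start(self, task_id: str, pool_slots: int = 1) -> bool:
        """Start the task if enough free slots remain; otherwise refuse."""
        if self.used + pool_slots > self.slots:
            return False
        self.used += pool_slots
        self.running.append(task_id)
        return True

    def finish(self, task_id: str, pool_slots: int = 1) -> None:
        """Release the slots held by a finished task."""
        self.running.remove(task_id)
        self.used -= pool_slots


# Variant 1: a pool with a single slot, each load takes one slot.
single = PoolModel(slots=1)
started_single = [single.try_start("load_1"), single.try_start("load_2")]

# Variant 2: a larger pool where each load claims every slot.
wide = PoolModel(slots=8)
started_wide = [
    wide.try_start("load_1", pool_slots=8),
    wide.try_start("load_2", pool_slots=8),
]
```

<p>In both variants the second load is refused until the first one releases its slots. The second variant is useful when the same pool also has to serve lighter tasks that may run side by side.</p>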



<p></p>



<h2 class="wp-block-heading">Source</h2>



<p>The source code for all examples and the Docker environment to run them are available on GitHub.</p>
<p>The post <a href="https://tantusdata.com/insights/airflow-pools-and-mutexes/">Airflow — pools and mutexes.</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Passing information between DAGs in Airflow.</title>
		<link>https://tantusdata.com/insights/passing-information-between-dags-in-airflow/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Thu, 08 Feb 2024 13:34:54 +0000</pubDate>
				<category><![CDATA[Apache Airflow]]></category>
		<category><![CDATA[DAG dependencies]]></category>
		<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[job orchestration]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1893</guid>

					<description><![CDATA[<p>Some data pipelines need to pass values between tasks &#8211; not complete datasets, but only a few kilobytes. This can be managed within Airflow itself. As always, multiple options are available &#8211; let’s review some of them. In this article, we are looking at sharing data between DAGs, which are connected via [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/passing-information-between-dags-in-airflow/">Passing information between DAGs in Airflow.</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://tantusdata.com/app/uploads/2024/02/dependencies_1-1024x585.jpg" alt="" class="wp-image-2182" srcset="https://tantusdata.com/app/uploads/2024/02/dependencies_1-1024x585.jpg 1024w, https://tantusdata.com/app/uploads/2024/02/dependencies_1-300x171.jpg 300w, https://tantusdata.com/app/uploads/2024/02/dependencies_1-768x439.jpg 768w, https://tantusdata.com/app/uploads/2024/02/dependencies_1-1536x878.jpg 1536w, https://tantusdata.com/app/uploads/2024/02/dependencies_1.jpg 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Some data pipelines need to pass values between tasks &#8211; not complete datasets, but only a few kilobytes. This can be managed within Airflow itself. As always, multiple options are available &#8211; let’s review some of them.</p>



<p>In this article, we look at sharing data between DAGs that are connected via run dependencies. Let’s assume that each DAG needs to run daily and that the first DAG generates some important data for the second DAG.</p>



<h2 class="wp-block-heading">XCom</h2>



<p>XCom is the first and recommended approach. It works well with out-of-the-box features like <em>ExternalTaskSensor</em>:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="with DAG(
       dag_id=&quot;xcom-sink&quot;,
       schedule_interval=&quot;@daily&quot;,
       start_date=datetime(2023, 7, 1),
       catchup=True,
) as xcom_sink:
   ExternalTaskSensor(
       task_id=&quot;wait-for-dependency&quot;,
       external_dag_id=&quot;xcom-source&quot;,
       external_task_id=&quot;update-hive-table-events-triggers&quot;
   ) &gt;&gt; BashOperator(
       task_id=&quot;show-xcom&quot;,
       bash_command=&quot;echo {{ ti.xcom_pull(dag_id='xcom-source', task_ids='update-hive-table-events-triggers') }}&quot;
   )" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">dag_id</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">xcom-sink</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">schedule_interval</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">start_date</span><span style="color: #81A1C1">=</span><span style="color: #88C0D0">datetime</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">catchup</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> xcom_sink:</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #88C0D0">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">task_id</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-dependency</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">external_dag_id</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">xcom-source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">external_task_id</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table-events-triggers</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">   ) </span><span style="color: #81A1C1">&gt;&gt;</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">BashOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">task_id</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">show-xcom</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">bash_command</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">echo {{ ti.xcom_pull(dag_id=&#39;xcom-source&#39;, task_ids=&#39;update-hive-table-events-triggers&#39;) }}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">   )</span></span></code></pre></div>



<p></p>



<h2 class="wp-block-heading">XCom vs run id</h2>



<p>An XCom value is identified by the DAG ID, task ID and run ID. If you want to run a single DAG with a custom run ID, you have to ensure that an XCom value has already been created for that run ID. This complicates issuing manual runs.</p>
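<p>A minimal plain-Python sketch of this lookup shows why a manual run comes back empty-handed. It is a model of the keying only, not Airflow code: the <em>push_value</em> and <em>pull_value</em> helpers, the run IDs and the stored value are all hypothetical.</p>

```python
from typing import Any, Dict, Optional, Tuple

# Toy store keyed by (dag_id, task_id, run_id).
XComKey = Tuple[str, str, str]
store: Dict[XComKey, Any] = {}


def push_value(dag_id: str, task_id: str, run_id: str, value: Any) -> None:
    """Store a value under the full three-part key."""
    store[(dag_id, task_id, run_id)] = value


def pull_value(dag_id: str, task_id: str, run_id: str) -> Optional[Any]:
    """Return the stored value, or None if nothing was pushed for this key."""
    return store.get((dag_id, task_id, run_id))


SOURCE_DAG = "xcom-source"
SOURCE_TASK = "update-hive-table-events-triggers"
SCHEDULED_RUN = "scheduled__2023-07-01T00:00:00+00:00"  # illustrative run ID

# The scheduled source run pushes a value under its own run ID.
push_value(SOURCE_DAG, SOURCE_TASK, SCHEDULED_RUN, "events/2023-07-01")

# A sink run with the same run ID finds the value ...
from_scheduled = pull_value(SOURCE_DAG, SOURCE_TASK, SCHEDULED_RUN)
# ... while a manual run with a custom run ID finds nothing.
from_manual = pull_value(SOURCE_DAG, SOURCE_TASK, "manual__my-custom-run")
```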



<h2 class="wp-block-heading">When not to XCom?</h2>



<p>Rather than discussing the cases XCom is tailored for, let’s focus on the ones that are not well supported. In short, XCom does not work well with Datasets, and you might get some quirky results, as discussed here: <a href="https://github.com/apache/airflow/discussions/33069">https://github.com/apache/airflow/discussions/33069</a>.</p>



<p>With datasets, you need to refer to the <em>most recent past value</em> of the XCom, effectively losing all the benefits of tight coupling. You will also need to deal with race conditions: if a sink DAG run other than the latest one is restarted, it will receive an incorrect value from the source. Moreover, let’s consider a design with multiple sources and a single sink:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://tantusdata.com/app/uploads/2024/02/example-dag-1024x475.png" alt="" class="wp-image-1894" srcset="https://tantusdata.com/app/uploads/2024/02/example-dag-1024x475.png 1024w, https://tantusdata.com/app/uploads/2024/02/example-dag-300x139.png 300w, https://tantusdata.com/app/uploads/2024/02/example-dag-768x357.png 768w, https://tantusdata.com/app/uploads/2024/02/example-dag-1536x713.png 1536w, https://tantusdata.com/app/uploads/2024/02/example-dag.png 1999w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Suppose the <em>events</em> dataset gets updated twice and the <em>users</em> dataset only once. In that case, <em>events-with-users</em> will receive only the latest value of the former. It is up to you to decide whether this behaviour is expected or unacceptable.</p>
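<p>The scenario can be replayed with a small plain-Python simulation. It is a sketch under an assumed triggering rule, not Airflow code: the sink runs once every required dataset has been updated at least once since its previous run, and it sees only the latest value of each. The <em>simulate</em> function and the value labels are hypothetical.</p>

```python
from typing import Dict, List, Set, Tuple


def simulate(updates: List[Tuple[str, str]], required: Set[str]) -> List[Dict[str, str]]:
    """Replay (dataset, value) updates; return what each sink run would see.

    Assumed rule: the sink is triggered once all required datasets have been
    updated since its previous run, and it reads the latest values only.
    """
    latest: Dict[str, str] = {}
    pending: Set[str] = set()
    runs: List[Dict[str, str]] = []
    for dataset, value in updates:
        latest[dataset] = value
        pending.add(dataset)
        if required <= pending:
            runs.append({name: latest[name] for name in sorted(required)})
            pending.clear()
    return runs


# events is updated twice, users once: only one sink run is triggered.
sink_runs = simulate(
    [("events", "events-v1"), ("events", "events-v2"), ("users", "users-v1")],
    required={"events", "users"},
)
```

<p>In this model the first <em>events</em> update is never seen by the sink, which is exactly the behaviour you have to accept or design around.</p>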



<h2 class="wp-block-heading">Variables</h2>



<p>In rare cases, you can use Airflow <em>Variables</em> to solve the issue. <em>Variables</em> are a built-in Airflow mechanism that provides a global state (or configuration) for all DAGs to read and write. You can use them to implement <a href="https://java-design-patterns.com/patterns/registry/"><em>the Registry</em> code pattern</a>:</p>



<ol class="wp-block-list">
<li>Put a hashmap of <em>date of run -&gt; metadata</em> into a variable.</li>



<li>Use the <em>execution date</em> (named <em>logical_date</em> in the newer versions) as the hashmap key.</li>



<li>Use <em>PythonOperator</em> to update the variable. Reading can be done from either Python code or templates.</li>



<li>Use sensors or datasets to schedule DAGs in the correct order.</li>
</ol>
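<p>The steps above can be sketched in plain Python. The sketch works on a JSON string so that it runs without Airflow; in a real DAG that string would presumably be read and written with <em>Variable.get</em> and <em>Variable.set</em> inside a <em>PythonOperator</em> callable. The <em>register</em> and <em>lookup</em> helpers, the paths and the row counts are hypothetical.</p>

```python
import json
from typing import Any, Dict, Optional


def register(registry_json: str, logical_date: str, metadata: Dict[str, Any]) -> str:
    """Return a new JSON registry with metadata stored under logical_date."""
    registry = json.loads(registry_json) if registry_json else {}
    registry[logical_date] = metadata
    return json.dumps(registry, sort_keys=True)


def lookup(registry_json: str, logical_date: str) -> Optional[Dict[str, Any]]:
    """Return the metadata for logical_date, or None if it was never registered."""
    registry = json.loads(registry_json) if registry_json else {}
    return registry.get(logical_date)


# The source DAG would register its output after each run ...
stored = register("", "2023-07-01", {"path": "/data/events/2023-07-01", "rows": 1200})
stored = register(stored, "2023-07-02", {"path": "/data/events/2023-07-02", "rows": 900})

# ... and the sink DAG would look it up using its own logical date as the key.
found = lookup(stored, "2023-07-02")
missing = lookup(stored, "2023-07-03")
```

<p>Note that the update is a read-modify-write cycle: two runs updating the registry at the same time can overwrite each other, so the writers need to be serialised.</p>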



<h2 class="wp-block-heading">Beware!</h2>



<p>Before you go down the variable route, please keep in mind that variables are considerably more costly to maintain than XComs. You might need to keep track of the variables’ sizes, implement error handling in your <em>PythonOperators</em> and watch for any unsolicited changes to the variables’ values.</p>



<h2 class="wp-block-heading">External system?</h2>



<p>There is always the option of using an external data service to synchronise, much like using built-in variables. This may come in the form of data paths on HDFS / S3 / other storage, batch load dates in a database, or similar. However, this solution results in an inferior design, as you end up with the disadvantages of the variables approach plus new implicit dependencies between DAGs and external services.</p>



<h2 class="wp-block-heading">Summary</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">XCom + dataset</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">XCom + sensor</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Variables</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">External</mark></td></tr><tr><td>+ no tight coupling between DAGs</td><td>+ supports 1:1 task run relationships</td><td>+ works with both sensors and datasets</td><td>+ works even with 3rd party systems</td></tr><tr><td>&#8211; race conditions<br><br>&#8211; no guarantee of 1:1 run relationships</td><td>&#8211; a bit tighter coupling<br><br>&#8211; a manual run requires supplying the exec_date by hand</td><td>&#8211; handles only data sharing, not orchestration<br><br>&#8211; requires way more effort than XCom</td><td>&#8211; requires even more effort<br><br>&#8211; hidden dependencies outside Airflow<br><br>&#8211; complicated architecture</td></tr></tbody></table><figcaption class="wp-element-caption">Comparison</figcaption></figure>



<h2 class="wp-block-heading">Source</h2>



<p>The source code for both XCom and <em>Variable</em> examples and the Docker environment to run them are available on GitHub.</p>
<p>The post <a href="https://tantusdata.com/insights/passing-information-between-dags-in-airflow/">Passing information between DAGs in Airflow.</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>NPM &#038; N on MacOSX</title>
		<link>https://tantusdata.com/insights/setting-up-npm-and-n-on-macosx-a-developers-guide/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Mon, 11 Sep 2023 11:00:00 +0000</pubDate>
				<category><![CDATA[IntelliJ IDEA]]></category>
		<category><![CDATA[node.js]]></category>
		<category><![CDATA[node.js installation]]></category>
		<category><![CDATA[non-system-wide installation]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1602</guid>

					<description><![CDATA[<p>Most programming languages have their SDK available for non-system-wide installation. Python offers venv, while Java Development Kit needs only the JAVA_HOME variable to be set, etc. At first glance, this does not seem to be the case with Node.js for JavaScript, where developers are expected to install all tools system-wide, but there are ways to [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/setting-up-npm-and-n-on-macosx-a-developers-guide/">NPM &amp; N on MacOSX</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-1024x576.jpg" alt="" class="wp-image-1763" srcset="https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-1536x864.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/NPM-N-on-MacOSX-2048x1152.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Most programming languages have their SDK available for non-system-wide installation. Python offers venv, while Java Development Kit needs only the JAVA_HOME variable to be set, etc. At first glance, this does not seem to be the case with Node.js for JavaScript, where developers are expected to install all tools system-wide, but there are ways to achieve it. One is to use a tool called n &#8211; a simple-to-use wrapper and manager for node.js, as described below.</p>



<h2 class="wp-block-heading">Installation</h2>



<p>Three short steps that work for MacOSX:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="# Install n

brew install n

# Set N_PREFIX to avoid writing into system directories

export N_PREFIX=$HOME/.n

export PATH=$PATH:$N_PREFIX/bin

# Install latest node.js

n latest" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Install</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">n</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">brew</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">install</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">n</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Set</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">N_PREFIX</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">avoid</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">writing</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">system</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">directories</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">export</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">N_PREFIX</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">$HOME</span><span style="color: #81A1C1">/</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">n</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">export</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PATH</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF">$PATH</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9">$N_PREFIX</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">bin</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Install</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">latest</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">node</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">js</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">n</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">latest</span></span></code></pre></div>



<p></p>



<p>The full installation guide is available on the project’s GitHub page: (<a href="https://github.com/tj/n#third-party-installers">https://github.com/tj/n#third-party-installers</a>).&nbsp;</p>



<h2 class="wp-block-heading">IntelliJ IDEA</h2>



<p>I want to use an IDE to work with node.js apps. Unfortunately, IDEs like IntelliJ fail to find the node command when it is not on the system $PATH. One can work around this by running the framework’s commands from the terminal. However, there is also a more comfortable approach &#8211; updating the environment variables.</p>



<h3 class="wp-block-heading">N_PREFIX</h3>



<p>This one is pretty simple: create or update the <em>/etc/launchd.conf</em> file (source: <a href="https://stackoverflow.com/questions/135688/setting-environment-variables-on-os-x">https://stackoverflow.com/questions/135688/setting-environment-variables-on-os-x</a>):</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="setenv N_PREFIX &lt;absolute_path_to_n_prefix&gt;" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">setenv</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">N_PREFIX</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">&lt;</span><span style="color: #D8DEE9">absolute_path_to_n_prefix</span><span 
style="color: #81A1C1">&gt;</span></span></code></pre></div>



<p></p>



<h3 class="wp-block-heading">PATH</h3>



<p>For the <em>$PATH</em> variable, there is a way to set it that does not involve launchd:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="# Create a file in /etc/paths.d/ with any file name

# and value of directory you want to append to $PATH

echo &quot;/path/to/.n/bin&quot; | sudo tee -a /etc/paths.d/n" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">etc</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">paths</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">d</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">any</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">name</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">value</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">directory</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">you</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">want</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">append</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$PATH</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">echo</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/path/to/.n/bin</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">|</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">sudo</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">tee</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">etc</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">paths</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">d</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">n</span></span></code></pre></div>
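<p>To check whether a command resolves once a directory has been added to the search path, a small script can help. This is only a sketch: <em>resolve_command</em> is a hypothetical helper name, not part of n or IntelliJ, and it relies on nothing but Python’s standard <em>shutil.which</em>.</p>

```python
import os
import shutil


def resolve_command(command, extra_dir=None):
    """Return the full path of `command`, or None if it cannot be found.

    `extra_dir` is an optional directory (for example the n bin directory)
    that is searched after the directories already listed in PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        # Append, mirroring how the entry from /etc/paths.d/ is meant to extend PATH.
        search_path = os.pathsep.join([search_path, extra_dir])
    return shutil.which(command, path=search_path)
```

<p>Calling <em>resolve_command("node")</em> from a freshly started process returns <em>None</em> when node is not visible on the current $PATH; passing the n bin directory as <em>extra_dir</em> shows what the lookup would return once that directory is included.</p>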



<p></p>
<p>The post <a href="https://tantusdata.com/insights/setting-up-npm-and-n-on-macosx-a-developers-guide/">NPM &amp; N on MacOSX</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Managing inter-DAG dependencies in Airflow</title>
		<link>https://tantusdata.com/insights/managing-inter-dag-dependencies-in-airflow/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Fri, 01 Sep 2023 11:00:00 +0000</pubDate>
				<category><![CDATA[Apache Airflow]]></category>
		<category><![CDATA[DAG dependencies]]></category>
		<category><![CDATA[data pipelines]]></category>
		<category><![CDATA[data-aware scheduling]]></category>
		<category><![CDATA[performance tuning]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1731</guid>

					<description><![CDATA[<p>In the real world, data pipelines rarely come as completely independent sequences of operations. Usually, they depend on one another: sometimes through simple 1:1 relationships, sometimes through more complicated ones. Here is a short list of what Apache Airflow has to offer for handling those relationships. The general setting We will look into a situation [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/managing-inter-dag-dependencies-in-airflow/">Managing inter-DAG dependencies in Airflow</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-1024x576.jpg" alt="" class="wp-image-1732" srcset="https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-1536x864.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/TantusData.interDag.dependencies-copy-2048x1152.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In the real world, data pipelines rarely come as completely independent sequences of operations. Usually, they depend on one another: sometimes through simple 1:1 relationships, sometimes through more complicated ones. Here is a short list of what Apache Airflow has to offer for handling those relationships.</p>



<h1 class="wp-block-heading">The general setting</h1>



<p>We will look into a situation where parent data pipelines (called DAGs in Airflowish) create data that is used (consumed) by child pipelines. The dependency looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="300" src="https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-1024x300.jpg" alt="" class="wp-image-1759" srcset="https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-1024x300.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-300x88.jpg 300w, https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-768x225.jpg 768w, https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-1536x450.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/Graph.1-copy-2048x600.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
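<p>The same relationship can be written down as plain data. The sketch below assumes the layout used by the example DAGs later in this article (events and users feed events-with-users, which feeds two reports); the dictionary and the <em>execution_order</em> helper are illustrative names, not part of Airflow.</p>

```python
from graphlib import TopologicalSorter

# Each key is a pipeline; the set lists the parent pipelines whose data it consumes.
# The names are assumptions that mirror the example DAGs in this article.
DEPENDENCIES = {
    "events": set(),
    "users": set(),
    "events-with-users": {"events", "users"},
    "reports-1": {"events-with-users"},
    "reports-2": {"events-with-users"},
}


def execution_order(dependencies):
    """Return one valid order in which the pipelines can run."""
    return list(TopologicalSorter(dependencies).static_order())
```

<p>Any order returned by <em>execution_order(DEPENDENCIES)</em> places both parents before events-with-users and events-with-users before both reports. That is exactly the constraint the techniques below have to enforce across separate DAGs.</p>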



<p>The silent assumption is that we cannot just look at the metadata (file creation time, the last row’s creation timestamp or the like): it may show that data is being loaded, but not that it has been fully loaded. Therefore, we wait until the entire dataset is available.</p>



<h1 class="wp-block-heading">External tools</h1>



<p>An external tool is always an option, and not only for Airflow data pipelines. The most straightforward implementation uses a 0-byte file on NFS, HDFS or any other shareable filesystem to mark success for a given dataset.</p>



<p>This example is based on HDFS, but you could also use:</p>



<ul class="wp-block-list">
<li>NFS,</li>



<li>FTP/SFTP location,</li>



<li>metadata stored in a SQL or NoSQL database,</li>



<li>any other accessible data storage that is not tied to Airflow.</li>
</ul>
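<p>The marker-file idea itself does not depend on the storage. As a minimal sketch, assuming a local directory stands in for the shared filesystem, the producer writes the data first and creates the empty marker last, and the consumer looks only at the marker. All function names and the directory layout here are hypothetical.</p>

```python
from pathlib import Path


def marker_path(root, dataset, ds):
    """Location of the 0-byte success marker for one dataset and date."""
    return Path(root) / "marker" / dataset / f"{ds}.success"


def publish(root, dataset, ds, rows):
    """Write the dataset, then create the marker as the very last step."""
    data_file = Path(root) / "data" / dataset / f"{ds}.csv"
    data_file.parent.mkdir(parents=True, exist_ok=True)
    data_file.write_text("\n".join(rows))

    marker = marker_path(root, dataset, ds)
    marker.parent.mkdir(parents=True, exist_ok=True)
    # The marker is created only after the data file is completely written.
    marker.touch()


def is_ready(root, dataset, ds):
    """Consumers check the marker, never the data file's own metadata."""
    return marker_path(root, dataset, ds).exists()
```

<p>Because the marker is created only after the write has finished, a consumer that sees it can assume the dataset for that date is complete, which is what a sensor on the child side checks for.</p>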



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://tantusdata.com/app/uploads/2023/08/graph.2-1-1024x475.jpg" alt="" class="wp-image-1751" srcset="https://tantusdata.com/app/uploads/2023/08/graph.2-1-1024x475.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph.2-1-300x139.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph.2-1-768x356.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph.2-1-1536x713.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph.2-1-2048x950.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Source code</h2>



<p>For the current (4.1.0) version of the apache-airflow-providers-apache-hdfs package, there is no HDFS operator; only sensors are available. Therefore, a bit of a workaround (e.g. BashOperator) has to be used to create a marker file.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from datetime import datetime


from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor


HDFS_CONNECTION_ID = &quot;hdfs-14&quot;
SFTP_CONNECTION_ID = &quot;sftp-21&quot;


with DAG(
	dag_id=&quot;external-events&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_events:
	SFTPSensor(
    		task_id=&quot;wait-for-csv&quot;,
	    	path=&quot;/inbox/events/{{ dag_run.logical_date | ds }}.csv&quot;,
    		sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;csv-to-parquet&quot;,
    		application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;update-hive-table&quot;,
    		application=&quot;/spark/applications/update-hive-table.py&quot;
	) &gt;&gt; BashOperator(
	    	task_id=&quot;create-hdfs-marker-file&quot;,
    		bash_command=&quot;hdfs dfs -touch /marker/events/{{ dag_run.logical_date | ds }}.success&quot;
	)


with DAG(
	dag_id=&quot;external-users&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_users:
	SFTPSensor(
    		task_id=&quot;wait-for-csv&quot;,
	    	path=&quot;/inbox/users/{{ dag_run.logical_date | ds }}.csv&quot;,
    		sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;csv-to-parquet&quot;,
    		application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;update-hive-table&quot;,
    		application=&quot;/spark/applications/update-hive-table.py&quot;
	) &gt;&gt; BashOperator(
	    	task_id=&quot;create-hdfs-marker-file&quot;,
    		bash_command=&quot;hdfs dfs -touch /marker/users/{{ dag_run.logical_date | ds }}.success&quot;
	)


with DAG(
	dag_id=&quot;external-events-with-users&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_events_with_users:
	[
    	WebHdfsSensor(
        	task_id=&quot;wait-for-marker-events&quot;,
        	filepath=&quot;/marker/events/{{ dag_run.logical_date | ds }}.success&quot;,
        	webhdfs_conn_id=HDFS_CONNECTION_ID
    	),
    	WebHdfsSensor(
        	task_id=&quot;wait-for-marker-users&quot;,
        	filepath=&quot;/marker/users/{{ dag_run.logical_date | ds }}.success&quot;,
        	webhdfs_conn_id=HDFS_CONNECTION_ID
    	)
	] &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;compute-events-with-users&quot;,
    		application=&quot;/spark/applications/compute-events-with-users.py&quot;
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;update-hive-table&quot;,
    		application=&quot;/spark/applications/update-hive-table.py&quot;
	) &gt;&gt; BashOperator(
	    	task_id=&quot;create-hdfs-marker-file&quot;,
    		bash_command=&quot;hdfs dfs -touch /marker/events-with-users/{{ dag_run.logical_date | ds }}.success&quot;
	)


with DAG(
	dag_id=&quot;external-reports-1&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_reports_1:
	WebHdfsSensor(
	    	task_id=&quot;wait-for-events-with-users&quot;,
    		filepath=&quot;/marker/events-with-users/{{ dag_run.logical_date | ds }}.success&quot;,
	    	webhdfs_conn_id=HDFS_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
	    	task_id=&quot;compute-report&quot;,
    		application=&quot;/spark/applications/compute-report-1.py&quot;
	)


with DAG(
	dag_id=&quot;external-reports-2&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_reports_2:
	WebHdfsSensor(
    		task_id=&quot;wait-for-events-with-users&quot;,
	    	filepath=&quot;/marker/events-with-users/{{ dag_run.logical_date | ds }}.success&quot;,
    		webhdfs_conn_id=HDFS_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
    		task_id=&quot;compute-report&quot;,
	    	application=&quot;/spark/applications/compute-report-2.py&quot;
	)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">models</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apache</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">hdfs</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">web_hdfs</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">WebHdfsSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apache</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark_submit</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SparkSubmitOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SFTPSensor</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">HDFS_CONNECTION_ID</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hdfs-14</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sftp-21</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">external-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_events</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/events/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">create-hdfs-marker-file</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hdfs dfs -touch /marker/events/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">external-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/users/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">create-hdfs-marker-file</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hdfs dfs -touch /marker/users/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">external-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_events_with_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	[</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">WebHdfsSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-marker-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">filepath</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/marker/events/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">webhdfs_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">HDFS_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">WebHdfsSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-marker-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">filepath</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/marker/users/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">webhdfs_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">HDFS_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	] &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-events-with-users.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">create-hdfs-marker-file</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hdfs dfs -touch /marker/events-with-users/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">external-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_reports_1</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">WebHdfsSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">filepath</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/marker/events-with-users/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">webhdfs_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">HDFS_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-1.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">external-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_reports_2</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">WebHdfsSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">filepath</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/marker/events-with-users/{{ dag_run.logical_date | ds }}.success</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">webhdfs_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">HDFS_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    		</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-2.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span></code></pre></div>






<p>This technique is helpful when Airflow has to be integrated with other tools. On the other hand, the dependency is implicit: it exists only as a marker file path repeated in the producer and consumer DAGs. Any refactoring must be done carefully to maintain that contract between data pipelines.</p>
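
<p>One way to make the contract a little more explicit is to build the marker path in a single place and reuse it in both the producer and the consumer. The snippet below is only a sketch: <code>MARKER_ROOT</code>, <code>marker_path</code> and <code>touch_marker_command</code> are hypothetical names introduced for illustration and are not part of Airflow. It assumes the <code>/marker/&lt;dataset&gt;/&lt;date&gt;.success</code> layout used in the DAGs above.</p>

```python
# Hypothetical helper module; none of these names come from Airflow itself.
MARKER_ROOT = "/marker"  # assumed root directory, matching the paths used above


def marker_path(dataset: str) -> str:
    """Return the templated HDFS marker path for a dataset.

    The Jinja expression is intentionally left unrendered: Airflow renders it
    per DAG run when the string is passed to a templated field such as
    BashOperator.bash_command or WebHdfsSensor.filepath.
    """
    return f"{MARKER_ROOT}/{dataset}/{{{{ dag_run.logical_date | ds }}}}.success"


def touch_marker_command(dataset: str) -> str:
    """Build the bash command a producer DAG could run as its last task."""
    return f"hdfs dfs -touch {marker_path(dataset)}"


print(marker_path("users"))
print(touch_marker_command("users"))
```

<p>A producer would then pass <code>touch_marker_command("users")</code> as <code>bash_command</code>, and a consumer would pass <code>marker_path("users")</code> as <code>filepath</code>, so a change of the path layout happens in one place only.</p>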



<h1 class="wp-block-heading">Airflow datasets</h1>



<p>Since version 2.4, Airflow has offered data-aware scheduling based on the concept of datasets. The general idea is:</p>



<p>1. Operators define datasets to which they publish data. A dataset in Airflow is just metadata &#8211; Airflow does not handle the data itself.</p>



<p>2. DAGs specify the datasets they depend on. The dependent DAG is triggered once every one of those datasets has been updated at least once since its previous run. This setting replaces the time-based schedule.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="475" src="https://tantusdata.com/app/uploads/2023/08/graph.3-1024x475.jpg" alt="" class="wp-image-1753" srcset="https://tantusdata.com/app/uploads/2023/08/graph.3-1024x475.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph.3-300x139.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph.3-768x356.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph.3-1536x713.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph.3-2048x950.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This feature gives no control over the time of execution, and it does not tie a dependent run to specific upstream runs: there is no guarantee that the DAG runs once for every upstream run, because several of them may get squashed into a single dependent run. In return, the user gets a friendly out-of-the-box UI for browsing datasets and their relationships with DAGs.</p>
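
<p>The squashing behaviour can be illustrated with a toy model in plain Python. This is not Airflow code and it simplifies the scheduler heavily. The class <code>DatasetTriggeredDag</code> and its method are hypothetical names used only to show the rule: a run is triggered once every required dataset has been updated at least once since the previous run.</p>

```python
# Toy model of the triggering rule described above; this is NOT Airflow code.
# The class and method names are hypothetical and exist only for this example.


class DatasetTriggeredDag:
    def __init__(self, dag_id, required_datasets):
        self.dag_id = dag_id
        self.required = set(required_datasets)
        self.updated = set()  # datasets updated since the last triggered run
        self.runs = 0         # number of runs triggered so far

    def on_dataset_update(self, dataset):
        """Record an update; trigger a run once every dependency has one."""
        if dataset not in self.required:
            return False
        self.updated.add(dataset)  # repeated updates collapse into one entry
        if self.updated == self.required:
            self.runs += 1
            self.updated.clear()   # start collecting updates for the next run
            return True
        return False


consumer = DatasetTriggeredDag("dataset-events-with-users", ["events", "users"])

# Three updates of "events" arrive before "users" is updated once.
triggered = [
    consumer.on_dataset_update(name)
    for name in ["events", "events", "events", "users"]
]

print(triggered)      # [False, False, False, True]
print(consumer.runs)  # 1
```

<p>Three upstream updates of <code>events</code> end up in a single dependent run, which is why a dataset-triggered DAG should not assume a one-to-one mapping between its runs and the runs of its producers.</p>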



<h2 class="wp-block-heading">Source code</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from airflow import DAG, Dataset
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from pendulum import datetime




dataset_users = Dataset(&quot;af://datasets/users&quot;)
dataset_events = Dataset(&quot;af://datasets/events&quot;)
dataset_events_with_users = Dataset(&quot;af://datasets/events-with-users&quot;)
dataset_reports_1 = Dataset(&quot;af://datasets/reports-1&quot;)
dataset_reports_2 = Dataset(&quot;af://datasets/reports-2&quot;)


SFTP_CONNECTION_ID = &quot;sftp-21&quot;


with DAG(
	dag_id=&quot;dataset-events&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1, tz=&quot;UTC&quot;),
	schedule=&quot;@daily&quot;,
	max_active_runs=1,
) as dag_producer_events:
	SFTPSensor(
        	task_id=&quot;wait-for-csv&quot;,
    	    	path=&quot;/inbox/events/{{ dag_run.logical_date | ds }}.csv&quot;,
    	    	sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;csv-to-parquet&quot;,
    	    	application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;,
    	    	outlets=[dataset_events]
	)


with DAG(
	dag_id=&quot;dataset-users&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1, tz=&quot;UTC&quot;),
	schedule=&quot;@daily&quot;,
	max_active_runs=1,
) as dag_producer_users:
	SFTPSensor(
        	task_id=&quot;wait-for-csv&quot;,
    	    	path=&quot;/inbox/users/{{ dag_run.logical_date | ds }}.csv&quot;,
    	    	sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;csv-to-parquet&quot;,
    	    	application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;,
    	    	outlets=[dataset_users]
	)


with DAG(
	dag_id=&quot;dataset-events-with-users&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1, tz=&quot;UTC&quot;),
	schedule=[dataset_events, dataset_users],
	max_active_runs=1,
) as dag_processor_events_with_users:
	SparkSubmitOperator(
    	    	task_id=&quot;compute-events-with-users&quot;,
    	    	application=&quot;/spark/applications/compute-events-with-users.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;,
    	    	outlets=[dataset_events_with_users]
	)


with DAG(
	dag_id=&quot;dataset-reports-1&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1, tz=&quot;UTC&quot;),
	schedule=[dataset_events_with_users],
	max_active_runs=1,
) as dag_processor_reports_1:
	SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-1.py&quot;,
    	    	outlets=[dataset_reports_1]
	)


with DAG(
	dag_id=&quot;dataset-reports-2&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1, tz=&quot;UTC&quot;),
	schedule=[dataset_events_with_users],
	max_active_runs=1,
) as dag_processor_reports_2:
	SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-2.py&quot;,
    	    	outlets=[dataset_reports_2]
	)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Dataset</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apache</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark_submit</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SparkSubmitOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SFTPSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pendulum</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://datasets/users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_events</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://datasets/events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_events_with_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://datasets/events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_reports_1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://datasets/reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_reports_2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://datasets/reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sftp-21</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dataset-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tz</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">UTC</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">max_active_runs</span><span style="color: #D8DEE9FF">=1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_producer_events</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/events/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dataset-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tz</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">UTC</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">max_active_runs</span><span style="color: #D8DEE9FF">=1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_producer_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/users/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dataset-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tz</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">UTC</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">max_active_runs</span><span style="color: #D8DEE9FF">=1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_processor_events_with_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-events-with-users.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events_with_users</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dataset-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tz</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">UTC</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events_with_users</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">max_active_runs</span><span style="color: #D8DEE9FF">=1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_processor_reports_1</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-1.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_reports_1</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dataset-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tz</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">UTC</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events_with_users</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">max_active_runs</span><span style="color: #D8DEE9FF">=1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_processor_reports_2</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-2.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_reports_2</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span></code></pre></div>






<h1 class="wp-block-heading">Trigger the external DAG</h1>



<p>If datasets are too limited for your use case, or you run an older version of Airflow that does not support them, there are alternative ways to orchestrate the data pipelines. TriggerDagRunOperator represents the push approach: a task in the upstream DAG triggers a run of a specified DAG, optionally supplying custom parameters. Because the trigger is pushed from a single upstream DAG, a dependent DAG cannot wait on more than one source dataset this way. A single upstream DAG can, however, trigger several dependents.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="300" src="https://tantusdata.com/app/uploads/2023/08/graph4-1024x300.jpg" alt="" class="wp-image-1742" srcset="https://tantusdata.com/app/uploads/2023/08/graph4-1024x300.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph4-300x88.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph4-768x225.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph4-1536x450.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph4-2048x600.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
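<p>The custom parameters are handed to the operator through its conf argument, and the payload has to be JSON-serializable. The snippet below is a minimal sketch of preparing such a payload. The helper name build_trigger_conf and the payload keys are assumptions made for illustration and are not part of the pipeline described here.</p>

```python
from __future__ import annotations

import json
from datetime import date


def build_trigger_conf(source_dag_id: str, data_date: date) -> dict[str, str]:
    """Build a payload for a triggered DAG run (hypothetical helper)."""
    conf = {
        # Illustrative keys only; the triggered DAG decides what it reads.
        "triggered_by": source_dag_id,
        "data_date": data_date.isoformat(),
    }
    # Fail early if the payload cannot be serialized to JSON.
    json.dumps(conf)
    return conf


conf = build_trigger_conf("trigger-events-with-users", date(2023, 7, 1))
print(conf)

# Inside a DAG file the payload would be passed to the operator, e.g.:
# TriggerDagRunOperator(
#     task_id="trigger-reports-1",
#     trigger_dag_id="trigger-reports-1",
#     conf=conf,
# )
```

<p>Keeping the payload in a small function like this makes it easy to unit-test without a running Airflow instance; whether that is worth the extra indirection depends on how many parameters the dependents actually need.</p>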



<p>Since this approach identifies the dependent pipeline by its DAG name (dag_id), a renamed DAG silently breaks every trigger that still points at the old name, so be careful when refactoring those names. One way to look up the upstream and downstream DAGs and tasks is the Browse &gt; DAG Dependencies tab, which shows a graph of trigger and sensor relationships.</p>
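<p>A simple guard against broken names is a check that every triggered DAG id is actually defined. The following is only a sketch: the function find_dangling_triggers and the hand-written inventory are hypothetical, and collecting the real ids from your DAG files is left out.</p>

```python
from __future__ import annotations


def find_dangling_triggers(
    defined_dag_ids: set[str],
    triggers: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose target DAG id is not defined."""
    return sorted(
        (source, target)
        for source, targets in triggers.items()
        for target in targets
        if target not in defined_dag_ids
    )


# Hypothetical inventory, written by hand for this sketch.
DEFINED_DAG_IDS = {
    "trigger-events-with-users",
    "trigger-reports-1",
    "trigger-reports-2",
}
TRIGGERS = {
    "trigger-events-with-users": ["trigger-reports-1", "trigger-reports-2"],
}

# All targets exist, so nothing is reported.
print(find_dangling_triggers(DEFINED_DAG_IDS, TRIGGERS))

# Simulate renaming one report DAG without updating the trigger.
renamed = (DEFINED_DAG_IDS - {"trigger-reports-1"}) | {"reports-1"}
print(find_dangling_triggers(renamed, TRIGGERS))
```

<p>Run as part of a test suite, a check along these lines would flag the stale trigger before deployment rather than at run time.</p>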



<h2 class="wp-block-heading">Source code</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.utils.task_group import TaskGroup
from airflow.utils.trigger_rule import TriggerRule




SFTP_CONNECTION_ID = &quot;sftp-21&quot;




with DAG(
	dag_id=&quot;trigger-events-with-users&quot;,
	schedule=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as dag_events_with_users:
	with TaskGroup(group_id=&quot;gather-events&quot;) as group_events:
    	SFTPSensor(
        	task_id=&quot;wait-for-csv&quot;,
        	path=&quot;/inbox/events/{{ dag_run.logical_date | ds }}.csv&quot;,
        	sftp_conn_id=SFTP_CONNECTION_ID
    	) &gt;&gt; SparkSubmitOperator(
        	task_id=&quot;csv-to-parquet&quot;,
        	application=&quot;/spark/applications/csv-to-parquet.py&quot;
    	) &gt;&gt; SparkSubmitOperator(
        	task_id=&quot;update-hive-table&quot;,
        	application=&quot;/spark/applications/update-hive-table.py&quot;
    	)


	with TaskGroup(group_id=&quot;gather-users&quot;) as group_users:
    	SFTPSensor(
        	task_id=&quot;wait-for-csv&quot;,
        	path=&quot;/inbox/users/{{ dag_run.logical_date | ds }}.csv&quot;,
        	sftp_conn_id=SFTP_CONNECTION_ID
    	) &gt;&gt; SparkSubmitOperator(
        	task_id=&quot;csv-to-parquet&quot;,
        	application=&quot;/spark/applications/csv-to-parquet.py&quot;
    	) &gt;&gt; SparkSubmitOperator(
        	task_id=&quot;update-hive-table&quot;,
        	application=&quot;/spark/applications/update-hive-table.py&quot;
    	)


	[group_events, group_users] &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;compute-events-with-users&quot;,
    	    	application=&quot;/spark/applications/compute-events-with-users.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;
	) &gt;&gt; [
    	    	TriggerDagRunOperator(
        	    	task_id=&quot;trigger-reports-1&quot;,
    	        	trigger_dag_id=&quot;trigger-reports-1&quot;,
        	    	trigger_rule=TriggerRule.ALL_SUCCESS,
    	    	),
    	    	TriggerDagRunOperator(
        	    	task_id=&quot;trigger-reports-2&quot;,
    	        	trigger_dag_id=&quot;trigger-reports-2&quot;,
           	 	trigger_rule=TriggerRule.ALL_SUCCESS,
    	    	)
	]


with DAG(
	dag_id=&quot;trigger-reports-1&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1),
) as dag_processor_reports_1:
	SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-1.py&quot;,
	)


with DAG(
	dag_id=&quot;trigger-reports-2&quot;,
	catchup=True,
	start_date=datetime(2023, 7, 1),
) as dag_processor_reports_2:
	SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-2.py&quot;,
	)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">models</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">trigger_dagrun</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TriggerDagRunOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apache</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark_submit</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SparkSubmitOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SFTPSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">utils</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">task_group</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TaskGroup</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">utils</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">trigger_rule</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TriggerRule</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sftp-21</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_events_with_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TaskGroup</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">group_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">gather-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">group_events</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/events/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TaskGroup</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">group_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">gather-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">group_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/users/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	[</span><span style="color: #8FBCBB">group_events</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">group_users</span><span style="color: #D8DEE9FF">] &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-events-with-users.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; [</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">TriggerDagRunOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	        	</span><span style="color: #8FBCBB">trigger_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	    	</span><span style="color: #8FBCBB">trigger_rule</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">TriggerRule</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">ALL_SUCCESS</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">TriggerDagRunOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	        	</span><span style="color: #8FBCBB">trigger_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">           	 	</span><span style="color: #8FBCBB">trigger_rule</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">TriggerRule</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">ALL_SUCCESS</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	]</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_processor_reports_1</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-1.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">trigger-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_processor_reports_2</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-2.py</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span></code></pre></div>






<h1 class="wp-block-heading">The sensor for an external DAG</h1>



<p>Using a sensor is the opposite of triggering another DAG. In this approach, we tell the downstream DAG to pause and wait until another DAG (or a specific task in that DAG) has completed. We still need to specify appropriate schedules for both pipelines, however: the sensor only waits and does not start either DAG, and by default it looks for an upstream run with the same logical date.</p>



<p>The sensor approach is very flexible. Unlike the datasets approach, each sensor instance can wait for a specific DAG/task/execution date combination, which allows waiting with a specific time offset or modeling aggregation relationships (hourly to daily, daily to weekly, etc.). Unlike the triggering approach, a DAG that uses sensors can wait for all of its preconditions before it proceeds.</p>
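<p>As a rough illustration of the hourly-to-daily case, the sketch below computes which hourly runs a daily run would have to wait for. It is only a sketch under stated assumptions: the function name hourly_logical_dates and the DAG id hourly-events are made up for this example, and it assumes that a run's logical date marks the start of its data interval. ExternalTaskSensor accepts such a function through its execution_date_fn parameter and waits for every date the function returns.</p>

```python
from datetime import datetime, timedelta
from typing import List


def hourly_logical_dates(daily_logical_date: datetime) -> List[datetime]:
    """Return the 24 hourly logical dates covered by one daily run.

    Assumption: the logical date is the start of the data interval, so the
    daily run for 2023-07-01 covers the hourly runs from 00:00 to 23:00.
    """
    return [daily_logical_date + timedelta(hours=hour) for hour in range(24)]


# Hypothetical wiring inside a daily DAG (requires Airflow to be installed;
# "hourly-events" is an invented DAG id used only for illustration):
#
# ExternalTaskSensor(
#     task_id="wait-for-hourly-events",
#     external_dag_id="hourly-events",
#     execution_date_fn=hourly_logical_dates,
# )

if __name__ == "__main__":
    dates = hourly_logical_dates(datetime(2023, 7, 1))
    print(len(dates), dates[0], dates[-1])
```

<p>For a fixed offset between two schedules (for example, a downstream DAG that runs two hours after its upstream), the simpler execution_delta parameter of ExternalTaskSensor is usually enough.</p>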



<p>Just as with DAG triggering, sensor dependencies can be inspected in the Browse &gt; DAG Dependencies tab.</p>



<h2 class="wp-block-heading">Source code</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.sensors.external_task import ExternalTaskSensor




SFTP_CONNECTION_ID = &quot;sftp-21&quot;




with DAG(
    	dag_id=&quot;sensor-events&quot;,
    	schedule_interval=&quot;@daily&quot;,
    	start_date=datetime(2023, 7, 1),
    	catchup=True,
) as sensor_events:
	SFTPSensor(
    	    	task_id=&quot;wait-for-csv&quot;,
    	    	path=&quot;/inbox/events/{{ dag_run.logical_date | ds }}.csv&quot;,
    	    	sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;csv-to-parquet&quot;,
    	    	application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;
	)


with DAG(
    	dag_id=&quot;sensor-users&quot;,
    	schedule_interval=&quot;@daily&quot;,
    	start_date=datetime(2023, 7, 1),
    	catchup=True,
) as sensor_users:
	SFTPSensor(
    	    	task_id=&quot;wait-for-csv&quot;,
    	    	path=&quot;/inbox/users/{{ dag_run.logical_date | ds }}.csv&quot;,
    	    	sftp_conn_id=SFTP_CONNECTION_ID
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;csv-to-parquet&quot;,
    	    	application=&quot;/spark/applications/csv-to-parquet.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;
	)


with DAG(
    	dag_id=&quot;sensor-events-with-users&quot;,
    	schedule_interval=&quot;@daily&quot;,
    	start_date=datetime(2023, 7, 1),
    	catchup=True,
) as dag_daily:
	[
    	    	ExternalTaskSensor(
         	   	task_id=&quot;wait-for-events&quot;,
        	    	external_dag_id=&quot;sensor-events&quot;,
    	        	check_existence=True
    	    	),
    	    	ExternalTaskSensor(
         	   	task_id=&quot;wait-for-users&quot;,
        	    	external_dag_id=&quot;sensor-users&quot;,
    	        	check_existence=True
    	    	)
	] &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;compute-events-with-users&quot;,
    	    	application=&quot;/spark/applications/compute-events-with-users.py&quot;
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;update-hive-table&quot;,
    	    	application=&quot;/spark/applications/update-hive-table.py&quot;
	)


with DAG(
    	dag_id=&quot;sensors-reports-1&quot;,
    	schedule=&quot;@daily&quot;,
    	start_date=datetime(2023, 7, 1),
) as dag_reports_1:
	ExternalTaskSensor(
        		task_id=&quot;wait-for-events-with-users&quot;,
    	    	external_dag_id=&quot;sensor-events-with-users&quot;,
    	    	check_existence=True
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-1.py&quot;
	)


with DAG(
    	dag_id=&quot;sensors-reports-2&quot;,
    	schedule=&quot;@daily&quot;,
    	start_date=datetime(2023, 7, 1),
) as dag_reports_2:
	ExternalTaskSensor(
    	    	task_id=&quot;wait-for-events-with-users&quot;,
    	    	external_dag_id=&quot;sensor-events-with-users&quot;,
    	    	check_existence=True
	) &gt;&gt; SparkSubmitOperator(
    	    	task_id=&quot;compute-report&quot;,
    	    	application=&quot;/spark/applications/compute-report-2.py&quot;
	)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">timedelta</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">models</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apache</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">spark_submit</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SparkSubmitOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">providers</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sftp</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SFTPSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">external_task</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ExternalTaskSensor</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sftp-21</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sensor_events</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/events/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sensor_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">SFTPSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">path</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/inbox/users/{{ dag_run.logical_date | ds }}.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">sftp_conn_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">SFTP_CONNECTION_ID</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">csv-to-parquet</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/csv-to-parquet.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_daily</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	[</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">         	   	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	    	</span><span style="color: #8FBCBB">external_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	        	</span><span style="color: #8FBCBB">check_existence</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">         	   	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        	    	</span><span style="color: #8FBCBB">external_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	        	</span><span style="color: #8FBCBB">check_existence</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	] &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-events-with-users.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">update-hive-table</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/update-hive-table.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensors-reports-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_reports_1</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        		</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">external_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">check_existence</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-1.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensors-reports-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_reports_2</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-for-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">external_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sensor-events-with-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">check_existence</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span></span>
<span class="line"><span style="color: #D8DEE9FF">	) &gt;&gt; </span><span style="color: #8FBCBB">SparkSubmitOperator</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">compute-report</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	    	</span><span style="color: #8FBCBB">application</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">/spark/applications/compute-report-2.py</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">	)</span></span></code></pre></div>



<p></p>



<h1 class="wp-block-heading">Short comparison</h1>



<figure class="wp-block-table wp-block-table--scrolled"><table><tbody><tr><td></td><td><strong><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">External</mark></strong></td><td><strong><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Dataset</mark></strong></td><td><strong><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Trigger</mark></strong></td><td><strong><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Sensor</mark></strong></td></tr><tr><td>Multiple triggers for a single DAG</td><td>Yes</td><td>No</td><td>Yes</td><td>Yes (although a bit obscure)</td></tr><tr><td>Multiple requirements to run a single DAG</td><td>Yes</td><td>Yes</td><td>No</td><td>Yes</td></tr><tr><td>Dependencies on different time intervals</td><td>Yes</td><td>No</td><td>No</td><td>Yes</td></tr><tr><td><strong>Strong points</strong></td><td>Integration with 3rd party tools.</td><td>Good lineage support in the UI. Push scheduling.</td><td>Push scheduling.</td><td>Allows execution time manipulation (daily to weekly DAGs).</td></tr><tr><td><strong>Weak points</strong></td><td>Sensitive to refactoring.</td><td>No control over schedule intervals.</td><td>Weak error handling. Any DAG can depend on only up to one DAG.</td><td>Pull scheduling.</td></tr></tbody></table><figcaption class="wp-element-caption">The options compared</figcaption></figure>
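<p>The "execution time manipulation" entry in the Sensor column deserves a short illustration. A weekly DAG that waits for a daily DAG has to translate its own logical date into the logical dates of the daily runs it depends on. The sketch below shows only that date arithmetic in plain Python. The function name <code>daily_logical_dates</code> is hypothetical, and the sketch assumes both DAGs run on midnight-aligned schedules, so one weekly run covers seven consecutive daily runs.</p>

```python
from __future__ import annotations

from datetime import datetime, timedelta


def daily_logical_dates(weekly_logical_date: datetime) -> list[datetime]:
    """Return the seven daily logical dates covered by one weekly run.

    Hypothetical helper: assumes the weekly logical date marks the start
    of the week and the daily DAG runs once per day at midnight.
    """
    return [weekly_logical_date + timedelta(days=offset) for offset in range(7)]


week_start = datetime(2023, 7, 2)  # a Sunday, where an "@weekly" interval starts
dates = daily_logical_dates(week_start)
print(dates[0].date(), dates[-1].date())  # prints "2023-07-02 2023-07-08"
```

<p>A function of this shape could be passed as the <code>execution_date_fn</code> argument of <code>ExternalTaskSensor</code>, which accepts a callable that receives the current logical date and returns the logical date (or dates) to look up in the external DAG. Whether the offsets match your pipeline depends on how both DAGs are scheduled, so treat the seven-day window as an assumption to verify.</p>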



<h1 class="wp-block-heading">Example scenarios</h1>



<p>The end goal of running Airflow is to orchestrate an actual pipeline, so let’s discuss some real-life examples where DAG dependencies are necessary.</p>



<h2 class="wp-block-heading">Breaking down a complex DAG</h2>



<p>The first example is a complex DAG that calculates multiple outputs (exports) from some external inputs. For clarity, all operations are split into waiting (sensors), preprocessing, and report computation. The screenshot below shows them in Airflow's Graph view.</p>



<p>The assumption is that all operations consume whole input datasets; they are not restricted to, e.g., the current hour only. Otherwise, sensors or external markers would be the better option.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="575" src="https://tantusdata.com/app/uploads/2023/08/graph.5-1024x575.jpg" alt="" class="wp-image-1756" srcset="https://tantusdata.com/app/uploads/2023/08/graph.5-1024x575.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph.5-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph.5-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph.5-1536x863.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph.5-2048x1151.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is a situation where datasets (or, as the docs call it, data-aware scheduling) work well. For each branching point, we define an outlet: an abstraction over the dataset produced by an operator. Remember that it is only metadata; Airflow does not handle the actual data and never should. Then we define each chain of operations that consumes those datasets as a separate DAG, with the datasets as its schedule. As a result, the DAGs are short &amp; simple, and we get a bit of data lineage in the Datasets tab of Airflow for free:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="300" src="https://tantusdata.com/app/uploads/2023/08/graph.6-1024x300.jpg" alt="" class="wp-image-1744" srcset="https://tantusdata.com/app/uploads/2023/08/graph.6-1024x300.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph.6-300x88.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph.6-768x225.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph.6-1536x450.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph.6-2048x600.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
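<p>Before the full source, it may help to see the triggering rule in isolation. A DAG scheduled on several datasets runs once every one of them has been updated at least once since its previous run. The snippet below is a toy model of that rule in plain Python, not Airflow code: the <code>ConsumerDag</code> class and its method are hypothetical names made up for illustration, and only the dataset URIs and the DAG id are taken from the example.</p>

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConsumerDag:
    """Toy stand-in for a DAG whose schedule is a list of datasets."""

    dag_id: str
    schedule: list[str]
    # Datasets that have been updated since the last run of this DAG.
    pending: set[str] = field(default_factory=set)
    runs: int = 0

    def on_dataset_updated(self, uri: str) -> bool:
        """Record a dataset update; return True if it triggered a run."""
        if uri not in self.schedule:
            return False  # this DAG does not consume that dataset
        self.pending.add(uri)
        if self.pending == set(self.schedule):
            # Every consumed dataset has fresh data: run and start over.
            self.runs += 1
            self.pending.clear()
            return True
        return False


report_users = ConsumerDag(
    dag_id="complex-dag-report-users",
    schedule=["af://dataset/events", "af://dataset/users"],
)

triggered = [
    report_users.on_dataset_updated("af://dataset/events"),   # still waiting for users
    report_users.on_dataset_updated("af://dataset/events"),   # repeated update, still waiting
    report_users.on_dataset_updated("af://dataset/users"),    # both updated: run
    report_users.on_dataset_updated("af://dataset/clients"),  # not consumed: ignored
]
print(triggered)  # prints [False, False, True, False]
```

<p>The second update of the events dataset does not cause an extra run. This is why the comparison table lists "no control over schedule intervals" as a weak point: the consumer simply follows its slowest producer.</p>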



<h3 class="wp-block-heading">Source code</h3>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from datetime import datetime


from airflow import Dataset
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.sensors.bash import BashSensor


dataset_clients = Dataset(&quot;af://dataset/clients&quot;)
dataset_events = Dataset(&quot;af://dataset/events&quot;)
dataset_transactions = Dataset(&quot;af://dataset/transactions&quot;)
dataset_users = Dataset(&quot;af://dataset/users&quot;)


with DAG(
	dag_id=&quot;complex-dag-clients&quot;,
	schedule_interval=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as complex_dag_clients:
	sensor_clients = BashSensor(task_id=&quot;wait-clients&quot;, bash_command=&quot;true&quot;)


	op_preprocess_step_1 = EmptyOperator(task_id=&quot;preprocess-step-1&quot;)
	op_preprocess_step_2 = EmptyOperator(task_id=&quot;preprocess-step-2&quot;)
	op_preprocess_step_3 = EmptyOperator(task_id=&quot;preprocess-step-3&quot;)
	op_preprocess_step_4 = EmptyOperator(task_id=&quot;preprocess-step-4&quot;, outlets=[dataset_clients])


	sensor_clients &gt;&gt; op_preprocess_step_1 &gt;&gt; \
    	[op_preprocess_step_2 &gt;&gt; op_preprocess_step_3] &gt;&gt; \
    	op_preprocess_step_4


with DAG(
	dag_id=&quot;complex-dag-events&quot;,
	schedule_interval=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as complex_dag_events:
	sensor_events = BashSensor(task_id=&quot;wait-events&quot;, bash_command=&quot;true&quot;)


	op_preprocess_step_1 = EmptyOperator(task_id=&quot;preprocess-step-1&quot;)
	op_preprocess_step_2 = EmptyOperator(task_id=&quot;preprocess-step-2&quot;)
	op_preprocess_step_3 = EmptyOperator(task_id=&quot;preprocess-step-3&quot;, outlets=[dataset_events])


	sensor_events &gt;&gt; op_preprocess_step_1 &gt;&gt; op_preprocess_step_2 &gt;&gt; op_preprocess_step_3


with DAG(
	dag_id=&quot;complex-dag-transactions&quot;,
	schedule_interval=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as complex_dag_transactions:
	sensor_gateway_01 = BashSensor(task_id=&quot;wait-gateway-01&quot;, bash_command=&quot;true&quot;)
	sensor_gateway_02 = BashSensor(task_id=&quot;wait-gateway-02&quot;, bash_command=&quot;true&quot;)
	sensor_gateway_03 = BashSensor(task_id=&quot;wait-gateway-03&quot;, bash_command=&quot;true&quot;)
	sensor_gateway_04 = BashSensor(task_id=&quot;wait-gateway-04&quot;, bash_command=&quot;true&quot;)
	sensor_gateway_05 = BashSensor(task_id=&quot;wait-gateway-05&quot;, bash_command=&quot;true&quot;)


	op_preprocess_step_1 = EmptyOperator(task_id=&quot;preprocess-step-1&quot;)
	op_preprocess_step_2 = EmptyOperator(task_id=&quot;preprocess-step-2&quot;)
	op_preprocess_step_3 = EmptyOperator(task_id=&quot;preprocess-step-3&quot;)
	op_preprocess_step_4 = EmptyOperator(task_id=&quot;preprocess-step-4&quot;)
	op_preprocess_step_5 = EmptyOperator(task_id=&quot;preprocess-step-5&quot;, outlets=[dataset_transactions])


	[
    	sensor_gateway_01,
    	sensor_gateway_02,
    	sensor_gateway_03,
    	sensor_gateway_04,
    	sensor_gateway_05
	] &gt;&gt; op_preprocess_step_1 &gt;&gt; op_preprocess_step_2 &gt;&gt; op_preprocess_step_3 &gt;&gt; \
    	op_preprocess_step_4 &gt;&gt; op_preprocess_step_5


with DAG(
	dag_id=&quot;complex-dag-users&quot;,
	schedule_interval=&quot;@daily&quot;,
	start_date=datetime(2023, 7, 1),
) as complex_dag_users:
	sensor_users = BashSensor(task_id=&quot;wait-users&quot;, bash_command=&quot;true&quot;)


	op_preprocess_step_1 = EmptyOperator(task_id=&quot;preprocess-step-1&quot;)
	op_preprocess_step_2 = EmptyOperator(task_id=&quot;preprocess-step-2&quot;)
	op_preprocess_step_3 = EmptyOperator(task_id=&quot;preprocess-step-3&quot;)
	op_preprocess_step_4 = EmptyOperator(task_id=&quot;preprocess-step-4&quot;)
	op_preprocess_step_5 = EmptyOperator(task_id=&quot;preprocess-step-5&quot;)
	op_preprocess_step_6 = EmptyOperator(task_id=&quot;preprocess-step-6&quot;, outlets=[dataset_users])


	sensor_users &gt;&gt; [op_preprocess_step_1, op_preprocess_step_2] &gt;&gt; op_preprocess_step_3 &gt;&gt; \
    	[op_preprocess_step_4, op_preprocess_step_5] &gt;&gt; op_preprocess_step_6


with DAG(
	dag_id=&quot;complex-dag-report-clients&quot;,
	schedule=[dataset_clients, dataset_events],
	start_date=datetime(2023, 7, 1),
) as complex_dag_report_clients:
	op_rich_clients = EmptyOperator(task_id=&quot;enrich-clients&quot;)
	op_report_clients = BashOperator(task_id=&quot;report-clients&quot;, bash_command=&quot;sleep 3&quot;)


	op_rich_clients &gt;&gt; op_report_clients


with DAG(
	dag_id=&quot;complex-dag-report-users&quot;,
	schedule=[dataset_events, dataset_users],
	start_date=datetime(2023, 7, 1),
) as complex_dag_report_users:
	op_rich_users = EmptyOperator(task_id=&quot;enrich-events&quot;)
	op_report_users = BashOperator(task_id=&quot;report-users&quot;, bash_command=&quot;sleep 3&quot;)


	op_rich_users &gt;&gt; op_report_users


with DAG(
	dag_id=&quot;complex-dag-report-full&quot;,
	schedule=[dataset_events, dataset_transactions, dataset_users],
	start_date=datetime(2023, 7, 1),
) as complex_dag_report_full:
	op_report_users = BashOperator(task_id=&quot;report-full&quot;, bash_command=&quot;sleep 3&quot;)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Dataset</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">models</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">empty</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">EmptyOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashSensor</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">dataset_clients</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://dataset/clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_events</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://dataset/events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_transactions</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://dataset/transactions</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Dataset</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">af://dataset/users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_clients</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_clients</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_4</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-4</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_clients</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_clients</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	[</span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF">] &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">op_preprocess_step_4</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_events</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_events</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_events</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_3</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-transactions</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_transactions</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_gateway_01</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-gateway-01</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_gateway_02</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-gateway-02</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_gateway_03</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-gateway-03</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_gateway_04</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-gateway-04</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_gateway_05</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-gateway-05</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_4</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-4</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_5</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-5</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_transactions</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	[</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">sensor_gateway_01</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">sensor_gateway_02</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">sensor_gateway_03</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">sensor_gateway_04</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">sensor_gateway_05</span></span>
<span class="line"><span style="color: #D8DEE9FF">	] &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	</span><span style="color: #8FBCBB">op_preprocess_step_4</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_5</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_4</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-4</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_5</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-5</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_preprocess_step_6</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">preprocess-step-6</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">outlets</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">sensor_users</span><span style="color: #D8DEE9FF"> &gt;&gt; [</span><span style="color: #8FBCBB">op_preprocess_step_1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">op_preprocess_step_2</span><span style="color: #D8DEE9FF">] &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_3</span><span style="color: #D8DEE9FF"> &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">    	[</span><span style="color: #8FBCBB">op_preprocess_step_4</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">op_preprocess_step_5</span><span style="color: #D8DEE9FF">] &gt;&gt; </span><span style="color: #8FBCBB">op_preprocess_step_6</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-report-clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_clients</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dataset_events</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_report_clients</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_rich_clients</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">enrich-clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_report_clients</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">report-clients</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_rich_clients</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_report_clients</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-report-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_report_users</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_rich_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">EmptyOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">enrich-events</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_report_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">report-users</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_rich_users</span><span style="color: #D8DEE9FF"> &gt;&gt; </span><span style="color: #8FBCBB">op_report_users</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">complex-dag-report-full</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">schedule</span><span style="color: #D8DEE9FF">=[</span><span style="color: #8FBCBB">dataset_events</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dataset_transactions</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dataset_users</span><span style="color: #D8DEE9FF">]</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">complex_dag_report_full</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">	</span><span style="color: #8FBCBB">op_report_users</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">report-full</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>






<h2 class="wp-block-heading">Daily to weekly transition</h2>



<p>This example covers any time-based aggregation of data. Regardless of the actual intervals, the general idea is that multiple runs of one DAG serve as input for a single run of another:</p>



<ul class="wp-block-list">
<li>24 runs of the hourly pipeline for the daily one,</li>



<li>7 runs of the daily DAG for the weekly one.</li>
</ul>
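<p>The fan-in described above can be sketched in plain Python, without Airflow. The helper name <code>upstream_logical_dates</code> and the dates are assumptions made for illustration only; the sketch simply lists the upstream run dates that fall inside one downstream interval.</p>

```python
from datetime import datetime, timedelta


def upstream_logical_dates(interval_start, interval_end, upstream_step):
    """List the upstream run dates that fall inside one downstream interval."""
    # Number of whole upstream steps that fit in the downstream interval.
    run_count = (interval_end - interval_start) // upstream_step
    return [interval_start + i * upstream_step for i in range(run_count)]


# One day of hourly runs and one week of daily runs (illustrative dates).
hourly_inputs = upstream_logical_dates(
    datetime(2023, 7, 2), datetime(2023, 7, 3), timedelta(hours=1)
)
daily_inputs = upstream_logical_dates(
    datetime(2023, 7, 2), datetime(2023, 7, 9), timedelta(days=1)
)
print(len(hourly_inputs), len(daily_inputs))  # prints: 24 7
```

<p>Each downstream run therefore depends on a fixed-size batch of upstream runs, which is the relationship the downstream DAG has to wait for.</p>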



<p>This setup narrows the available options to using an external tool or implementing DAG-run sensors. Let’s take a look at the latter.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="300" src="https://tantusdata.com/app/uploads/2023/08/graph7-1024x300.jpg" alt="" class="wp-image-1739" srcset="https://tantusdata.com/app/uploads/2023/08/graph7-1024x300.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/graph7-300x88.jpg 300w, https://tantusdata.com/app/uploads/2023/08/graph7-768x225.jpg 768w, https://tantusdata.com/app/uploads/2023/08/graph7-1536x450.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/graph7-2048x600.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The idea behind this diagram is simple: the daily pipeline is unaffected by the relationship. The weekly one, in turn, starts with seven ExternalTaskSensor tasks, each with a different execution date offset. They wait for the daily DAG runs from 1 to 7 days earlier. Apart from that set of sensors, the DAG is similar to the others.</p>
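<p>To see which daily runs the seven sensors look for, here is a minimal sketch in plain Python. It assumes the sensor resolves its target as the current logical date minus <code>execution_delta</code>, which is how that parameter is documented. The helper name <code>awaited_daily_runs</code> and the weekly logical date are hypothetical, chosen only for illustration.</p>

```python
from datetime import datetime, timedelta


def awaited_daily_runs(weekly_logical_date, offsets=range(1, 8)):
    """Return the daily logical dates the seven sensors would wait for."""
    # Mirrors execution_delta: target = current logical date - delta.
    return [weekly_logical_date - timedelta(days=days) for days in offsets]


# Illustrative weekly logical date; any datetime works the same way.
targets = awaited_daily_runs(datetime(2023, 7, 9))
for target in targets:
    print(target.date())
```

<p>For this input the targets run from 2023-07-08 back to 2023-07-02. Note that the earliest targets must not precede the daily DAG's <code>start_date</code>, otherwise those runs never exist and the corresponding sensors can only time out.</p>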



<h3 class="wp-block-heading">Source code</h3>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="from __future__ import annotations


from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.bash import BashSensor
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.task_group import TaskGroup


with DAG(
    dag_id=&quot;dag-daily&quot;,
    schedule_interval=&quot;@daily&quot;,
    start_date=datetime(2023, 7, 1),
    catchup=True,
) as dag_daily:
    BashSensor(task_id=&quot;some-input-wait&quot;, bash_command=&quot;true&quot;) &gt;&gt; \
        BashOperator(task_id=&quot;some-data-processing-1&quot;, bash_command=&quot;sleep 3&quot;) &gt;&gt; \
        BashOperator(task_id=&quot;some-data-processing-2&quot;, bash_command=&quot;sleep 3&quot;) &gt;&gt; \
        BashOperator(task_id=&quot;exporting-to-hdfs&quot;, bash_command=&quot;sleep 3&quot;)


with DAG(
    dag_id=&quot;dag-weekly&quot;,
    schedule_interval=&quot;@weekly&quot;,
    start_date=datetime(2023, 7, 1),
    catchup=True,
) as dag_weekly:
    # TaskGroup is purely optional, but makes DAG in the UI a bit clearer.
    with TaskGroup(group_id=&quot;dag-daily-sensors&quot;) as sensors:
        [
            ExternalTaskSensor(
                task_id=f&quot;wait-{days_offset}d&quot;,
                external_dag_id=&quot;dag-daily&quot;,
                timeout=24 * 60 * 60,
                mode=&quot;reschedule&quot;,
                execution_delta=timedelta(days=days_offset)
            ) for days_offset in range(1, 8)
        ]


    sensors &gt;&gt; \
        BashOperator(task_id=&quot;some-data-processing&quot;, bash_command=&quot;sleep 3&quot;) &gt;&gt; \
	    	BashOperator(task_id=&quot;exporting-to-hdfs&quot;, bash_command=&quot;sleep 3&quot;)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__future__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">annotations</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">datetime</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">timedelta</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">models</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">operators</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashOperator</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bash</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BashSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">external_task</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ExternalTaskSensor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">airflow</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">utils</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">task_group</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TaskGroup</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dag-daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_daily</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">BashSensor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">some-input-wait</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">true</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">some-data-processing-1</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">some-data-processing-2</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">exporting-to-hdfs</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dag-weekly</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">schedule_interval</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">@weekly</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">start_date</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">datetime</span><span style="color: #D8DEE9FF">(2023</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 7</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 1)</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">catchup</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dag_weekly</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">TaskGroup</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">purely</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">optional</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">but</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">makes</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DAG</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">UI</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bit</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">clearer</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">TaskGroup</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">group_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dag-daily-sensors</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        [</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #8FBCBB">ExternalTaskSensor</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">wait-{days_offset}d</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #8FBCBB">external_dag_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dag-daily</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #8FBCBB">timeout</span><span style="color: #D8DEE9FF">=24 </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> 60 </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> 60</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #8FBCBB">mode</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">reschedule</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #8FBCBB">execution_delta</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">timedelta</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">days</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">days_offset</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">            ) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">days_offset</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">range</span><span style="color: #D8DEE9FF">(1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 8)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        ]</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">sensors</span><span style="color: #D8DEE9FF"> &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">some-data-processing</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">) &gt;&gt; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">	    	</span><span style="color: #8FBCBB">BashOperator</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">task_id</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">exporting-to-hdfs</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bash_command</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">sleep 3</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>






<p>The source code for all examples and the Docker environment to run them are available on GitHub.</p>
<p>The post <a href="https://tantusdata.com/insights/managing-inter-dag-dependencies-in-airflow/">Managing inter-DAG dependencies in Airflow</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Big cartesian join in Big Query</title>
		<link>https://tantusdata.com/insights/big-cartesian-join-in-big-query/</link>
		
		<dc:creator><![CDATA[Amadeusz Kosik]]></dc:creator>
		<pubDate>Wed, 23 Aug 2023 10:19:33 +0000</pubDate>
				<category><![CDATA[big query]]></category>
		<category><![CDATA[cartesian join]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1603</guid>

					<description><![CDATA[<p>When working on more advanced analytics (or reports), you may stumble upon the problem of doing a self-join. Be it all pairs of available products, account connections in social media or anything else; it requires creating a heavy self-join on the data. Since this pattern is not as commonly documented as aggregation by group or [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/big-cartesian-join-in-big-query/">Big cartesian join in Big Query</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-1024x576.jpg" alt="" class="wp-image-1725" srcset="https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-1536x864.jpg 1536w, https://tantusdata.com/app/uploads/2023/08/TantusData.cartesian.join_-2048x1152.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>When working on more advanced analytics (or reports), you may stumble upon the problem of doing a self-join. Be it all pairs of available products, account connections in social media or anything else, the task requires a heavy self-join on the data. Since this pattern is not as commonly documented as aggregation by group or filtering, let’s look into possible implementations of it. In this case, we use the BigQuery service on Google Cloud Platform.</p>



<h2 class="wp-block-heading">The challenge</h2>



<p>We have a table of products available in shops:</p>



<ol class="wp-block-list">
<li>One row represents one product in one shop.</li>



<li>If a product is available in multiple shops, it has multiple rows, each with the corresponding shop ID.</li>
</ol>



<p>As a result, we need a table of pairs of products:</p>



<ul class="wp-block-list">
<li>All-to-all product pairs are computed separately for each shop and month; we do not want cross-month or cross-shop pairs (if the first product is from shop 38 and month 2001-04, the second one must be from the same shop and month).</li>



<li>The table will be queried with a month predicate &#8211; only a single month will be requested per query.</li>



<li>Each pair of products within one month and shop should appear only once (if the pair (A, B) is present, there should be no (B, A) row for that month and shop).</li>
</ul>



<p>All queries must run successfully in GCP BigQuery on-demand pricing mode.</p>
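<p>The pairing rules above can be sketched in plain Python before writing any SQL. This is only an illustration of the expected output, not the BigQuery implementation: the sample rows and the function name product_pairs are made up, and the sketch assumes one row per product, shop and month, as described above.</p>

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (product, shop, month).
rows = [
    ("A", 38, "2001-04"),
    ("B", 38, "2001-04"),
    ("C", 38, "2001-04"),
    ("A", 12, "2001-04"),
    ("B", 38, "2001-05"),
]


def product_pairs(rows):
    """Return (product_1, product_2, shop, month), each unordered pair once."""
    # Group products by (shop, month) so no pair crosses a shop or a month.
    groups = defaultdict(list)
    for product, shop, month in rows:
        groups[(shop, month)].append(product)

    pairs = []
    for (shop, month), products in sorted(groups.items()):
        # combinations() over sorted products yields (A, B) but never (B, A).
        for first, second in combinations(sorted(products), 2):
            pairs.append((first, second, shop, month))
    return pairs


for pair in product_pairs(rows):
    print(pair)
```

<p>Only shop 38 in month 2001-04 has more than one product here, so it is the only group that contributes pairs; the single-product groups contribute none.</p>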



<h2 class="wp-block-heading">Input data</h2>



<p>In this case, we will use the following table as input data:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="CREATE TABLE IF NOT EXISTS db.products AS
 SELECT
   product,
   shop,
   transactions,
   transaction_date
 FROM
   db.products_ingestion;
" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">CREATE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">TABLE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">IF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">NOT</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">EXISTS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">SELECT</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">product</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">transactions</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">transaction_date</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_ingestion</span><span style="color: #81A1C1">;</span></span>
<span class="line"></span></code></pre></div>






<p>This table is the input for the self-join. The available columns are narrowed to a minimum for the sake of clarity:</p>



<ul class="wp-block-list">
<li>product (the product ID &#8211; these are the values that will be cross-joined).</li>



<li>shop (the shop ID).</li>



<li>transactions (the number of transactions involving a given product).</li>



<li>transaction_date (effectively truncated to year-month).</li>
</ul>



<h2 class="wp-block-heading">Naive approach</h2>



<p>The first thing that comes to mind is to write the self-join directly. Let’s test the simplest approach:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="CREATE TABLE db.products_pairs AS
 SELECT
   lhs.product AS product_1,
   rhs.product AS product_2,
   lhs.transactions AS transactions_1,
   rhs.transactions AS transactions_2,
   lhs.shop AS shop,
   lhs.transaction_date AS transaction_date
 FROM (SELECT * FROM db.products) lhs
 INNER JOIN (SELECT * FROM db.products) rhs
 USING (shop, transaction_date)
 WHERE
   lhs.transaction_date = DATE(2022, 1, 1) 
      AND rhs.transaction_date = DATE(2022, 1, 1) 
      AND lhs.product &lt; rhs.product;" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">CREATE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">TABLE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_pairs</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">SELECT</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">shop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">FROM</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">lhs</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">INNER</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">JOIN</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">rhs</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">USING</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">WHERE</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">) </span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #D8DEE9">AND</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">) </span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #D8DEE9">AND</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">&lt;</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #81A1C1">;</span></span></code></pre></div>






<p>The implicit assumption here is that filtering runs at the input level (before the join), so each query selects only a single month of data. The last predicate removes duplicate pairs: each pair of products is kept once, and no product is paired with itself. Although the query looks fine, it fails with one of the following error messages:</p>



<p>Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.</p>



<p>The query exceeded resource limits. This query used 1666502 CPU seconds but would charge only 2394M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 1666502 CPU seconds were used, and this query must use less than 612800 CPU seconds.</p>



<p>The BigQuery engine seems to estimate the join cost from the size of the input data. Our query appears to be a corner case, though: it reads only one table, but reads it twice. The estimate evidently does not account for that, so a single-table self-join misleads the engine. Let’s trick it one step further and make it estimate the cost correctly.</p>



<h2 class="wp-block-heading">Complicated approach</h2>



<p>We are going to look into two approaches:</p>



<ul class="wp-block-list">
<li>tweaking the query to make the BigQuery engine calculate the cost correctly,</li>



<li>rewriting the query to get rid of the self-join.</li>
</ul>



<p>Along with comparing those two ideas, we will look at the impact of using the two most apparent techniques for fine-tuning queries: partitioning and clustering. They should improve both the execution time and cost &#8211; let’s see how effective they are.</p>



<h2 class="wp-block-heading">Fixed join approach</h2>



<p>We start by fixing the cartesian join. One trick must be applied: each side of the join must come from a different table. This can be accomplished by creating a 1:1 copy of the input table. A view will not work here: BQ is smart enough to look through the view at its source table.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="CREATE TABLE IF NOT EXISTS db.products_aux AS
 SELECT * FROM db.products;
CREATE TABLE db.products_pairs_join AS
 SELECT
   lhs.product AS product_1,
   rhs.product AS product_2,
   lhs.transactions AS transactions_1,
   rhs.transactions AS transactions_2,
   lhs.shop AS shop,
   lhs.transaction_date AS transaction_date
 FROM (SELECT * FROM db.products) lhs
 INNER JOIN (SELECT * FROM db.products_aux) rhs
 USING (shop, transaction_date)
 WHERE
   lhs.transaction_date = DATE(2022, 1, 1) 
      AND rhs.transaction_date = DATE(2022, 1, 1) 
      AND lhs.product &lt; rhs.product;" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">CREATE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">TABLE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">IF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">NOT</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">EXISTS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_aux</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span><span style="color: #81A1C1">;</span></span>
<span class="line"><span style="color: #D8DEE9">CREATE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">TABLE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_pairs_join</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">SELECT</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">shop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">FROM</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">lhs</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">INNER</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">JOIN</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_aux</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">rhs</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">USING</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">WHERE</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">) </span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #D8DEE9">AND</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">) </span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #D8DEE9">AND</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">lhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">&lt;</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">rhs</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #81A1C1">;</span></span></code></pre></div>






<p>One downside of this approach is that Google does not recommend cartesian joins. It also creates a duplicate of the input table, which must be managed and synchronized with the input. Nevertheless, it is possible to run such a query successfully.</p>
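


<p>Since the mirror is a static snapshot, it has to be rebuilt whenever the input changes. A minimal sketch of such a refresh, assuming a full rebuild is acceptable for the table size (when and how often to run it is not covered here):</p>



<pre class="wp-block-code"><code>-- Rebuild the mirror so it matches the current contents of db.products.
-- CREATE OR REPLACE overwrites the existing copy, whereas IF NOT EXISTS
-- would silently keep a stale one.
CREATE OR REPLACE TABLE db.products_aux AS
 SELECT * FROM db.products;</code></pre>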



<h2 class="wp-block-heading">Windowing approach</h2>



<p>Aggregation and windowing can compute the output table without a self-join. The idea resembles computing a rolling sum (or average), but instead of summing, we collect rows into an array: for each row, the array holds all the products that follow it (in product order) within the same shop and transaction date.</p>



<figure class="wp-block-table wp-block-table--scrolled"><table><tbody><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Original</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Original</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Original</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Original</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Appended aggregate</mark></td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Shop</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Transaction Date</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Product</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Transactions</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Product 2</mark></td></tr><tr><td>Shop 1</td><td>2020-01-01</td><td>Product 1</td><td>7</td><td>(Product 2, 6)<br>(Product 3, 3)<br>(Product 4, 2)<br>(Product 5, 6)</td></tr><tr><td>Shop 1</td><td>2020-01-01</td><td>Product 2</td><td>6</td><td>(Product 3, 3)<br>(Product 4, 2)<br>(Product 5, 6)</td></tr><tr><td>Shop 1</td><td>2020-01-01</td><td>Product 3</td><td>3</td><td>(Product 4, 2)<br>(Product 5, 6)</td></tr><tr><td>Shop 1</td><td>2020-01-01</td><td>Product 4</td><td>2</td><td>(Product 5, 6)</td></tr><tr><td>Shop 1</td><td>2020-01-01</td><td>Product 5</td><td>6</td><td><em>null</em></td></tr><tr><td>Shop 2</td><td>2020-01-01</td><td>Product 1</td><td>7</td><td>(Product 2, 3)<br>(Product 3, 1)</td></tr><tr><td>Shop 2</td><td>2020-01-01</td><td>Product 2</td><td>3</td><td>(Product 3, 1)</td></tr><tr><td>Shop 2</td><td>2020-01-01</td><td>Product 3</td><td>1</td><td><em>null</em></td></tr></tbody></table><figcaption class="wp-element-caption">Input product table with the appended aggregate column</figcaption></figure>
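


<p>To see the window frame in isolation, the first step can be tried on a few inline rows. The snippet below is only a sketch: the sample rows are invented to mirror the table above, and the window definition is the part that matters.</p>



<pre class="wp-block-code"><code>-- Made-up sample rows standing in for db.products.
WITH products AS (
  SELECT * FROM UNNEST([
    STRUCT('Shop 1' AS shop, DATE(2020, 1, 1) AS transaction_date,
           'Product 1' AS product, 7 AS transactions),
    ('Shop 1', DATE(2020, 1, 1), 'Product 2', 6),
    ('Shop 1', DATE(2020, 1, 1), 'Product 3', 3)
  ])
)
SELECT
  shop, transaction_date, product, transactions,
  -- Collect every later row of the same shop and date into an array.
  -- The frame starts one row after the current row, so a product is
  -- never paired with itself.
  ARRAY_AGG(STRUCT(product, transactions)) OVER (
    PARTITION BY shop, transaction_date
    ORDER BY product
    ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
  ) AS product_2
FROM products
ORDER BY product;</code></pre>



<p>The last product in each partition has an empty frame, so it contributes no pairs.</p>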



<p>The second step splits the array elements into separate rows with the UNNEST operator to produce the final output. Unnesting also drops the rows whose array is null, since they have nothing to pair with.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="CREATE TABLE IF NOT EXISTS db.products_pair_aggregate AS
 WITH root AS (
   SELECT *,
       ARRAY_AGG(STRUCT (product, transactions)) OVER (product_window) AS product_2
       FROM db.products
       WHERE transaction_date = DATE(2022, 1, 1)
       WINDOW product_window
         AS (PARTITION BY shop, transaction_date 
             ORDER BY product ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
 )
 SELECT
   shop,
   transaction_date,
   root.product AS product_1,
   product_2.product AS product_2,
   root.transactions AS transactions_1,
   product_2.transactions AS transactions_2
   FROM root, UNNEST(product_2) product_2
 WHERE root.transaction_date = DATE(2022, 1, 1);
" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">CREATE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">TABLE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">IF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">NOT</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">EXISTS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products_pair_aggregate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">WITH</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">root</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">AS</span><span style="color: #D8DEE9FF"> (</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">SELECT</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">*</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #88C0D0">ARRAY_AGG</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">STRUCT</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">product</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF">)) </span><span style="color: #88C0D0">OVER</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">product_window</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_2</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">db</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">WHERE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">WINDOW</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_window</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #88C0D0">AS</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">PARTITION</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">BY</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #D8DEE9">ORDER</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">BY</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">ROWS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">BETWEEN</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FOLLOWING</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AND</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">UNBOUNDED</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">FOLLOWING</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"> )</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">SELECT</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">shop</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">transaction_date</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">root</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">product_2</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product_2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">root</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">product_2</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transactions</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">AS</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">transactions_2</span></span>
<span class="line"><span style="color: #D8DEE9FF">   </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">root</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">UNNEST</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">product_2</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">product_2</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">WHERE</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">root</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">transaction_date</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">DATE</span><span style="color: #D8DEE9FF">(</span><span style="color: #B48EAD">2022</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">)</span><span style="color: #81A1C1">;</span></span>
<span class="line"></span></code></pre></div>






<p>The final query is more complicated than the one in the join approach, but it does not require creating any ‘hacky’ mirror tables.</p>
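


<p>A quick sanity check is to compare the row counts of the two output tables. This is a sketch that assumes both queries above have already been run, and that each product appears at most once per shop and transaction date. If products repeat, the counts can differ: the window frame pairs rows by position, while the join compares product values.</p>



<pre class="wp-block-code"><code>-- Both counts should match if products are unique per shop and date.
SELECT
  (SELECT COUNT(*) FROM db.products_pairs_join) AS join_rows,
  (SELECT COUNT(*) FROM db.products_pair_aggregate) AS aggregate_rows;</code></pre>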



<h2 class="wp-block-heading">Results</h2>



<p>Both solutions can be implemented as unpartitioned, partitioned, or partitioned and clustered tables (along with the queries creating them), so there are six implementations to compare. Five of them work with BQ on-demand pricing; the aggregation approach with partitioning only fails.</p>
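


<p>For reference, a partitioned and clustered variant of the input table could be created roughly as follows. This is a sketch under assumptions: the table name db.products_partitioned is hypothetical, transaction_date is assumed to be a DATE column, and both the monthly partitioning and the choice of shop as the clustering column are illustrative rather than the setup used in the benchmark.</p>



<pre class="wp-block-code"><code>-- Hypothetical partitioned and clustered copy of the input table.
-- Partitioning lets a filter on transaction_date prune whole partitions;
-- clustering co-locates the rows of one shop within each partition.
CREATE TABLE IF NOT EXISTS db.products_partitioned
 PARTITION BY DATE_TRUNC(transaction_date, MONTH)
 CLUSTER BY shop
 AS SELECT * FROM db.products;</code></pre>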



<p>The results are presented in the table below. Each approach has been measured in three configurations: vanilla (no optimisation), partitioning (P), and partitioning with clustering (P+C). Each join cell holds two values: creating the mirror table (upper) and the join itself (lower).</p>



<figure class="wp-block-table wp-block-table--scrolled"><table><tbody><tr><td></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Join</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Join P</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Join <br>P + C</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Aggregate</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Aggregate<br>P</mark></td><td><mark style="background-color:rgba(0, 0, 0, 0);color:#c8855a" class="has-inline-color">Aggregate <br>P + C</mark></td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Processed size</mark></td><td>2 340 MB<br>4 670 MB</td><td>2 340 MB<br>229 MB</td><td>2 340 MB<br>229 MB</td><td>2 340 MB</td><td><em>failed</em></td><td>115 MB</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Billed size</mark></td><td>2 340 MB<br>4 670 MB</td><td>2 340 MB<br>229 MB</td><td>2 340 MB<br>229 MB</td><td>2 340 MB</td><td><em>failed</em></td><td>115 MB</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Wall time</mark></td><td>4s<br>1m 54s</td><td>9s<br>29m 48s</td><td>3s<br>1m 54s</td><td>1m 31s</td><td><em>failed</em></td><td>7m 34s</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Slot time</mark></td><td>4m 35s<br>7h 18m</td><td>3m 40s<br>9h 16m</td><td>6m 27s<br>10h 19m</td><td>8h 20m</td><td><em>failed</em></td><td>15h 07m</td></tr><tr><td><mark style="background-color:rgba(0, 0, 0, 0);color:#827a02" class="has-inline-color">Shuffled</mark></td><td>5 GB<br>423 GB</td><td>6 GB<br>507 GB</td><td>6 GB<br>841 GB</td><td>737 GB</td><td><em>failed</em></td><td>840 GB</td></tr></tbody></table><figcaption class="wp-element-caption">Benchmark results for the join and aggregate approaches</figcaption></figure>



<h2 class="wp-block-heading">Main points</h2>



<p>1. Partitioning the filtered tables is necessary to bring the cost of a query into an acceptable range.</p>



<p>2. Both the join-based and the aggregate-based queries can produce the output at a reasonable cost in time and money. The aggregate one further halves the cost, as it does not require creating a mirror table.</p>



<p>3. Apart from the pure benchmarking results, it is worth comparing query complexity when deciding on the approach. The join query may be trickier to maintain due to the extra table, but it is much easier to understand than the aggregate one.</p>
<p>The post <a href="https://tantusdata.com/insights/big-cartesian-join-in-big-query/">Big cartesian join in Big Query</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
