<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TantusData, Author at TantusData</title>
	<atom:link href="https://tantusdata.com/author/tantusdata/feed/" rel="self" type="application/rss+xml" />
	<link>https://tantusdata.com</link>
	<description>That uncovers wisdom.</description>
	<lastBuildDate>Wed, 23 Jul 2025 11:37:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.7.1</generator>

<image>
	<url>https://tantusdata.com/app/uploads/2023/01/cropped-Favicon-32x32.png</url>
	<title>TantusData, Author at TantusData</title>
	<link>https://tantusdata.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Speeding up chatbot with tools</title>
		<link>https://tantusdata.com/insights/speeding-up-chatbot-with-tools/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Wed, 23 Jul 2025 11:37:50 +0000</pubDate>
				<category><![CDATA[Chatbot Development]]></category>
		<category><![CDATA[LangChain]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[UserExperience]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=2197</guid>

					<description><![CDATA[<p>OVERVIEW Large Language Models (LLMs) can now use plug-ins to access extra tools. But they often respond slowly when these tools are used. This hurts the user experience.&#160; In this post, we’ll show a simple trick that speeds up LLM responses. It will help you build more practical and efficient LLM-based solutions. GOALS EXAMPLE USE [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/speeding-up-chatbot-with-tools/">Speeding up chatbot with tools</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc1AF3CGgfuNmwVPLWtVJwzZ9jqlQ_zgPhfyRLY3gS_mHU5ojNTpe_Evbw35j9jxpcQJytuG3hzvZaBY4GOu5mJUerTZCQ4LKdlxTAFraAW3mAn4dw9GFPrECWQbZRcOLQW4a8IVw?key=dws2ubo1QasOI7JwnJrYJuDE" alt=""/></figure>



<h2 class="wp-block-heading"><strong>OVERVIEW</strong></h2>



<p>Large Language Models (LLMs) can now use plug-ins to access extra tools. But they often respond slowly when these tools are used. This hurts the user experience.&nbsp;</p>



<p>In this post, we’ll show a simple trick that speeds up LLM responses. It will help you build more practical and efficient LLM-based solutions.</p>



<h2 class="wp-block-heading"><strong>GOALS</strong></h2>



<ol class="wp-block-list">
<li> Speed up bot answers.</li>
</ol>



<h2 class="wp-block-heading"><strong>EXAMPLE USE CASE</strong></h2>



<p>Let’s say we’re using OpenAI’s ChatGPT via LangChain for our holiday rental website, where visitors can chat with a bot to ask about our current holiday offers or about general geography. We leave the geography questions entirely to the LLM, but showing our deals is something we have to implement on our end. LangChain makes this possible through <a href="https://python.langchain.com/docs/concepts/tools/">tools</a>.</p>



<p>Let’s say our chatbot function already looks like this:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="def get_chatbot_answer(user_message: str) -&gt; str" style="color:#D4D4D4;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki dark-plus" style="background-color: #1E1E1E" tabindex="0"><code><span class="line"><span style="color: #9CDCFE">def</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">get_chatbot_answer</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">user_message</span><span style="color: #D4D4D4">: </span><span style="color: 
#9CDCFE">str</span><span style="color: #D4D4D4">) -&gt; </span><span style="color: #9CDCFE">str</span></span></code></pre></div>



<p></p>



<p>and the chat has a tool function that looks like this:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="def get_holiday_offers(place: str, month: str) -&gt; str" style="color:#D4D4D4;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki dark-plus" style="background-color: #1E1E1E" tabindex="0"><code><span class="line"><span style="color: #9CDCFE">def</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">get_holiday_offers</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">place</span><span style="color: #D4D4D4">: </span><span style="color: 
#9CDCFE">str</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">month</span><span style="color: #D4D4D4">: </span><span style="color: #9CDCFE">str</span><span style="color: #D4D4D4">) -&gt; </span><span style="color: #9CDCFE">str</span></span></code></pre></div>



<p></p>



<p>While tools are very easy to use and fit this case well, the time the LLM spends processing tool responses can be huge, large enough to discourage some users from using our chatbot.</p>



<h2 class="wp-block-heading"><strong>SOLUTION</strong></h2>



<p>We can split our chatbot into two chatbots: one that does the same as the original, and another that decides whether we need the first one at all or whether the user just wants to see holiday offers (in which case we can skip the first, slower one). In the latter case, we can call the tool ourselves.</p>



<p>First, we create the preprocessing chatbot, which returns three outputs (we can use <a href="https://python.langchain.com/docs/concepts/structured_outputs/">structured output</a> for that). It will look like this:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="def chatbot_preprocessing(user_message: str) -&gt; (bool, str, str)" style="color:#D4D4D4;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki dark-plus" style="background-color: #1E1E1E" tabindex="0"><code><span class="line"><span style="color: #9CDCFE">def</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">chatbot_preprocessing</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">user_message</span><span style="color: #D4D4D4">: 
</span><span style="color: #9CDCFE">str</span><span style="color: #D4D4D4">) -&gt; (</span><span style="color: #9CDCFE">bool</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">str</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">str</span><span style="color: #D4D4D4">)</span></span></code></pre></div>



<p></p>



<p>It returns a boolean that says whether the user only wants to see holiday offers, followed by the place and the month (we only care about those two if the boolean is true). Now we can change the flow of our chat a bit:</p>



<p class="has-text-align-center"></p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#1E1E1E"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="def get_chatbot_answer_with_preprocessing(user_message: str) -&gt; str:
	only_show_offers, place, month = chatbot_preprocessing(user_message)
	if only_show_offers:
		return get_holiday_offers(place, month)
	else:
		return get_chatbot_answer(user_message)" style="color:#D4D4D4;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki dark-plus" style="background-color: #1E1E1E" tabindex="0"><code><span class="line"><span style="color: #9CDCFE">def</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">get_chatbot_answer_with_preprocessing</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">user_message</span><span style="color: #D4D4D4">: </span><span style="color: #9CDCFE">str</span><span style="color: #D4D4D4">) -&gt; </span><span style="color: #C8C8C8">str</span><span style="color: #D4D4D4">:</span></span>
<span class="line"><span style="color: #D4D4D4">	</span><span style="color: #9CDCFE">only_show_offers</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">place</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">month</span><span style="color: #D4D4D4"> = </span><span style="color: #DCDCAA">chatbot_preprocessing</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">user_message</span><span style="color: #D4D4D4">)</span></span>
<span class="line"><span style="color: #D4D4D4">	</span><span style="color: #C586C0">if</span><span style="color: #D4D4D4"> </span><span style="color: #C8C8C8">only_show_offers</span><span style="color: #D4D4D4">:</span></span>
<span class="line"><span style="color: #D4D4D4">		</span><span style="color: #C586C0">return</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">get_holiday_offers</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">place</span><span style="color: #D4D4D4">, </span><span style="color: #9CDCFE">month</span><span style="color: #D4D4D4">)</span></span>
<span class="line"><span style="color: #D4D4D4">	</span><span style="color: #C586C0">else</span><span style="color: #D4D4D4">:</span></span>
<span class="line"><span style="color: #D4D4D4">		</span><span style="color: #C586C0">return</span><span style="color: #D4D4D4"> </span><span style="color: #DCDCAA">get_chatbot_answer</span><span style="color: #D4D4D4">(</span><span style="color: #9CDCFE">user_message</span><span style="color: #D4D4D4">)</span></span></code></pre></div>
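<p>For illustration, the preprocessing output could be validated roughly as sketched below. This is only a sketch under assumptions: the <code>OfferIntent</code> schema, the <code>parse_intent</code> and <code>as_tuple</code> helpers, and the JSON field names are hypothetical and do not come from LangChain or from the original code. In practice the JSON reply would come from the LLM’s structured output. The point is that any malformed or incomplete reply should fall back to the full chatbot rather than call the tool with missing arguments.</p>

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OfferIntent:
    """Hypothetical schema for the preprocessing chatbot's structured output."""
    only_show_offers: bool
    place: Optional[str] = None
    month: Optional[str] = None


def parse_intent(raw_reply: str) -> OfferIntent:
    """Parse the model's JSON reply; fall back to the full chatbot on any doubt."""
    fallback = OfferIntent(only_show_offers=False)
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("only_show_offers") is not True:
        return fallback
    place, month = data.get("place"), data.get("month")
    # The shortcut is only safe when both tool arguments are present.
    if not (isinstance(place, str) and place and isinstance(month, str) and month):
        return fallback
    return OfferIntent(True, place, month)


def as_tuple(intent: OfferIntent) -> Tuple[bool, Optional[str], Optional[str]]:
    """Match the (bool, str, str) shape returned by chatbot_preprocessing."""
    return intent.only_show_offers, intent.place, intent.month
```

<p>With a helper like this, a reply that sets the flag but omits the month is treated the same as a general question and is routed to the original chatbot.</p>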



<p class="has-text-align-center"></p>



<p></p>



<p>This way, if the user only wants offers (which is probably more than half the time), the chatbot answers much faster, because the tool result is returned directly instead of being processed by the LLM.</p>
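<p>The routing can be checked without calling any model. The sketch below is an assumption-laden illustration, not the original implementation: <code>make_router</code> and the three <code>fake_*</code> stand-ins are hypothetical names, and the stand-ins only record which path was taken so the flow can be tested offline.</p>

```python
from typing import Callable, List, Optional, Tuple

Preprocessor = Callable[[str], Tuple[bool, Optional[str], Optional[str]]]


def make_router(
    preprocess: Preprocessor,
    get_offers: Callable[[str, str], str],
    full_chatbot: Callable[[str], str],
) -> Callable[[str], str]:
    """Build the two-stage flow from injected parts so it can be tested offline."""

    def answer(user_message: str) -> str:
        only_show_offers, place, month = preprocess(user_message)
        if only_show_offers and place and month:
            # Fast path: return the tool output directly, no LLM post-processing.
            return get_offers(place, month)
        # Slow path: anything else goes to the original tool-using chatbot.
        return full_chatbot(user_message)

    return answer


# Hypothetical stand-ins that record which path was taken.
calls: List[str] = []


def fake_preprocess(message: str) -> Tuple[bool, Optional[str], Optional[str]]:
    if "offers" in message:
        return True, "Crete", "July"
    return False, None, None


def fake_offers(place: str, month: str) -> str:
    calls.append("tool")
    return f"Offers for {place} in {month}"


def fake_chatbot(message: str) -> str:
    calls.append("llm")
    return "General answer"


route = make_router(fake_preprocess, fake_offers, fake_chatbot)
```

<p>Passing the three functions in as arguments keeps the routing logic separate from the model calls, so the real <code>chatbot_preprocessing</code>, <code>get_holiday_offers</code> and <code>get_chatbot_answer</code> can be swapped in later.</p>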



<h2 class="wp-block-heading"><strong>CONCLUSION</strong></h2>



<p>It is possible to speed up chat conversations not only with faster hardware, but also with a few simple tricks. This particular trick helps a lot with simple chatbots whose job is to let website users quickly see the available products and services.</p>



<p>The post <a href="https://tantusdata.com/insights/speeding-up-chatbot-with-tools/">Speeding up chatbot with tools</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>LLMs and LangChain &#8211; Getting Started Guide </title>
		<link>https://tantusdata.com/insights/llms-and-langchain-getting-started-guide/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Thu, 17 Apr 2025 14:10:06 +0000</pubDate>
				<category><![CDATA[ChatBot]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Custom LLM Integration]]></category>
		<category><![CDATA[Embeddings]]></category>
		<category><![CDATA[LangChain]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Workshops]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=2194</guid>

					<description><![CDATA[<p>Hands-on workshop on building GPT-powered apps with LangChain, Agents, embeddings, and vector databases. Learn to create accurate, real-time AI solutions.</p>
<p>The post <a href="https://tantusdata.com/insights/llms-and-langchain-getting-started-guide/">LLMs and LangChain &#8211; Getting Started Guide </a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image"><img decoding="async" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdkblVpi4RIKkpplA394yrz2huIAU6Cfitr1RakjnO_eE2nRGj3S-6ayAhDFv625mPcvx8Tc3yWP4A2zmQ5CR-FvdjI0H0Mb55r2bp4jmGHhLG41xLyLH8_H7FDUN9qMTHchkO1?key=hPVoxYPm7EKr5RpXGhisew2k" alt=""/></figure>



<p></p>



<p>During their workshop at Big Data Conference Europe, <strong>Marcin and Bartek</strong> guided participants through the essentials of GPT-powered LLM applications. By the end of the session, attendees had a solid grasp of developing with Chains and powerful reasoning Agents, integrating external APIs and context sources, and building solutions that excel in question-answering over documents—all with a focus on accuracy and safety.</p>



<h2 class="wp-block-heading"><strong>What it covered:&nbsp;</strong></h2>



<ul class="wp-block-list">
<li><strong>Foundations of LLMs:</strong> Participants discovered which Large Language Models best suited different needs and how to apply them effectively.</li>



<li><strong>LangChain Essentials:</strong> They gained a solid understanding of the LangChain API and learned how Chains and Agents can drive dynamic, intelligent behavior in applications.</li>



<li><strong>Context and Integrations:</strong> Attendees explored how to integrate external APIs, pass context between components, and leverage vector databases for advanced features.</li>



<li><strong>Question Answering with Your Own Data:</strong> They used embeddings and combined ChatGPT, VectorDb, and LangChain to develop robust question-answering systems.</li>



<li><strong>Reasoning Agents:</strong> Participants examined how Agents can utilize real-time tools such as Google Search or Wolfram Alpha for powerful, on-the-fly problem-solving.</li>



<li><strong>Accuracy and Safety:</strong> They delved into techniques for self-querying, hallucination checks, and output moderation to ensure reliable application performance.</li>



<li><strong>Tuning and Production:</strong> Finally, they got a glimpse into real-world deployment, from optimizing embeddings to managing inference and costs.</li>
</ul>



<p></p>



<h2 class="wp-block-heading"><strong>Missed the Session? Don’t Worry!</strong></h2>



<p>Couldn’t make it to our workshop at Big Data Conference Europe? We’ve got you covered. We can bring the same interactive experience straight to your team.</p>



<p></p>



<h2 class="wp-block-heading"><strong>By the End of This Workshop, You Will:</strong></h2>



<ul class="wp-block-list">
<li>Gain a strong understanding of ChatGPT-powered LLM applications.</li>



<li>Master the LangChain API to build and orchestrate Chains and Agents for complex decision-making.</li>



<li>Develop real-world applications integrating external APIs and data.</li>



<li>Build reasoning Agents that tackle dynamic problems in real-time.</li>



<li>Implement crucial techniques for application safety and accuracy.</li>



<li>Combine ChatGPT, VectorDb, and LangChain into powerful question-answering systems.</li>



<li>Level up your AI skills and transform your creative ideas into functional, future-ready solutions.</li>
</ul>



<p></p>



<h2 class="wp-block-heading"><strong>Hands-On Agenda</strong></h2>



<ol class="wp-block-list">
<li><strong>Introduction</strong>
<ul class="wp-block-list">
<li>Overview of various LLMs – what’s good for your use case?</li>



<li>Introduction to LangChain</li>
</ul>
</li>



<li><strong>Building with Chains and Agents</strong>
<ul class="wp-block-list">
<li>LangChain API</li>



<li>Passing context between components</li>



<li>Integrations with external APIs and data sources</li>



<li>Comparing Chains vs. Agents</li>
</ul>
</li>



<li><strong>Question Answering Over Documents</strong>
<ul class="wp-block-list">
<li>Introduction to embeddings</li>



<li>Overview of vector databases</li>



<li>Converting documents to vectors (plus common gotchas)</li>



<li>Building a simple application with ChatGPT, VectorDb, and LangChain</li>
</ul>
</li>



<li><strong>Creating Powerful Reasoning Agents</strong>
<ul class="wp-block-list">
<li>Dynamic decision-making</li>



<li>Integrations with tools (e.g., Google Search, Wolfram Alpha)</li>
</ul>
</li>



<li><strong>Summary &amp; Best Practices</strong>
<ul class="wp-block-list">
<li>Techniques for improving accuracy and safety: self-querying, hallucination checks, and output moderation</li>



<li>LLMs in production: key challenges and how to tackle them</li>



<li>Tuning opportunities: embeddings, inference optimization, cost management</li>
</ul>
</li>
</ol>



<p></p>



<h2 class="wp-block-heading"><strong>Ready for a Custom Workshop?</strong></h2>



<p>Looking to tailor this session to your organization’s specific needs? We can help. <strong><a href="https://tantusdata.com/contact-us/">Contact us</a></strong> to explore how we can design a workshop that unleashes the power of generative AI for your projects. Let’s build the future together!<br><br></p>



<p></p>
<p>The post <a href="https://tantusdata.com/insights/llms-and-langchain-getting-started-guide/">LLMs and LangChain &#8211; Getting Started Guide </a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The TantusData Summer Challenge: terms and conditions</title>
		<link>https://tantusdata.com/insights/the-tantusdata-summer-challenge-terms-and-conditions/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Fri, 14 Jul 2023 11:32:05 +0000</pubDate>
				<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1598</guid>

					<description><![CDATA[<p>Rules of the “Summer Challenge” contest on LinkedIn § 1 [Definitions] The phrases and terms used in the content of these Rules shall have the meanings indicated below:1. „Organizer&#8221; &#8211; TantusData Sp. z o.o. based in POLAND, ul.&#160; Alexa Niepodległości 132/136 unit 3, 02-544 Warszawa, under KRS number: 0000930059, NIP: 5213944638, REGON: 520344085, being the [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/the-tantusdata-summer-challenge-terms-and-conditions/">The TantusData Summer Challenge: terms and conditions</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Rules of the “Summer Challenge” contest on LinkedIn</h2>



<p></p>



<p><strong>§ 1 [Definitions]</strong></p>



<p><br>The phrases and terms used in these Rules shall have the meanings indicated below:<br>1. &#8220;Organizer&#8221; &#8211; TantusData Sp. z o.o. based in POLAND, ul.&nbsp; Alexa Niepodległości 132/136 unit 3, 02-544 Warszawa, under KRS number: 0000930059, NIP: 5213944638, REGON: 520344085, being the owner of the TantusData brand;<br>2. &#8220;Rules&#8221; &#8211; these Rules of the Contest entitled &#8220;TantusData Summer Challenge&#8221;, binding for the Organizer and Contestants, regulating the terms and conditions of the Contest, in particular the terms of participation in the Contest and the rights and obligations of the Organizer, Contest Committee and Contestants in relation to their participation in the Contest;<br>3. &#8220;Contest&#8221; &#8211; the contest entitled &#8220;TantusData Summer Challenge&#8221;, conducted by the Organizer under these Rules;<br>4. &#8220;LinkedIn&#8221; &#8211; the website under the domain www.linkedin.com, where the Contest is partially held, owned by Microsoft Corporation;<br>5. &#8220;Fanpage&#8221; &#8211; the profile named &#8220;@TantusData&#8221; on the LinkedIn website available at: <em>https://www.linkedin</em>, owned and administered by the Organizer;<br>6. &#8220;Contestant&#8221; &#8211; a LinkedIn user who has made a correct and effective entry to the Contest, meeting the eligibility requirements for participation in the Contest as described in § 3 of the Rules;<br>8. &#8220;Contest Committee&#8221; &#8211; a team consisting of persons selected by the Organizer, which evaluates the entries;</p>



<p>9. &#8220;Winner&#8221; &#8211; a Contestant who has submitted a correct and effective entry to the Contest and whose entry was awarded by the Contest Committee in accordance with the provisions of §4 and §6 of these Rules.</p>



<p><br><strong>§ 2 [General provisions]</strong></p>



<p><br>1. The Organizer is the funder of the Prizes.<br>2. The Organizer is the administrator of personal data provided by Contestants.<br>3. Providing personal data is optional, but necessary for the Contestant to enter the Contest. Persons who provide their data have the right to access, modify or delete them.<br>4. These Rules define the conditions of the Contest.<br>5. The Contest is not created, administered, endorsed or sponsored by the social networking site LinkedIn, by Microsoft Corporation, by Kaggle or by Google LLC. &#8220;LinkedIn&#8221; is a registered trademark of <a href="https://trademarks.justia.com/owners/linkedin-corporation-1472257/">LinkedIn Corporation</a>.<br>6. The Contest is carried out on the Contestant&#8217;s profile and the Organizer&#8217;s Fanpage.<br>7. The Organizer&#8217;s Contest Committee supervises the correctness and course of the Contest, i.e. providing information on the Contest and dealing with complaints.<br>8. The Contest runs from 21 July 2023 to 16 September 2023.</p>



<p></p>



<p><strong>§ 3 [Contestants]</strong></p>



<p><br>1. Only natural persons who are consumers, have full legal capacity, are LinkedIn users with an active LinkedIn account, abide by the rules of the LinkedIn service and have accepted these Rules may participate in the Contest.<br>2. The Contestant declares that he/she/they:<br>a. is a natural person with full legal capacity;<br>b. is domiciled in the territory of <em>the European Union or the United Kingdom of Great Britain and Northern Ireland;</em><br>c. is familiar with the content of these Rules and voluntarily enters the Contest;</p>



<p>d. fully accepts the terms of these Rules, including the procedure for claiming prizes;</p>



<p>e. undertakes to comply with the provisions of these Rules and the rules of the LinkedIn service;</p>



<p>f. has agreed to the processing of his/her/their personal data for the purposes of participating in the Contest;</p>



<p>g. is a registered user of the LinkedIn social network.</p>



<p>3. The employees and collaborators of the Organizer cannot participate in the Contest.</p>



<p><br><strong>§ 4 [Prize]</strong></p>



<p></p>



<p>1. The Prizes will be awarded to the first three (3) persons who, in the opinion of the Contest Committee, best complete the Contest Task and are selected by the Contest Committee as Winners.</p>



<p>2. The Organizer will award three Prizes to the Contestants declared Winners (&#8220;Prizes&#8221;), one Prize per Winner. The Prizes in the Contest are, respectively:</p>



<p>a) For first place – an office survival kit containing: a 50 EUR Amazon gift card, a notebook, 2 Post-its, 5 pens, a mug and tea, a bag, a T-shirt, cookies, an adult colouring book and crayons;</p>



<p>b) For second place – an office survival kit containing: a 50 EUR Amazon gift card, a notebook, 2 Post-its, 5 pens, a mug and tea, a bag, a T-shirt, cookies, an adult colouring book and crayons;</p>



<p>c) For third place – an office survival kit containing: a 50 EUR Amazon gift card, a notebook, 2 Post-its, 5 pens, a mug and tea, a bag, a T-shirt, cookies, an adult colouring book and crayons.</p>



<p>3. Prizes will be issued only in the form specified herein, without the possibility of exchanging them for another material prize or for a cash equivalent.&nbsp;</p>



<p>4. The prizes will be issued to the Winners in a manner communicated to them in further communication.</p>



<p>5. The Winners will be informed about the Prize and the conditions of Prize collection in an announcement on the Fanpage and through a private message sent to the Winners on LinkedIn within 10 working days after the end of the Contest.</p>



<p>6. By entering the Contest, the Contestant agrees that the Organizer may communicate with him/her/them in connection with the Contest via the LinkedIn account.</p>



<p>7. In order for the Winner to receive the Prize, the necessary data must be sent to the Organizer within 48 hours of the prize communication being sent to the Winner (to the email address provided by the Organizer in the message). The data required to release the Prize are:</p>



<p>a) name and surname;</p>



<p>b) residential address (for tax purposes);</p>



<p>c) correspondence address (for the purposes of delivering the Prize, it may be different from the residential address);</p>



<p>d) telephone number;</p>



<p>e) e-mail address.</p>



<p>9. The Prizes will be dispatched without undue delay after the end of the Contest. The Prizes shall be sent by post to the address indicated by the Winner. The Organizer shall not be liable for the actions or omissions of the delivery service providers.</p>



<p>10. The Winner may surrender the Prize but shall not be entitled to the cash equivalent or any other prize. In the event of the Prize being forfeited, the Organizer reserves the right to award the Prize to another Contestant or to decide not to award the Prize. The Winner may not transfer the right to the Prize to third parties.</p>



<p>11. If the Winner fails to provide the Organizer with the data referred to in point 7 above, exceeds the permitted response time, or sends incorrect data, the Winner loses the right to the Prize.</p>



<p>12. The Organizer reserves the right to verify whether the Winners fulfil the conditions stipulated in these Rules and in the applicable legal regulations. For this purpose, the Organizer may require the Winner to make specific statements, provide specific data or submit specific documents. Failure to comply with these Rules or the relevant provisions of law, as well as refusal to comply with the above demands, may result in the exclusion of a given person from the Contest, and shall entitle the Organizer to refuse to award the Prize without any claims against the Organizer.</p>



<p>13. All participants who complete the Contest Task appropriately, as determined by the Contest Committee, will be awarded online certificates of successful completion. The certificates will be issued within 14 days of the Contest closing date and will be sent to the participants via LinkedIn&#8217;s message service. The certificates confirm only the successful completion of the Contest Task; the Organiser is not responsible for verifying the User&#8217;s completion process, only the correctness of the solution submitted.</p>



<p><br><strong>§ 5 [Rules of the Contest]</strong></p>



<p></p>



<p>1. The Contestant&#8217;s task is to create a model to detect whether the same author has written two texts using the datasets provided by TantusData within the challenge on the Kaggle platform. The data provided was collected from articles from our blog and two publicly available datasets of news&nbsp;and blog posts. Participants can also use external datasets. The participants are asked to create a smart solution which can take advantage of even a small dataset. The participants must present this in the format specified on Kaggle under the evaluation tab. The participants are also asked to write a description in a Kaggle notebook and make it public after the submission deadline. The participants are required to submit their entries in Kaggle under the submissions tab for the competition, as specified in the overview. This is the complete Summer Challenge task (hereinafter: &#8220;Contest Task&#8221;). Full instructions and guidelines are available on Kaggle under the challenge section.</p>



<p>2. One Contestant may make multiple submissions, but will be asked to select their best attempt at the Contest Task. The Contestant must work alone.</p>



<p>3. The Contestant is forbidden to take any action in connection with participation in the Contest contrary to the law and good manners, and to use data obtained in connection with participation in the Contest for illegal purposes. In particular, in the content of an answer to a Contest Task, a Contestant cannot include:</p>



<p>a) materials and symbols against the law;</p>



<p>b) materials and symbols violating the rights of third parties or the Organizer, especially those violating intellectual property rights or the personal rights of third parties or the Organizer;</p>



<p>c) expressions commonly regarded as morally reprehensible or socially inappropriate as well as content violating good morals;</p>



<p>d) obscene or pornographic materials and content;</p>



<p>e) materials and symbols propagating violence or discrimination, inciting racial, religious or ethnic hatred, socially recognised as offensive, vulgar, etc.;</p>



<p>f) content violating the rules of so-called &#8220;netiquette&#8221;;</p>



<p>g) personal data of other people in an unlawful scope, in particular it is forbidden to send an answer to a contest task using someone else&#8217;s name and surname in order to impersonate a particular person;</p>



<p>h) materials, the use of which by the Organiser may hinder or prevent the operation of other programs used by the Organiser, especially materials containing computer viruses.</p>



<p>4. Answers to the Contest Task that do not meet the requirements specified in the Rules will be excluded from the Contest.</p>



<p>5. In order to ensure the proper organisation and conduct of the Contest, and in particular to assess the accuracy of entries and select the Winners, the Organiser appoints a Contest Committee, which will supervise and arbitrate the Contest.</p>



<p>6. The Contest Committee will award the accurate solutions that most swiftly and appropriately realise the Contest Task. Submissions are evaluated based on the accuracy score&nbsp;between the predicted label and the true target and a description of a scaled and optimal solution that best fulfils the task with sensible use of resources.</p>



<p>7. From among the submitted Contest Tasks, the Contest Committee will select the Winners, indicating the decisive features of the selection and deciding also on the allocation of places and the distribution of Prizes.</p>



<p>8. The Organizer will also announce the Winners publicly on the Fanpage.</p>



<p>9. Detailed information about the contest will be available on the Organizer&#8217;s website.</p>



<p><br><strong>§ 6 [Conditions of participation in the Contest]</strong></p>



<p></p>



<p>1. Access to the Contest is free of charge.</p>



<p>2. Necessary conditions for participation in the Contest are:</p>



<p>a. acceptance of the Rules and the correct completion of all tasks described in § 5 Paragraph 1 of the Rules;</p>



<p>b. the Contestant&#8217;s consent to the processing of personal data described in these Regulations by the Organizer (in particular the consent described in § 8 of the Rules);</p>



<p>c. transfer of copyrights as referred to in § 9 of the Rules.</p>



<p><br><strong>§ 7 [Organiser&#8217;s responsibilities]</strong></p>



<p></p>



<p>1. The Organiser shall not be responsible for the accuracy and truthfulness of the data of the Contestants, including the inability to deliver the Prizes for reasons attributable to the Participant, in particular if the Participant did not provide a real mailing address, or if the data provided is incomplete or outdated.</p>



<p>2. The Organiser declares that it does not control or monitor the content posted by Participants in terms of reliability and truthfulness, subject to actions related to the removal of violations of the Regulations or generally applicable laws.</p>



<p>3. The Organizer reserves the right to exclude from the Contest those Contestants whose actions contravene the law, the Rules, or LinkedIn rules, in particular Contestants who:</p>



<p>a) post content that contravenes the applicable law, the Rules, or LinkedIn rules (in particular, content that is offensive in its text or graphics);</p>



<p>b) complete the Contest Task with help (not on their own);</p>



<p>c) interfere with the Contest mechanism.</p>



<p>4. The Organizer is not responsible for any malfunctions in data communications links, servers, interfaces, browsers, the LinkedIn platform, or the Kaggle platform.</p>



<p>5. The Organizer is not responsible for temporary or permanent blocking of the LinkedIn profiles and pages or any of its mobile applications.</p>



<p></p>



<p><strong>§ 8 [Processing of personal data]</strong></p>



<p></p>



<p>1. Personal data of Contestants, including their image, shall be processed by the Organizer only for the purpose of performing activities necessary for the proper conduct of the Contest.</p>



<p>2. Personal data of the Competition Participants will be stored by the Organiser only for the time necessary to conduct the Competition and award the prizes to the awarded Participants.</p>



<p>3. Personal data will be processed for the following purposes: publication of competition posts and their promotion in:</p>



<p>a. any social media of TantusData brand (LinkedIn; YouTube);</p>



<p>b. on the Organizer&#8217;s website in the domain www.tantusdata.com.</p>



<p>4. Participants have the right to access, correct and delete processed data. Data is provided on a voluntary basis, with registration on the <strong>??</strong> submission form (e.g. Google form/name of the server) network required for participation in the Competition. The Organiser is not responsible for the way LinkedIn (the service used) processes personal data.</p>



<p>5. The Contestant has at any time the right to request the deletion of his personal data and the data of minors from his post. Upon deletion of the data, the Contestant loses the possibility to participate in the Contest.</p>



<p>6. The use of personal data takes place on global communication channels, without territorial restrictions and without time limits (until the removal of data by the Contestants or the cessation of the Organizer&#8217;s activities).</p>



<p><br><strong>§ 9 [Copyright]</strong></p>



<p></p>



<p>1. It is forbidden to infringe in any way the intellectual property rights in the Contest, especially the unauthorized use by the Contestants of works authored by third parties.</p>



<p>2. Contestants in respect of whom the Organiser has received information that they are not the authors of Contest Tasks, or do not have the rights to the answers to the Contest Task, are subject to exclusion from the Contest.<br>3. In the event that an answer to a Contest Task is a work as defined by the Polish Act on Copyright and Similar Rights, the entirety of the Participant&#8217;s economic copyright to the answer to the Contest Task and the right to permit the exercise of subsidiary rights, without any time and territorial restrictions, passes to the Organiser. The Organiser shall be entitled to use and dispose of the work – the contest posts &#8211; in the following forms of exploitation:<br>a) within the scope of recording and multiplying the work in whole or in part &#8211; the production of copies of the work using a specific technique, including digital reproduction, printing, reprography, magnetic recording and digital technique;<br>b) within the scope of dissemination of the work, in its entirety or in part, in a manner other than specified in item a) above &#8211; placing on social media, websites, publishing in the Organiser&#8217;s promotional and advertising materials, exhibiting, displaying, reproducing, as well as broadcasting and rebroadcasting, as well as making the work available to the public in such a way that everyone can have access to it in a place and time selected by themselves, including via the Internet.</p>



<p></p>



<p><strong>§ 10 [Complaints and notifications of violations]</strong></p>



<p><br>1. Any complaints regarding the way the Contest is carried out should be submitted by the Contestant in writing during the Contest, but not later than within 14 (fourteen) days from the date of issuing the Prizes.<br>2. A complaint submitted after the deadline shall have no legal effect.<br>3. A written complaint should contain the name, surname, and exact address of the Contestant and a detailed description and justification of the complaint.<br>4. The complaint should be sent by registered mail or courier service to the address of the Organiser.<br>5. Complaints will be considered in writing within 30 days from their submission.</p>



<p><br><strong>§ 11 [Final provisions]</strong></p>



<p></p>



<p>1. The Regulations shall enter into force on the first day of the competition, i.e. July 21st 2023.<br>2. In matters not covered by these Regulations, the provisions of the Polish Civil Code and other applicable laws shall apply.<br>3. Disputes relating to and arising from the Contest will be resolved by a court of law with jurisdiction over the Organiser&#8217;s registered office.<br>4. The Organizer reserves the right to change the rules of the Contest during its duration for important reasons, with the reservation that any changes in the Rules will not affect the rights of Contest participants acquired on the basis of the Rules before the date of entry into force of these changes. Information about changes will be posted on the Fanpages.<br>5. A brief description of the Contest rules can be found in advertising and information materials accompanying the Contest, in particular on the Fanpages. All content included in these materials is for information purposes only. Only the provisions of these Regulations are binding.<br>6. The Contest Regulations are available on the Fanpages.</p>
<p>The post <a href="https://tantusdata.com/insights/the-tantusdata-summer-challenge-terms-and-conditions/">The TantusData Summer Challenge: terms and conditions</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Summer Challenge</title>
		<link>https://tantusdata.com/insights/the-tantusdata-summer-challenge/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Thu, 06 Jul 2023 13:58:50 +0000</pubDate>
				<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1597</guid>

					<description><![CDATA[<p>Introducing the TantusData Summer Challenge This summer, we&#8217;re not just beating the heat, we&#8217;re challenging it. At TantusData, we believe in nurturing the approachable spirit of knowledge-sharing, and to celebrate this, we&#8217;re excited to present the TantusData Summer Challenge. Our challenge is an ode to experts in the making and a refresher for the seasoned [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/the-tantusdata-summer-challenge/">The Summer Challenge</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-1024x576.jpg" alt="" class="wp-image-1685" srcset="https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-1536x864.jpg 1536w, https://tantusdata.com/app/uploads/2023/07/SummerChallenge-1-2048x1152.jpg 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Introducing the TantusData Summer Challenge</h2>



<p>This summer, we&#8217;re not just beating the heat, we&#8217;re challenging it. At TantusData, we believe in nurturing the approachable spirit of knowledge-sharing, and to celebrate this, we&#8217;re excited to present the TantusData Summer Challenge.</p>



<p>Our challenge is an ode to experts in the making and a refresher for the seasoned professionals. It’s a platform to test your skills and learn, as we will post the optimal solution with instructions later. As a cherry on top, there&#8217;s a reward &#8211; an &#8216;Office Survival Kit&#8217;, filled with a delightful mix of practical and amusing items. This includes a handy notebook, colourful post-its, essential pens, a sarcastic mug paired with a soothing tea, a stylish t-shirt, a tote, appetising cookies for that quick brain boost, and for moments of relaxation, a funny adult colouring book with crayons. To further enhance your office or workspace, we&#8217;re including a 50 EURO Amazon gift card as part of the kit. We have three of these exciting kits ready for our top participants. So buckle up and let your creative juices flow!</p>



<p>At the heart of this challenge lies a model-in-production topic. We teased this during recent ML conferences and now, it’s time for action. The challenge requires the innovative use of a Siamese Network solution to work with small datasets. Entries will be judged based on the quality and appropriateness of the submission. The finer details are provided below.</p>



<p>And that&#8217;s not all &#8211; all participants who solve the challenge correctly will be awarded certificates.</p>



<p>But wait! We&#8217;ve got a bonus round. A side challenge that promises more thrill and more creativity. Stay tuned to our emails and social media channels, specifically LinkedIn, for the announcement.</p>



<p>A little birdie told us that those who participate in this challenge will have a head start for our upcoming autumn challenge. So, we highly recommend that you keep your work saved.</p>



<h2 class="wp-block-heading">The hows and whens</h2>



<p><strong>The challenge instructions:</strong> Your task, &#8216;Authorship Comparison&#8217;, is simple but thought-provoking: create a smart solution to determine if two texts share the same author. Rather than developing a massive model, we want to see how well you can make the most of a limited dataset. The submissions will be assessed based on the scores and the approaches taken.</p>



<p>All submissions must be through the Kaggle submission system, and if you&#8217;re not a member yet, creating a login is easy and free. Check out the detailed instructions and resources on the competition page <a href="https://www.kaggle.com/competitions/authorship-comparison/overview">here</a>. </p>



<p><strong>Deadlines:</strong> The Authorship Comparison challenge opens on the 21st of July 2023. We strongly recommend that participants start no later than the 11th of September 2023, to leave sufficient time to develop and fine-tune their model.</p>



<p><strong>Important Update:</strong> In response to the feedback from our enthusiastic participants, the submission deadline for the Summer Challenge has been extended. All entries will now be accepted until <strong>16th of September, 2023</strong>, 23:59 CEST. This gives new participants additional time to join and existing participants additional time to refine their submissions. No submissions will be accepted after this time, so make sure your entry is submitted promptly. We wish everyone the best of luck!</p>



<p>For beginners and those looking for extra support, we&#8217;ve provided tips and resources linked within the challenge on Kaggle. These could be very helpful, so make sure to take a look. But remember, going through these resources might take some time in addition to the time needed for the challenge task itself.</p>



<p>Get started now, and best of luck to you all!</p>



<p><strong>The Fine Print:</strong> <a href="https://tantusdata.com/insights/the-tantusdata-summer-challenge-terms-and-conditions/">Here&#8217;s</a> the link to all the terms and conditions of the challenge. The basic rules of participation are also listed on the challenge tabs in Kaggle, which will go live tomorrow, so you can view them there as well.</p>



<p>Get ready to compete, learn, and win. And remember, the clock is ticking. Keep an eye out for our announcements!</p>



<p>Good luck, challengers!</p>



<p>#TantusDataSummerChallenge</p>
<p>The post <a href="https://tantusdata.com/insights/the-tantusdata-summer-challenge/">The Summer Challenge</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>J On the Beach x TantusData</title>
		<link>https://tantusdata.com/insights/j-on-the-beach-x-tantusdata/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Mon, 17 Apr 2023 12:39:43 +0000</pubDate>
				<category><![CDATA[Big data technologies]]></category>
		<category><![CDATA[Conference]]></category>
		<category><![CDATA[JOTB23]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1452</guid>

					<description><![CDATA[<p>J On The Beach: Where Developers and Data Scientists Meet to learn and unlock the potential of Big Data Technologies EDUCATE, SHARE, SPREAD, AND LEARN The technical conference brings together developers, data scientists, and DevOps professionals from around the world to explore the latest trends in big data technologies. This immersive event features workshops, hackathons, [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/j-on-the-beach-x-tantusdata/">J On the Beach x TantusData</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="578" src="https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023-1024x578.jpg" alt="J on The Beach 2023 conference and TantusData" class="wp-image-1455" srcset="https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023-1024x578.jpg 1024w, https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023-768x433.jpg 768w, https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023-1536x867.jpg 1536w, https://tantusdata.com/app/uploads/2023/04/TantusData_with_JOntheBeach23_May2023.jpg 1914w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>J On The Beach: Where Developers and Data Scientists Meet to learn and unlock the potential of Big Data Technologies</p>



<h2 class="wp-block-heading"><strong>EDUCATE, SHARE, SPREAD, AND LEARN</strong></h2>



<p>The technical conference brings together developers, data scientists, and DevOps professionals from around the world to explore the latest trends in big data technologies. This immersive event features workshops, hackathons, and technical talks led by top speakers in the field. Attendees will learn about a range of topics, from data collection and stream processing to machine learning, microservices, artificial intelligence, container systems, and more.</p>



<p>You can read all about this year&#8217;s edition <a href="https://www.jonthebeach.com">here</a>.</p>



<p>From the 10th to the 12th of May, fantastic international speakers will take the stage to share their top tips and tricks.</p>



<p>Our own&nbsp;<a href="https://www.linkedin.com/in/ACoAAATwupEBSS-9R8o7J9xB98KPDt2eTyWjCG0">Marcin Szymaniuk</a>&nbsp;will share his best practice takeaways on the 12th of May.&nbsp;You can find more about the topic our expert will present <a href="https://lnkd.in/dZyfyAtA">here</a>.</p>



<p><br>Did we mention that it just happens to be Friday and the conference will conclude with a great party in the evening?</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/04/TantusData_data.science_talk_JOTB23-1024x535.jpg" alt="Marcin Szymaniuk from Tantus Data speaking at J on The Beach" class="wp-image-1453" srcset="https://tantusdata.com/app/uploads/2023/04/TantusData_data.science_talk_JOTB23-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/04/TantusData_data.science_talk_JOTB23-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/04/TantusData_data.science_talk_JOTB23-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/04/TantusData_data.science_talk_JOTB23.jpg 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>


<p></p>



<p>So make sure to book your place and join us in gorgeous Málaga for invaluable insights and strategies you won&#8217;t learn elsewhere.</p>
<p>The post <a href="https://tantusdata.com/insights/j-on-the-beach-x-tantusdata/">J On the Beach x TantusData</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>All about search</title>
		<link>https://tantusdata.com/insights/all-about-search/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Wed, 29 Mar 2023 11:15:48 +0000</pubDate>
				<category><![CDATA[effective and efficient search]]></category>
		<category><![CDATA[queries]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[types of search engines]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1432</guid>

					<description><![CDATA[<p>Sit back and relax as our experts tell you all there is to know about any search project. Whether you intend to set up a search engine or improve the one you have, there are a few matters, which are crucial to consider before you start. Benefits of search engines As you surely know many [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/all-about-search/">All about search</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="574" src="https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-1024x574.jpg" alt="Trie Data Structure and search engines, elements to consider." class="wp-image-1433" srcset="https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-1024x574.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-300x168.jpg 300w, https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-768x431.jpg 768w, https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-1536x861.jpg 1536w, https://tantusdata.com/app/uploads/2023/03/All-you-need-is-search-2048x1148.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Sit back and relax as our experts tell you all there is to know about any search project. </p>



<p>Whether you intend to set up a search engine or improve the one you have, there are a few matters that are crucial to consider before you start.</p>



<h2 class="wp-block-heading">Benefits of search engines</h2>



<p>As you surely know, many websites and e-commerce sites benefit greatly from well-functioning search engines. But how do they do that? Why do others fail to deliver such ROI?</p>



<p>Are there other needs than simply letting consumers find the products they are looking for faster?</p>



<p>What should you prepare before you start, to save money and improve ROI faster?</p>



<p>And does the project really need to be that expensive and long-term, or is there an alternative option?</p>



<p>Sounds interesting? Make sure to check out the video prepared by our top search experts, Maryna and Jakub.</p>



<p><a href="https://youtu.be/XrBMH-cqrRc">Take me to the video!</a></p>
<p>The post <a href="https://tantusdata.com/insights/all-about-search/">All about search</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TantusData, the game changer by Clutch</title>
		<link>https://tantusdata.com/insights/tantusdata-a-leading-service-providers-clutch/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Wed, 22 Mar 2023 14:31:21 +0000</pubDate>
				<category><![CDATA[Clutch]]></category>
		<category><![CDATA[reviews]]></category>
		<guid isPermaLink="false">http://tantusdata.local/?post_type=insights&#038;p=568</guid>

					<description><![CDATA[<p>Clutch has recently brought to our attention that we are listed as a leading service provider on Clutch; naming us as one of Poland-headquartered industry game-changer big data analytics firms. For those few who haven’t heard of Clutch yet, it is a B2B ratings and reviews platform based in Washington, DC. The company evaluates technology [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/tantusdata-a-leading-service-providers-clutch/">TantusData, the game changer by Clutch</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="438" src="https://tantusdata.com/app/uploads/2023/01/TantusData_header_clutch-1024x438.jpg" alt="TantusData as the leading service provider" class="wp-image-1275" srcset="https://tantusdata.com/app/uploads/2023/01/TantusData_header_clutch-1024x438.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/TantusData_header_clutch-300x128.jpg 300w, https://tantusdata.com/app/uploads/2023/01/TantusData_header_clutch-768x328.jpg 768w, https://tantusdata.com/app/uploads/2023/01/TantusData_header_clutch.jpg 1520w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Clutch has recently brought to our attention that we are <a href="https://clutch.co/pl/it-services/analytics">listed as</a> a leading service provider on its platform, naming us one of the Poland-headquartered, industry game-changing big data analytics firms.</h2>



<p style="font-size:18px">For those few who haven’t heard of Clutch yet, it is a B2B ratings and reviews platform based in Washington, DC. The company evaluates technology service and solutions companies based on the quality of work, thought leadership, and other key highlights from clients’ reviews.</p>



<p style="font-size:18px">We are a data company specialising in big data solutions. We excel at data platforms, data-driven systems, analytics, and advanced machine learning. As such, our services range from data infrastructure and cloud optimisation to ELT and re-engineering mission-critical systems. You can see a selection of our case studies at Clutch in the portfolio section.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/01/TantusData_clutch2-1024x535.jpg" alt="Case study of TantusData on the Clutch profile." class="wp-image-1278" srcset="https://tantusdata.com/app/uploads/2023/01/TantusData_clutch2-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch2-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch2-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Let&#8217;s hear it from the important ones</h2>



<p>Our clients’ honest and thorough reviews on Clutch have a tremendous impact on us, because their feedback helps us improve our services and establish our name in the industry. They also show that we do deliver in a manner that fits our values. Today, we will share some snippets of the reviews we have received so far.</p>



<p><em>“We chose TantusData because of how flexible they were in the way of delivery. It was important for us to have their expert working alongside our internal team full time.” – Principal Engineer, UK’s fastest growing online accommodation marketplaces.</em></p>



<p>This doesn’t come as a surprise. After all, flexibility is at our core. We start by focusing on you, to meet you where and when you need it, and bring a solution that best fits your company and situation. We also guarantee the results: what we promise, we deliver. Finally, from start to finish, your project is taken care of by experts only: those who have long-standing experience in the industry.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="620" src="https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-1024x620.jpg" alt="The reviews on Clutch for TantusData" class="wp-image-1285" srcset="https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-1024x620.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-300x182.jpg 300w, https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-768x465.jpg 768w, https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-1536x930.jpg 1536w, https://tantusdata.com/app/uploads/2023/03/TantusData_Clutch-2048x1240.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p><em>“The onboarding and collaboration was smooth since both technical and communication skills of the team were very good”. – </em>Director of Data &amp; Advanced Analytics, large scale retail company</p>



<p>We speak your language and ensure that the problem is not only solved, but solved in a way and at a pace convenient to you. Whether you have a long-term project, need extra hands on board mid-delivery, or need some crucial last-minute help, we’re quick to respond.</p>



<p><em>“Effective communication and very quick turnaround when unexpected problems arise.” – CTO, 360 Restaurant Management System provider</em></p>






<p><em>“One of the key resources of the company is their unique skill set. Such a mix of data science and engineering coupled with clear understanding of rules of GDPR and ability to tune performance is hard to find on the market.” – Head of Data Science, one of the largest telecom companies in Europe</em></p>



<p>We are a flexible, agile, and unequaled group of experts. This means we can address any situation and provide the best solutions, 100% tailored and effective in the long term. So it’s not about the newest and most popular solutions, but about the ones that will serve you best.</p>



<p><em>“I appreciated that they investigated several solutions and delivered them to us with an explanation. They genuinely cared about the outcomes of the models they built and stayed actively involved throughout split testing on live environments.” – Principal Engineer, UK’s fastest growing online accommodation marketplaces.</em></p>



<p>At the same time, TantusData ensures you get all the knowledge you need straight from our specialized team, so that you can both fully leverage the solution to create value and confidently make decisions regarding data.<br><em>“We could see that they really cared about meeting our needs, but most importantly about the project long term, that it goes well.”</em> – CTA, Depict.ai</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/01/TantusData_clutch1-1024x535.jpg" alt="TantusData portfolio on the Clutch platform" class="wp-image-1280" srcset="https://tantusdata.com/app/uploads/2023/01/TantusData_clutch1-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch1-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch1-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/01/TantusData_clutch1.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Our badges</h2>



<p>The whole team at TantusData is proud to be a recipient of the special Clutch Badges for top providers. Rest assured, we will continue to go the extra mile, no matter the project or business size. We are delighted to see our clients appreciate our hard work. If you’re curious to read more of the reviews, check out our <a href="https://clutch.co/profile/tantusdata#summary">Clutch profile</a>. There you can also view selected projects from our portfolio.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/03/TantusData_clutch_nagrody-1024x535.jpg" alt="Clutch badges TantusData has received." class="wp-image-1282" srcset="https://tantusdata.com/app/uploads/2023/03/TantusData_clutch_nagrody-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/TantusData_clutch_nagrody-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/03/TantusData_clutch_nagrody-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/03/TantusData_clutch_nagrody.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We help our clients get up to speed with taking advantage of their data. <a href="https://tantusdata.com/#contact">Contact us</a> now!</p>
<p>The post <a href="https://tantusdata.com/insights/tantusdata-a-leading-service-providers-clutch/">TantusData, the game changer by Clutch</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to waste money in the cloud</title>
		<link>https://tantusdata.com/insights/how-to-waste-money-in-the-cloud/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Thu, 03 Feb 2022 09:48:50 +0000</pubDate>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[cost optimisation]]></category>
		<guid isPermaLink="false">http://tantusdata.local/?post_type=insights&#038;p=569</guid>

					<description><![CDATA[<p>Expense optimization is often the main reason for migrating from on-premise to the cloud. The combination of pay-as-you-go and flexible provisioning reduces the problem of overestimated and overprovisioned compute resources. However, in order to actually reduce infrastructure bills, one has to fully understand the cloud pricing model. Otherwise, invoice total may be a huge surprise. [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/how-to-waste-money-in-the-cloud/">How to waste money in the cloud</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-1024x576.jpg" alt="data in the cloud" class="wp-image-1401" srcset="https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-1536x863.jpg 1536w, https://tantusdata.com/app/uploads/2023/03/Data_cloud-1-2048x1151.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Expense optimization is often the main reason for migrating from on-premise to the cloud. The combination of pay-as-you-go pricing and flexible provisioning reduces the problem of overestimated and overprovisioned compute resources. However, in order to actually reduce infrastructure bills, one has to fully understand the cloud pricing model. Otherwise, the invoice total may come as a huge surprise.</p>



<p>The AWS cost is driven by four metrics:</p>



<ul class="wp-block-list">
<li>compute power &amp; memory,</li>



<li>storage,</li>



<li>processing time,</li>



<li>data transfer out.</li>
</ul>



<p>However, those are just general guidelines. Using the AWS cost calculator and reading through the pricing tables should not be skipped. Both scenarios below are real-world examples of what happened when the cloud was used without estimating the costs first.</p>



<h2 class="wp-block-heading">S3 endpoint – public or private?</h2>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/05/Public-or-private-1024x576.jpg" alt="S3 endpoint: the private vs public" class="wp-image-1571" srcset="https://tantusdata.com/app/uploads/2023/05/Public-or-private-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/05/Public-or-private-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/05/Public-or-private-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/05/Public-or-private.jpg 1366w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>A VPC gives one the ability to choose between deploying into a&nbsp;<em>public</em>&nbsp;or&nbsp;<em>private&nbsp;</em>subnet. In the latter case, VMs get access to the internet via a&nbsp;<em>NAT Gateway</em>, which is charged per hour and per amount of data transferred.</p>



<p>Having said that, let’s estimate how much it would cost to send 1&nbsp;PB of data from&nbsp;<em>S3</em>&nbsp;to a private subnet. If one does not use an S3 endpoint and goes with the&nbsp;<em>VM –&gt; NAT Gateway –&gt; public S3 endpoint</em>&nbsp;path, then the NAT Gateway charge for data transfer applies: $0.045 per GB x 1024 x 1024 GB ~= $47k. The alternative, which is using a private endpoint, is free.</p>
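
<p>The estimate above can be reproduced with a few lines of Python. This is only a sketch: the $0.045 per GB rate is the figure quoted in this post, the function and constant names are made up for illustration, and real rates depend on the region and change over time.</p>

```python
# Rough cost of routing S3 traffic through a NAT Gateway.
# Assumption: $0.045 per GB, the rate quoted in the text above.
NAT_PRICE_PER_GB = 0.045
GB_PER_PB = 1024 * 1024


def nat_transfer_cost(petabytes, price_per_gb=NAT_PRICE_PER_GB):
    """Return the NAT Gateway data charge in USD for the given volume."""
    return petabytes * GB_PER_PB * price_per_gb


cost = nat_transfer_cost(1)
print(f"1 PB through the NAT Gateway: ${cost:,.0f}")
```

<p>Plugging in your own monthly volume before the first invoice arrives is usually enough to decide whether a private S3 endpoint is worth setting up.</p>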



<p>For further reading, see the <a href="https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/">AWS blog post on the VPC endpoint for Amazon S3</a>.</p>



<h2 class="wp-block-heading">CloudWatch – paid per metric, not point.</h2>



<p>Even though CloudWatch is one of the most essential services, it usually does not contribute significantly to the AWS bill. It is charged per metric and per data retrieval API call. Even if one has hundreds of VMs with tens of monitored metrics each, the cost will be barely visible among other expenses.</p>



<p>However, it also means that CloudWatch may not be suitable for thousands of sparse metrics. Let’s estimate again: 10k time series, each one producing 256 bytes per minute, add up to only ~153&nbsp;MB per hour (10 000 x 0.000256 x 60). Yet this would cost $3000 per month (10k x $0.3) for CloudWatch metrics alone. For a very basic comparison: one m4.large instance with 200&nbsp;GB of storage costs $93 per month.</p>
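
<p>The same arithmetic as a small Python sketch. The $0.30 per metric per month and the $93 instance price are the figures from this post, not current list prices, and the helper names are hypothetical.</p>

```python
# CloudWatch bills per metric, so many tiny time series get expensive.
# Assumptions: $0.30 per metric per month and a $93/month m4.large,
# both taken from the text above.
PRICE_PER_METRIC_MONTH = 0.30
M4_LARGE_MONTHLY = 93


def hourly_volume_mb(metrics, bytes_per_minute):
    """Data volume produced by all metrics in one hour, in MB."""
    return metrics * bytes_per_minute * 60 / 1_000_000


def monthly_metric_cost(metrics, price=PRICE_PER_METRIC_MONTH):
    """Monthly CloudWatch charge in USD for custom metrics only."""
    return metrics * price


volume = hourly_volume_mb(10_000, 256)
cost = monthly_metric_cost(10_000)
print(f"volume: {volume:.1f} MB/hour, cost: ${cost:,.0f}/month")
print(f"equivalent m4.large instances: {cost / M4_LARGE_MONTHLY:.0f}")
```

<p>The data volume is tiny, but the bill is driven by the number of metrics, which is why the two numbers diverge so much.</p>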



<h2 class="wp-block-heading">A word of advice</h2>



<p>Both of the above situations resulted in a visible spike in the cloud spend. Luckily, in both situations budget monitoring and cost reporting were already set up. This made it possible to react quickly, investigate the bill and identify the troublesome services.</p>



<p>These examples show that running a fully cloud-based solution is a little different from running an on-premise one. However, estimating the expected cost of the whole application, paired with monitoring cloud spend (day-to-day usage), greatly helps to keep expenses reasonable.</p>
<p>The post <a href="https://tantusdata.com/insights/how-to-waste-money-in-the-cloud/">How to waste money in the cloud</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Spark shuffle – Case #3 – using salt in repartition</title>
		<link>https://tantusdata.com/insights/spark-shuffle-case-3-using-salt-in-repartition/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Mon, 02 Nov 2020 10:37:00 +0000</pubDate>
				<category><![CDATA[repartition]]></category>
		<category><![CDATA[SPARK]]></category>
		<guid isPermaLink="false">https://tantusdata.com/?post_type=insights&#038;p=1396</guid>

					<description><![CDATA[<p>Why use salt in repartition? In the previous blog entry we saw how a skew in a processed dataset is affecting performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such columns in our [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/spark-shuffle-case-3-using-salt-in-repartition/">Spark shuffle – Case #3 – using salt in repartition</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/03/article-7-header-1024x576.jpg" alt="use case big data" class="wp-image-1407" srcset="https://tantusdata.com/app/uploads/2023/03/article-7-header-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/article-7-header-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/03/article-7-header-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/03/article-7-header-1536x863.jpg 1536w, https://tantusdata.com/app/uploads/2023/03/article-7-header-2048x1151.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p style="font-size:18px">Why use salt in repartition? In the previous blog entry we saw how skew in a processed dataset affects the performance of Spark jobs. We resolved the problem by repartitioning the dataset by a column which naturally splits the data into reasonably sized chunks. But what if we don’t have such a column in our dataset? Or&nbsp;what if you would like&nbsp;to control the number of files (and their size) produced by your Spark job? For the sake of this exercise let’s assume that we have 1.1&nbsp;TB of data: 7 days of events. This means roughly 165&nbsp;GB per day.</p>



<p style="font-size:18px">Keep in mind that the&nbsp;<em>hour</em>&nbsp;attribute we used before splits one day of data into 24 files. A quick back-of-the-envelope calculation shows that the file size we can expect is 165/24 ~= 7&nbsp;GB. If you are unlucky, two different hours might end up in the same file, which means a 14&nbsp;GB file. This is problematic because it might require you to have beefy executors. It also limits the parallelism: the maximum number of executors you can effectively throw at the problem is 24×7=168, which would be a problem if your logic required heavy calculations. Because of that, using the&nbsp;<em>hour</em>&nbsp;column and creating 24 files per day is not what you aim for in this case.</p>
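
<p style="font-size:18px">The back-of-the-envelope numbers can be checked with a short Python sketch. The inputs (165&nbsp;GB per day, 24 hourly files, 7 days) come from the example above; the function names are made up for illustration.</p>

```python
# Back-of-the-envelope sizing when repartitioning by the hour column.
# Inputs taken from the example: 165 GB per day, 24 hours, 7 days.
def file_size_gb(day_size_gb, files_per_day):
    """Average size of one output file in GB."""
    return day_size_gb / files_per_day


def max_parallel_tasks(files_per_day, days):
    """Upper bound on tasks that have any data to process."""
    return files_per_day * days


size = file_size_gb(165, 24)
print(f"average file: {size:.1f} GB")
print(f"two hours in one file: {2 * size:.1f} GB")
print(f"max useful parallelism: {max_parallel_tasks(24, 7)}")
```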



<p style="font-size:18px">What can you do if you want to produce more files per output directory, but still control their number?</p>



<p style="font-size:18px">The trick is quite simple – add a column with a random value (salt). The value will be random, but you can control the range of generated values. Then you can just use that column when repartitioning. The range you choose will determine the number of files generated for each directory*. After repartitioning, the data will be organised into Spark partitions in the expected way. As the last step you need to remove the salt column you just introduced – it holds no business value. This is what the code would look like:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="df
      .withColumn(&quot;year&quot;, year(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;month&quot;, month(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;day&quot;, dayofmonth(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;salt&quot;, (rand() * 540).cast(&quot;int&quot;))
      .repartition(col(&quot;year&quot;), col(&quot;month&quot;), col(&quot;day&quot;), col(&quot;salt&quot;))
      .drop(&quot;salt&quot;)
      .write
      .partitionBy(&quot;year&quot;, &quot;month&quot;, &quot;day&quot;)
      .save(output)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">df</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">year</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">month</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">dayofmonth</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> (</span><span style="color: #88C0D0">rand</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">540</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">cast</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">int</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">repartition</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">drop</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">write</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">partitionBy</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">save</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">output</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p></p>



<p>I want to stress this part:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code=".withColumn(&quot;salt&quot;, (rand() * 540).cast(&quot;int&quot;))
      .repartition(col(&quot;year&quot;), col(&quot;month&quot;), col(&quot;day&quot;), col(&quot;salt&quot;))
      .drop(&quot;salt&quot;)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> (</span><span style="color: #88C0D0">rand</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">540</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">cast</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">int</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">repartition</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">drop</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">salt</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p></p>



<p>&#8230; which generates a new ‘<em>salt</em>’ column holding a random integer in the range [0, 539], repartitions using the salt and finally removes the column. The salt is only needed to spread each day of data across up to 540 separate chunks instead of having a single partition process the entire day (which is ~165&nbsp;GB in our case).</p>
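
<p>To get a feeling for why a random salt spreads the data evenly, here is a small plain-Python simulation. It is only an illustration and does not use Spark: it draws a salt for each row in the same way (a random number in [0, 1) multiplied by the range and truncated to an integer) and counts how many rows land on each salt value. The function name, the row count and the seed are arbitrary choices.</p>

```python
import random
from collections import Counter


def salt_counts(num_rows, salt_range, seed=0):
    """Count rows per salt value when salts are drawn uniformly."""
    rng = random.Random(seed)
    # int(random * range) yields integers from 0 to range - 1.
    return Counter(int(rng.random() * salt_range) for _ in range(num_rows))


counts = salt_counts(540_000, 540)
print("distinct salt values:", len(counts))
print("smallest bucket:", min(counts.values()))
print("largest bucket:", max(counts.values()))
```

<p>With enough rows every salt value receives a similar share, which is what keeps the resulting chunks close to each other in size.</p>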



<p>Salting prevents Spark from creating a single file per partition, which could be too large and could lead to OOM and other errors. By salting you get much better control over the number of files (and therefore their size) per output directory. When you run the job, this is what you see in the Spark UI:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="816" src="https://tantusdata.com/app/uploads/2023/07/table.1.TD_-1024x816.jpg" alt="" class="wp-image-1701" srcset="https://tantusdata.com/app/uploads/2023/07/table.1.TD_-1024x816.jpg 1024w, https://tantusdata.com/app/uploads/2023/07/table.1.TD_-300x239.jpg 300w, https://tantusdata.com/app/uploads/2023/07/table.1.TD_-768x612.jpg 768w, https://tantusdata.com/app/uploads/2023/07/table.1.TD_-1536x1224.jpg 1536w, https://tantusdata.com/app/uploads/2023/07/table.1.TD_.jpg 1775w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>You can observe that most of the tasks are processing a few hundred megabytes. You don’t have very heavy tasks and you don’t have many tasks which are doing nothing – exactly what you want to achieve. You can also confirm this observation by looking at the files written. There are hundreds of them and each is a few hundred megabytes in size.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="303.4 M  910.2 M  /user/marcin/result/year=2018/month=3/day=1/part-00013-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.4 M  910.1 M  /user/marcin/result/year=2018/month=3/day=1/part-00019-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.6 M  910.8 M  /user/marcin/result/year=2018/month=3/day=1/part-00023-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
605.6 M  1.8 G    /user/marcin/result/year=2018/month=3/day=1/part-00031-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
606.2 M  1.8 G    /user/marcin/result/year=2018/month=3/day=1/part-00043-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.4 M  910.1 M  /user/marcin/result/year=2018/month=3/day=1/part-00044-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.6 M  910.9 M  /user/marcin/result/year=2018/month=3/day=1/part-00049-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.2 M  909.6 M  /user/marcin/result/year=2018/month=3/day=1/part-00055-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.7 M  911.0 M  /user/marcin/result/year=2018/month=3/day=1/part-00062-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.9 M  911.7 M  /user/marcin/result/year=2018/month=3/day=1/part-00066-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.7 M  911.2 M  /user/marcin/result/year=2018/month=3/day=1/part-00071-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.5 M  910.5 M  /user/marcin/result/year=2018/month=3/day=1/part-00073-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.2 M  909.7 M  /user/marcin/result/year=2018/month=3/day=1/part-00078-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.4 M  910.1 M  /user/marcin/result/year=2018/month=3/day=1/part-00079-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.7 M  911.2 M  /user/marcin/result/year=2018/month=3/day=1/part-00082-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.6 M  910.9 M  /user/marcin/result/year=2018/month=3/day=1/part-00083-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.6 M  910.7 M  /user/marcin/result/year=2018/month=3/day=1/part-00085-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.8 M  911.4 M  /user/marcin/result/year=2018/month=3/day=1/part-00086-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.7 M  911.1 M  /user/marcin/result/year=2018/month=3/day=1/part-00090-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet
303.3 M  910.0 M  /user/marcin/result/year=2018/month=3/day=1/part-00096-bcbb5461-1df0-46a1-abe9-c11382f6a94c.snappy.parquet" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #B48EAD">303.4</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span 
style="color: #81A1C1">-</span><span style="color: #B48EAD">00013</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.4</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.1</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00019</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.8</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00023</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">605.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">1.8</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">G</span><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00031</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">606.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">1.8</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">G</span><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00043</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.4</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.1</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00044</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.9</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00049</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">909.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00055</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.0</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00062</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.9</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00066</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00071</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.5</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.5</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00073</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">909.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00078</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.4</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.1</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00079</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.2</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00082</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.9</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00083</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.6</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00085</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.8</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.4</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00086</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.7</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">911.1</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00090</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span>
<span class="line"><span style="color: #B48EAD">303.3</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #B48EAD">910.0</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">M</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">user</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">marcin</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">result</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">year</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">2018</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">month</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">3</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">day</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">1</span><span style="color: #81A1C1">/</span><span style="color: #D8DEE9">part</span><span style="color: #81A1C1">-</span><span style="color: #B48EAD">00096</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">bcbb5461</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">1</span><span style="color: #D8DEE9">df0</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9FF">46</span><span style="color: #D8DEE9">a1</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">abe9</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">c11382f6a94c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">snappy</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">parquet</span></span></code></pre></div>






<p>To conclude:</p>



<p style="font-size:18px">Firstly, Spark keeps repartitioned data in the same task if the values of the columns used for repartitioning are the same. Secondly, if the amount of data in a single task is too large for your use case, you can repartition by more columns in the&nbsp;<em>repartition</em>&nbsp;call. Lastly, you can use a salt to artificially split the data into more tasks and keep better control over the size of the data processed per task.</p>
<p>The post <a href="https://tantusdata.com/insights/spark-shuffle-case-3-using-salt-in-repartition/">Spark shuffle – Case #3 – using salt in repartition</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Spark shuffle – Case #2 – repartitioning skewed data</title>
		<link>https://tantusdata.com/insights/spark-shuffle-case-2-repartitioning-skewed-data/</link>
		
		<dc:creator><![CDATA[TantusData]]></dc:creator>
		<pubDate>Mon, 15 Oct 2018 08:48:00 +0000</pubDate>
				<category><![CDATA[repartition]]></category>
		<category><![CDATA[shuffle]]></category>
		<category><![CDATA[skew]]></category>
		<category><![CDATA[SPARK]]></category>
		<guid isPermaLink="false">http://tantusdata.local/?post_type=insights&#038;p=567</guid>

					<description><![CDATA[<p>In the&#160;previous blog entry&#160;we reviewed a Spark scenario where calling the&#160;partitionBy&#160;method resulted in each task creating as many files as you had days of events in your dataset (which was too much and caused problems). We fixed that by calling the&#160;repartition&#160;method. But will repartitioning your dataset always be enough? And is repartitioning always a good [&#8230;]</p>
<p>The post <a href="https://tantusdata.com/insights/spark-shuffle-case-2-repartitioning-skewed-data/">Spark shuffle – Case #2 – repartitioning skewed data</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://tantusdata.com/app/uploads/2023/03/article-8-header-1024x576.jpg" alt="Data repartitioning" class="wp-image-1387" srcset="https://tantusdata.com/app/uploads/2023/03/article-8-header-1024x576.jpg 1024w, https://tantusdata.com/app/uploads/2023/03/article-8-header-300x169.jpg 300w, https://tantusdata.com/app/uploads/2023/03/article-8-header-768x432.jpg 768w, https://tantusdata.com/app/uploads/2023/03/article-8-header-1536x863.jpg 1536w, https://tantusdata.com/app/uploads/2023/03/article-8-header-2048x1151.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p style="font-size:18px">In the&nbsp;<a href="http://tantusdata.com/spark-shuffle-case-1-partition-by-and-repartition/">previous blog entry</a>&nbsp;we reviewed a Spark scenario where calling the&nbsp;<em>partitionBy</em>&nbsp;method resulted in each task creating as many files as you had days of events in your dataset (which was too much and caused problems). We fixed that by calling the&nbsp;<em>repartition</em>&nbsp;method.</p>



<p style="font-size:18px">But will repartitioning your dataset always be enough? And is repartitioning always a good idea? Let’s assume you run the code from the previous blog entry (because it worked before, right?):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="302" src="https://tantusdata.com/app/uploads/2023/01/hgdgjh-1024x302.jpg" alt="data code" class="wp-image-565" srcset="https://tantusdata.com/app/uploads/2023/01/hgdgjh-1024x302.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/hgdgjh-300x88.jpg 300w, https://tantusdata.com/app/uploads/2023/01/hgdgjh-768x226.jpg 768w, https://tantusdata.com/app/uploads/2023/01/hgdgjh-1536x453.jpg 1536w, https://tantusdata.com/app/uploads/2023/01/hgdgjh.jpg 1724w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Let’s say our DataFrame (df) is read from HDFS. The data has the following characteristics:</p>



<ul class="wp-block-list">
<li>It’s stored in 2000 blocks, 128 MB each</li>



<li>It contains events for just for the past week.</li>
</ul>



<p>So there are 2000 * 128 MB = 256 GB of data, which makes 256 / 7 ≈ 36 GB per day. If we put these numbers on the shuffle diagram, it would look like this:</p>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-1024x816.jpg" alt="" class="wp-image-1699" style="width:1024px;height:816px" width="1024" height="816" srcset="https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-1024x816.jpg 1024w, https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-300x239.jpg 300w, https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-768x612.jpg 768w, https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-1536x1224.jpg 1536w, https://tantusdata.com/app/uploads/2023/07/graph.spark_.shuffle-2048x1632.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Notice that because you asked Spark to repartition the data by day, an entire day ends up being processed by a single task. That means there will be tasks processing at least 36 GB of data (or more if you are unlucky and one task gets two days). It also means that there will be tasks doing nothing!</p>
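
<p>As a rough illustration of why so few tasks do any work, the following plain-Scala sketch (no Spark required) assigns seven day keys to shuffle partitions. It assumes 200 shuffle partitions, which is the default value of <em>spark.sql.shuffle.partitions</em>, and it uses MurmurHash3 from the Scala standard library as a stand-in for Spark’s internal hash, so the exact partition numbers will differ from a real job. The names <em>DaySkewSketch</em> and <em>partitionFor</em> are hypothetical.</p>

<pre class="wp-block-code"><code>import scala.util.hashing.MurmurHash3

object DaySkewSketch {
  // Assumption: 200 shuffle partitions (the default of spark.sql.shuffle.partitions).
  val numPartitions = 200

  // Stand-in for hash partitioning: hash the key, then take a non-negative modulo.
  // Spark uses its own Murmur3-based hash, so real partition ids will differ.
  def partitionFor(key: String, partitions: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key), partitions)

  // One repartition key per day of the week covered by the dataset (made-up labels).
  val dayKeys: Seq[String] = (1 to 7).map("day-" + _)

  // Partitions that receive at least one day; every other partition stays empty.
  val busyPartitions: Set[Int] = dayKeys.map(partitionFor(_, numPartitions)).toSet
  val idlePartitions: Int = numPartitions - busyPartitions.size

  // Dataset size from the post: 2000 blocks of 128 MB, spread over 7 days.
  val totalGb: Double = 2000 * 128 / 1000.0
  val gbPerDay: Double = totalGb / 7

  def main(args: Array[String]): Unit = {
    println(s"total: $totalGb GB, per day: $gbPerDay GB")
    println(s"busy tasks: ${busyPartitions.size}, idle tasks: $idlePartitions")
  }
}</code></pre>

<p>Whatever hash function is used, seven distinct keys can occupy at most seven partitions, so under these assumptions at least 193 of the 200 tasks receive no data at all.</p>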



<h2 class="wp-block-heading">Remember</h2>



<p>Each Spark task runs inside an executor JVM, and most likely your executor JVM will have less than 36 GB of memory (let&#8217;s ignore the size differences that depend on how the data is represented: in memory vs. on disk, serialization, etc.).</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/01/article-8-picture-2-1024x535.jpg" alt="executor JVM" class="wp-image-1391" srcset="https://tantusdata.com/app/uploads/2023/01/article-8-picture-2-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-2-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-2-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-2.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Why do we have just a few tasks that are so heavy?</h2>



<p>Previously, by repartitioning the dataset we wanted to ensure that we write just a single file per day. Limiting the number of files in HDFS is a nice thing to do. But in this case creating just a single file per day means that a single task tries hard to write tens of gigabytes of data at once. Does that mean it will fail? It’s likely, but not guaranteed. It all depends on how your executors are configured and which operations you are performing. You will probably see messages like these:</p>



<pre class="wp-block-code"><code>INFO collection.ExternalSorter: Thread 58 spilling in-memory batch of 6511 B to disk (33 spills so far)</code></pre>






<p>What does that mean? To avoid memory problems (the task can’t keep everything in memory because it’s too large), a Spark task has to spill data to disk before actually writing it to HDFS. The essential problem in this scenario is the extra writing to and reading from disk. This translates to ‘being slow’ in a Data Engineer’s language.</p>



<h3 class="wp-block-heading">How to fix the skew?</h3>



<p>It is great that we are trying to avoid writing too many small files. But that does not mean that we should try too hard. We don’t need to write 36 GB into a single file. Having files larger than 1 GB is usually good enough. So how do you make sure you split the load? One way is to choose an extra column to repartition by – a column like ‘<em>hour</em>’.&nbsp;Its cardinality is very low&nbsp;(24 distinct values) and it naturally splits each day into 24 buckets.&nbsp;Let’s run code like this:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" data-code="df
      .withColumn(&quot;year&quot;, year(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;month&quot;, month(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;day&quot;, dayofmonth(col(&quot;eventTimestamp&quot;)))
      .withColumn(&quot;hour&quot;, hour(col(&quot;eventTimestamp&quot;)))
      .repartition(col(&quot;year&quot;), col(&quot;month&quot;), col(&quot;day&quot;), col(&quot;hour&quot;))
      .write
      .partitionBy(&quot;year&quot;, &quot;month&quot;, &quot;day&quot;)
      .save(output)" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">df</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">year</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">month</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">dayofmonth</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">withColumn</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hour</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">hour</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">eventTimestamp</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">repartition</span><span style="color: #D8DEE9FF">(</span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">col</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hour</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">write</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">partitionBy</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">year</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">month</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">day</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">save</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">output</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>






<p>When we run it, the Spark UI looks like this:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="535" src="https://tantusdata.com/app/uploads/2023/01/article-8-picture-3-1024x535.jpg" alt="Spark UI" class="wp-image-1393" srcset="https://tantusdata.com/app/uploads/2023/01/article-8-picture-3-1024x535.jpg 1024w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-3-300x157.jpg 300w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-3-768x401.jpg 768w, https://tantusdata.com/app/uploads/2023/01/article-8-picture-3.jpg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">Why is it easier for Spark to rewrite these files?</h2>



<p>It’s quite simple. Repartition is an instruction that tells Spark to process rows with the same values of the given columns together. When we repartition by day we process an entire day in a single task. However, when we repartition by ‘<em>day</em>’ and ‘<em>hour</em>’ we process a single hour of a day in a single task, which means there is significantly less data to handle.</p>



<p>How many files will you have? This time you have up to 7*24=168 files, each about 1.5 GB on average (256 GB / 168), which is not bad. You probably won’t get the same number of events every hour but let’s not talk about it right now. As you can see, repartitioning can hurt performance if you misuse it.&nbsp;On the other hand, if you use it correctly it can help you achieve a more balanced data distribution between tasks. Moreover, it also allows you to better control the size of the files you write.</p>
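
<p>The arithmetic behind these numbers can be checked with a few lines of plain Scala. This is only a sketch: it assumes events are spread evenly across hours, which real data rarely is, and the names <em>HourlyFilesSketch</em> and <em>avgGbPerKey</em> are hypothetical.</p>

<pre class="wp-block-code"><code>object HourlyFilesSketch {
  // Figures from the post: 2000 blocks of 128 MB, covering 7 days.
  val totalGb: Double = 2000 * 128 / 1000.0

  // Repartitioning by day alone gives 7 distinct keys; adding hour gives 7 * 24.
  val keysByDay: Int = 7
  val keysByDayAndHour: Int = 7 * 24

  // Average data per key, assuming events are spread evenly (real data rarely is).
  def avgGbPerKey(keys: Int): Double = totalGb / keys

  def main(args: Array[String]): Unit = {
    println(f"by day:          $keysByDay%3d keys, ${avgGbPerKey(keysByDay)}%.1f GB each")
    println(f"by day and hour: $keysByDayAndHour%3d keys, ${avgGbPerKey(keysByDayAndHour)}%.1f GB each")
  }
}</code></pre>

<p>Note that 168 is an upper bound on the number of files. If two hours of the same day happen to hash to the same task, that task writes them into one larger file.</p>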



<h2 class="wp-block-heading">What about using salting?</h2>



<p>You might not like the idea of repartitioning your data by the ‘hour’ column. Or you may not even have a suitable column in your dataset. What do you do then? The short answer is ‘salting’. The long answer comes in the next blog post, <em><a href="https://tantusdata.com/insights/spark-shuffle-case-3-using-salt-in-repartition/">Case #3</a></em>.</p>
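
<p>To give a feel for the idea before reading Case #3, here is a minimal plain-Scala sketch of what a salt does to a repartition key. It is not the code from that post: the number of buckets (24) and the names <em>SaltSketch</em> and <em>saltedKey</em> are assumptions made purely for illustration.</p>

<pre class="wp-block-code"><code>import scala.util.Random

object SaltSketch {
  // Hypothetical choice: split every day into 24 buckets.
  val saltBuckets = 24

  // The salted key pairs the real key with a random bucket number in [0, saltBuckets).
  def saltedKey(day: String, rng: Random): (String, Int) =
    (day, rng.nextInt(saltBuckets))

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // 10000 events that all belong to the same day.
    val keys = Seq.fill(10000)(saltedKey("day-1", rng))
    // Without salt there is 1 distinct key; with salt there are up to saltBuckets.
    println(s"distinct salted keys: ${keys.distinct.size}")
  }
}</code></pre>

<p>Because the bucket number is random rather than derived from the data, a single day can be spread over several tasks even when the dataset has no natural column such as ‘hour’.</p>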
<p>The post <a href="https://tantusdata.com/insights/spark-shuffle-case-2-repartitioning-skewed-data/">Spark shuffle – Case #2 – repartitioning skewed data</a> appeared first on <a href="https://tantusdata.com">TantusData</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
