Insights - TantusData

Speeding up chatbot with tools

TantusData — Wed, 23 Jul 2025 11:37:50 +0000

OVERVIEW

Large Language Models (LLMs) can now use plug-ins to access extra tools. But they often respond slowly when these tools are used. This hurts the user experience.

In this post, we’ll show a simple trick that speeds up LLM responses. It will help you build more practical and efficient LLM-based solutions.

GOALS

Speed up bot answers.

EXAMPLE USE CASE

Let’s say we’re using OpenAI’s ChatGPT via LangChain for our holiday rental website, where visitors can chat with a bot to ask about our current holiday offers or about general geography. We leave the geography part entirely to the LLM, but showing our deals is something we have to implement on our end. LangChain makes this possible through tools.

Let’s say our chatbot function already looks like this:

def get_chatbot_answer(user_message: str) -> str: ...

and the chat has a tool function that looks like this:

def get_holiday_offers(place: str, month: str) -> str: ...

While tools are very easy to use and a good fit for this case, the time the LLM spends processing tool responses can be huge. Huge enough to discourage some users from using our chatbot.

SOLUTION

We can split our chatbot into two. One does the same as the original. The other decides whether we need the first one at all, or whether the user just wants to see holiday offers. In that second case we can call the tool ourselves and skip the slow, tool-using chatbot entirely.

So, we first create the new, lightweight chatbot, which returns three outputs (we can use structured output for that). It will look like this:

def chatbot_preprocessing(user_message: str) -> tuple[bool, str, str]: ...

It returns a boolean that says whether the user only wants to see holiday offers, followed by the place and the month (we only care about those two if the boolean is true). Now we can change the flow of our chat a bit:

def get_chatbot_answer_with_preprocessing(user_message: str) -> str:
	only_show_offers, place, month = chatbot_preprocessing(user_message)
	if only_show_offers:
		return get_holiday_offers(place, month)
	else:
		return get_chatbot_answer(user_message)

This way, if the user only wants offers (which is probably more than half the time), the chatbot answers very quickly, because the tool result is returned directly instead of being processed by the LLM.
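The routing idea can be sketched end to end without an LLM. The version below is only an illustration under assumptions: the `Preprocessed` dataclass, the injected callables, and the `fake_*` stand-ins are hypothetical names that are not part of the original code. It also adds one guard the flow above does not have: the fast path is taken only when both the place and the month were actually extracted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Preprocessed:
    """Hypothetical shape of the structured output of the routing chatbot."""
    only_show_offers: bool
    place: Optional[str] = None
    month: Optional[str] = None


def answer_with_preprocessing(
    user_message: str,
    preprocess: Callable[[str], Preprocessed],
    get_offers: Callable[[str, str], str],
    full_chatbot: Callable[[str], str],
) -> str:
    decision = preprocess(user_message)
    # Fast path: call the tool directly, but only if both arguments are present.
    if decision.only_show_offers and decision.place and decision.month:
        return get_offers(decision.place, decision.month)
    # Otherwise fall back to the slower, tool-using chatbot.
    return full_chatbot(user_message)


# Stand-ins so the sketch runs without a model; real versions would call the LLM.
def fake_preprocess(message: str) -> Preprocessed:
    if "offers" in message.lower():
        return Preprocessed(True, "Crete", "July")
    return Preprocessed(False)


def fake_get_offers(place: str, month: str) -> str:
    return f"Offers for {place} in {month}"


def fake_full_chatbot(message: str) -> str:
    return "LLM answer"


print(answer_with_preprocessing(
    "Show me offers", fake_preprocess, fake_get_offers, fake_full_chatbot))
```

Passing the three functions in as arguments keeps the routing logic testable on its own, and falling back to the full chatbot on incomplete extraction trades a slower answer for a safer one.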

CONCLUSION

It’s entirely possible to speed up chat conversations not only with faster hardware, but also with a few tricks. This particular trick helps a lot with simple chatbots that are meant to help website users quickly see the available products and services.

The post Speeding up chatbot with tools appeared first on TantusData.

LLMs and LangChain – Getting Started Guide

TantusData — Thu, 17 Apr 2025 14:10:06 +0000

During their workshop at Big Data Conference Europe, Marcin and Bartek guided participants through the essentials of GPT-powered LLM applications. By the end of the session, attendees had a solid grasp of developing with Chains and powerful reasoning Agents, integrating external APIs and context sources, and building solutions that excel in question-answering over documents—all with a focus on accuracy and safety.

What it covered:

Foundations of LLMs: Participants discovered which Large Language Models best suited different needs and how to apply them effectively.
LangChain Essentials: They gained a solid understanding of the LangChain API and learned how Chains and Agents can drive dynamic, intelligent behavior in applications.
Context and Integrations: Attendees explored how to integrate external APIs, pass context between components, and leverage vector databases for advanced features.
Question Answering with Your Own Data: They used embeddings and combined ChatGPT, VectorDb, and LangChain to develop robust question-answering systems.
Reasoning Agents: Participants examined how Agents can utilize real-time tools such as Google Search or Wolfram Alpha for powerful, on-the-fly problem-solving.
Accuracy and Safety: They delved into techniques for self-querying, hallucination checks, and output moderation to ensure reliable application performance.
Tuning and Production: Finally, they got a glimpse into real-world deployment, from optimizing embeddings to managing inference and costs.

Missed the Session? Don’t Worry!

Couldn’t make it to our workshop at Big Data Conference Europe? We’ve got you covered. We can bring the same interactive experience straight to your team.

By the End of This Workshop, You Will:

Gain a strong understanding of ChatGPT-powered LLM applications.
Master the LangChain API to build and orchestrate Chains and Agents for complex decision-making.
Develop real-world applications integrating external APIs and data.
Build reasoning Agents that tackle dynamic problems in real-time.
Implement crucial techniques for application safety and accuracy.
Combine ChatGPT, VectorDb, and LangChain into powerful question-answering systems.
Level up your AI skills and transform your creative ideas into functional, future-ready solutions.

Hands-On Agenda

Introduction
- Overview of various LLMs – what’s good for your use case?
- Introduction to LangChain
Building with Chains and Agents
- LangChain API
- Passing context between components
- Integrations with external APIs and data sources
- Comparing Chains vs. Agents
Question Answering Over Documents
- Introduction to embeddings
- Overview of vector databases
- Converting documents to vectors (plus common gotchas)
- Building a simple application with ChatGPT, VectorDb, and LangChain
Creating Powerful Reasoning Agents
- Dynamic decision-making
- Integrations with tools (e.g., Google Search, Wolfram Alpha)
Summary & Best Practices
- Techniques for improving accuracy and safety: self-querying, hallucination checks, and output moderation
- LLM in production: key challenges and how to tackle them
- Potentials for tuning: embeddings, inference optimization, cost management

Ready for a Custom Workshop?

Looking to tailor this session to your organization’s specific needs? We can help. Contact us to explore how we can design a workshop that unleashes the power of generative AI for your projects. Let’s build the future together!

Vendor lock-in when selecting a Cloud Data Platform architecture.

Marcin Szymaniuk — Wed, 09 Apr 2025 13:13:19 +0000

Migrating Data Platform, data warehouse or data lake?

When deciding to move your data to the cloud, many people focus on costs, expected gains, easier maintenance, or simplified development. Sometimes, lower costs or easier maintenance drive the decision. However, it’s crucial to also consider the potential cost of exiting the cloud. What happens if, at some point, you decide you no longer want to be on a particular cloud platform? This could happen due to rising costs, new technology options, or even legal or political reasons.

Did you know that moving your data platform out of a cloud might be a more expensive project than the original migration into it?

If you plan to migrate your data platform to the cloud, thinking about future migration costs now is a sign of a responsible migration. Understand what the cost will be in terms of dollars, time, and effort if at some point you decide to leave that specific cloud vendor. This means recognizing that exit costs aren’t always obvious and that you’ll need to consider the various aspects of vendor lock-in.

What exactly are the risks associated with vendor lock-in?

Rising long-term costs. If you don’t have an alternative that is easy to move to, you are at risk of becoming a hostage.
Lack of flexibility – even if your solution is great now, you always risk that it will not be developed further in the future.
High transfer fees. Most clouds charge you for transferring data out of them (egress fees). This needs to be calculated when planning a migration away from a specific cloud vendor.
Contractual agreements – vendor lock-in is not only about proprietary technology or data formats. What’s in the contract might be another trap that becomes painful in the future.
Lost optimization opportunities – there is no such thing as a free lunch. Cloud solutions are usually easier to start with, but your engineers might lose the ability to do low-level tuning if you ever need that.
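As a rough illustration of the egress-fee calculation mentioned in the list, here is a minimal sketch. Both the data volume and the flat price per gigabyte are made-up numbers for the example. Real providers typically use tiered pricing, so check your vendor’s current price list before relying on any estimate.

```python
def estimate_egress_cost(data_tb: float, price_per_gb_usd: float) -> float:
    """Return a one-off transfer fee estimate for moving data_tb terabytes out."""
    gigabytes = data_tb * 1000  # decimal terabytes to gigabytes
    return gigabytes * price_per_gb_usd


# Hypothetical inputs: 500 TB of data at a flat 0.05 USD per GB.
cost = estimate_egress_cost(500, 0.05)
print(f"Estimated egress fee: {cost:,.0f} USD")
```

The fee is only one part of the exit cost. Engineering time, parallel running of both platforms, and contract penalties usually need their own line items.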

When planning your cloud migration, don’t overlook the long-term exit costs – ask your provider how compatible the specific technology is with other tools on the market.

What can you do to mitigate the risk?

Evaluate the technology and its alternatives – consider whether there are open-source alternatives to your chosen data warehousing solution, or whether it is compatible with other vendors.
If you decide on proprietary technology, make sure the cost of entry and the expected exit cost are justified by what you gain (usually easier, faster development).
Consider hybrid cloud. It’s more expensive but gives you more flexibility in the long run.
Consider favouring well-known standards and open-source technologies and data formats. A good example is Kubernetes – it’s a technology you can run on your own servers as well as with any major cloud provider.

If you are considering a move to the cloud, ping me on LinkedIn (https://www.linkedin.com/in/marcin-szymaniuk/) for specific calculations.

Summary

The data space is moving fast.

Always ask yourself how likely it is that you will have to migrate again within 5 years.

Always ask yourself what will happen if you can’t use the selected data platform anymore. How much notice would you need to migrate to another solution?

What you need to know before deploying Open Source LLM

Bartek Sadlej — Sat, 19 Oct 2024 13:57:27 +0000

There are a few key questions which need to be thoroughly understood and answered before selecting a large language model to be used for building an application:

License – because you don’t want to end up in a legal trap
Expectations: accuracy, speed and cost tradeoffs
Understanding of benchmarks the model was evaluated on – so you don’t get surprised when evaluating the model with your users on your data
Deployment options – because building a PoC you run on your laptop is often far from production deployment.

License

This sounds easy; open source is open, as the name suggests. Well, not exactly. Ensure that the model you choose can be used as you want. For example, there is a statement in the Llama-2 license:

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

This means that if you start with Llama-2 or its fine-tuned successors and later, at some point in time, decide to switch to a different model, you are not allowed to use your historical data to train the new LLM. Or are you? If you were to modify one line of model code, it is no longer the original Llama Material and so on. In general, AI and code base product regulations are hard to assess and interpret, so it is probably safe to try finding a model with an Apache License or an even more permissive license first, if possible.

Define your expectation: accuracy, speed and cost tradeoffs.

It is tempting to dream big, especially for non-technical people who have seen the recent OpenAI Dev Day with the announcements of GPTs, Assistants and Google Gemini & Lumiere models. But in reality, meeting excessive expectations is challenging and often impossible. Going from 0% to 90% AI automation is difficult but doable; closing the gap between 90% and 100% is exceptionally demanding.

Think about GitHub Copilot. It won’t write a project for you, but its blazingly fast few-line completions, which usually require little adjusting, make engineers far more productive.

Ask the question: does my model need to figure out the nitty-gritty details? Or leave some space for users to interact and fill the missing gaps while creating a valuable product.

Maybe some parts of the pipeline can be postponed and implemented as batch jobs? The cost reductions might be significant in this case since closed model providers don’t offer lower prices for batch requests. In-house LLM allows you to process tasks offline in batches.

The recently released small Open Source LLMs, such as Llama-2 and Mistral, or their further fine-tuned versions, like Zephyr and OpenHermes-2.5, are a perfect match for such a scenario. If you cannot compromise on accuracy, maybe there is a way to algorithmically fix weak spots.

On the other hand, it might be valuable to provide users with a few different model outputs or allow them to iterate and guide the suggestions quickly, as with GitHub Copilot. GPT-4 is powerful, but it would take minutes to call it a few times. Smaller models allow you to do such things. Recent features from Hugging Face and Nvidia can run Llama-v2-13b with an unbelievable speed of 1200 tokens per second.

Understand the benchmarks the model was evaluated on

When choosing the model, you will probably focus on its size, performance, and ‘the vibe’ – whether the model’s responses generally feel good. The performance is most often checked using the results of well-known benchmarks.

What are the weak spots of this approach?

First, the ML labs releasing a model do not always publish the training data, or even precise information about what the model was trained on. It is often the case that we only see that ‘the model was trained on a well-curated corpus of X tokens’. And because those benchmarks are so popular, some of their content may leak into the training set. Not necessarily the whole corpus at once, but, for example, an automatic web crawl can contain conversations from Reddit or X/Twitter feeds about a particular task where people are discussing some parts of the benchmark.

Secondly, keep in mind that, in general, it is hard to benchmark written text automatically.

To spot these weaknesses, it is crucial to understand how each of those benchmarks is created and what it measures.

Let’s see an example question from one of the most popular ones, the MMLU (Massive Multitask Language Understanding):

Question: Glucose is transported into the muscle cell:

Choices:
A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarboxylic acid transporters.

Correct answer: A

And let’s ask ChatGPT:

Good? The expected answer is just “A”, so even though the model gets it right, an automatic evaluation that compares the response with the expected answer would score it as a failure!

We won’t do a deep dive into how the evaluation is actually done; there is an excellent blog post on Hugging Face explaining it in detail. You should just know that it requires reading the raw probabilities of the next tokens and using the model through code, which is different from how you would interact with it through chat in a web UI.
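The scoring problem can be shown with a tiny sketch. The `exact_match_score` function and the chat-style reply below are hypothetical, written only for this illustration. They are not taken from any real benchmark harness or from an actual ChatGPT response.

```python
def exact_match_score(model_output: str, expected: str) -> int:
    """Score 1 only if the output is exactly the expected label, else 0."""
    return int(model_output.strip() == expected)


# Hypothetical verbose reply of the kind a chat model tends to give.
chat_style_reply = (
    "Glucose enters the muscle cell via protein transporters called GLUT4, "
    "so the correct answer is A."
)

print(exact_match_score(chat_style_reply, "A"))  # correct content, scored as a miss
print(exact_match_score("A", "A"))               # bare label, scored as a hit
```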

So, the key takeaway is that while those benchmarks provide us with a general ranking of models’ performance, one should pay close attention to how they are evaluated and whether this form of evaluation is meaningful for their use case.

For example, ChatGPT is a so-called Instruction-Finetuned model tuned to follow user instructions and interact with them. If you put a phrase:

Can you help me with that:
{arbitrary problem description}

It will very likely start its response with:

Certainly! {probably a good solution to your problem}

And if you were to check the token probabilities for options A, B, C, and D from the above-mentioned MMLU example, as is done in one implementation of MMLU, you would get C! Not because the model thinks the completion for “Correct answer:” is C, but because it wants to start with “Certainly!”.
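Here is a minimal sketch of that probability-based scoring. The log-probabilities are invented for the example and do not come from a real model. The `pick_choice` helper is likewise a hypothetical simplification of what an evaluation harness does when it restricts the comparison to the four answer letters.

```python
import math
from typing import Dict, Sequence


def pick_choice(
    next_token_logprobs: Dict[str, float],
    choices: Sequence[str] = ("A", "B", "C", "D"),
) -> str:
    """Return the answer letter with the highest next-token log-probability."""
    return max(choices, key=lambda c: next_token_logprobs.get(c, -math.inf))


# Invented numbers: a base model that continues "Correct answer:" with a letter.
base_model_logprobs = {"A": -0.3, "B": -3.1, "C": -2.8, "D": -3.5}

# Invented numbers: a chat-tuned model that would rather open with "Certainly!",
# which, per the explanation above, pushes probability mass towards "C".
chat_model_logprobs = {"A": -2.3, "B": -6.0, "C": -0.4, "D": -6.5}

print(pick_choice(base_model_logprobs))
print(pick_choice(chat_model_logprobs))
```

With these made-up inputs the same scoring rule picks A for the base model and C for the chat model, although neither result says anything about what the chat model knows about glucose transport.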

Deployment options

Last but not least, let’s talk about inference. When you have chosen and maybe even fine-tuned your model further, it’s time to answer the question of what exactly you want to deploy and where.

The ‘what exactly’ part is essential. To start with, you probably have an X-billion-parameter model in (b)float16. There are two options for improvement here: quantization and pruning.

Quantization converts some 16-bit weights into 8 or 4 bits so you can run the model on a smaller and cheaper GPU. Of course, by doing so, we lose some information and accuracy. It can be done automatically using some general formulas, or you can specify an evaluation dataset to quantize in a way that reduces some metrics the least.

It is important to note that currently, on most hardware, quantization reduces memory usage but also reduces inference speed. Although the model weights are much smaller, some values must be cast back and forth. Still, it allows you to run inference on, or fine-tune, the model on cheaper hardware, or simply on whatever hardware is available, since it might be hard for new players to get access to A100 and H100 clusters.
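The memory side of this trade-off is simple arithmetic: the number of parameters times the bits per weight. The sketch below assumes a 13-billion-parameter model as an example and counts only the weights. Activations, the KV cache, and any per-block quantization metadata are deliberately left out, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Example model size (an assumption for the illustration): 13 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(13e9, bits):.1f} GB")
```

Under these assumptions the weights shrink from 26 GB at 16 bits to 13 GB at 8 bits and 6.5 GB at 4 bits, which decides whether the model fits on a smaller GPU.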

Both ways are available in the HuggingFace library and can be easily applied. Here, you can find the blog post going through their pros and cons and inference speed/memory comparison.

Pruning, on the other hand, works by completely removing some weights from the model.

It is important to remember that the transformer model under the hood does matrix multiplication, so you can’t just remove all entries close to zero and expect the performance to improve, because doing so causes non-sequential memory accesses. A more gentle solution is needed. The PyTorch team has recently published two blog posts about accelerating generative AI, where they go into detail about the available options.
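To make the difference concrete, the sketch below contrasts naive magnitude pruning, which leaves zeros scattered irregularly through the matrix, with a regular pattern that keeps the two largest values in every group of four (often called 2:4 sparsity, which recent NVIDIA GPUs can accelerate). It is a plain-Python illustration with hypothetical helper names and made-up weights, not the method from the blog posts mentioned above.

```python
from typing import List


def magnitude_prune(weights: List[List[float]], threshold: float) -> List[List[float]]:
    """Zero every weight whose magnitude is below the threshold (irregular pattern)."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def prune_2_of_4(row: List[float]) -> List[float]:
    """Keep the 2 largest-magnitude weights in each group of 4, zero the rest."""
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)
        keep = set(by_magnitude[:2])
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.05, 0.4, 0.3, 0.2, -0.7, 0.01]  # made-up weights
print(magnitude_prune([row], threshold=0.25)[0])
print(prune_2_of_4(row))
```

The first function decides position by position, so the number and location of zeros depend entirely on the data. The second always produces exactly two zeros per group of four, which is the kind of regularity hardware can exploit.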

Where to deploy?

Real-time chat applications call for a data centre or on-premise deployment with high availability, but there are some cost-saving techniques if you have offline steps in your data pipeline.

Currently, everyone runs LLM models on either A10 or A100 / H100, but surprisingly, not so many people know that cards from the RTX family are also a good performance choice for such applications.

Unfortunately, NVIDIA knows that, and they put the following statements in their license.

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

But there are companies like vast.ai which offer RTX cards that you can use for offline data processing, although with lower reliability than, for example, AWS EC2 instances. The default availability filter there is set to 90%, while the AWS EC2 Service Level Agreement commits to 99.99%.
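To get a feel for what those two percentages mean, they can be converted into allowed downtime. The sketch below assumes a 30-day month and treats both numbers as plain availability figures. That is a simplification, because a marketplace reliability filter is not a contractual commitment in the way an SLA is.

```python
def max_downtime_hours(availability_percent: float, period_hours: float = 30 * 24) -> float:
    """Hours of downtime allowed in the period at the given availability level."""
    return period_hours * (1 - availability_percent / 100)


for availability in (90.0, 99.99):
    hours = max_downtime_hours(availability)
    print(f"{availability}% availability: up to {hours:.2f} h of downtime per 30 days")
```

Under these assumptions, 90% allows about 72 hours of downtime per month, while 99.99% allows roughly four minutes. That is fine for batch jobs that can be retried, but not for a chat that users expect to be online.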

RDD in Apache Spark

Amadeusz Kosik — Fri, 28 Jun 2024 07:33:40 +0000

Do you want to see the number of partitions? Or the partition size in rows from within the job? Or maybe you just like some low-level hacking? The RDD API is still there and accessible via the .rdd method.

Why is RDD API still around?

Despite the deprecation of the RDD API, the engine of Apache Spark (at least its open-source part; see our article on Photon) still uses RDDs in its internals. The API is also available via the .rdd method of Datasets and, therefore, of DataFrames as well. Keep in mind, though, that the number of actually useful operations not available in the Dataset API is really low. Currently, excluding some low-level or Spark-internals hacking, it boils down to checking the partition count and partition sizes, using the getNumPartitions method or the glom operator, respectively.

Check number of partitions

RDD API still keeps the getNumPartitions method for that use case:

println(inputData.rdd.getNumPartitions)

Check number of rows per partition

The glom function coalesces all rows in each partition into an array. It can be used to check the number of rows per partition. In the case of wide rows, consider using select to limit the number of columns first, because RDDs are not optimized by most of Spark’s mechanisms.

inputData.rdd.glom().map(_.length).collect().foreach(println _)

Datasets and DataFrames

Amadeusz Kosik — Tue, 11 Jun 2024 06:41:52 +0000

With the deprecation of the public use of the good old RDD API, Spark users are left with two options: typed Datasets and untyped DataFrames (which are actually a specific case of Datasets). The API also allows users to freely cast one to the other, e.g. using the .as[T] method to cast an untyped DataFrame to a Dataset[T]. The cast does not change the underlying data, though, and can lead to surprising results if one is not aware of that.

What does .as[T] do?

Let’s start by looking at the source (code):

Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
Since:
1.6.0

The last note is crucial: casting to a Dataset does not change the underlying data. Any columns not present in the T (e.g., columns without a corresponding field in the case class) will not be discarded.

Why bother?

There are a few situations where having extra columns may be surprising and create problems with a job run (or even worse – silently introduce data quality issues):

running union or unionAll transformations on non-aligned data,
calling distinct (it will check for hidden columns’ uniqueness as well),
saving data (will include extra columns).
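The distinct case is easy to reproduce in miniature. The sketch below is a plain-Python model of the behaviour, not Spark code: rows are dictionaries, `distinct` and `project` are hypothetical helpers, and the `ingested_at` column is an invented example of a column that is not part of the domain class.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class Artist:
    id: str
    name: str
    location: str


Row = Dict[str, str]

# Two rows that are the same Artist but differ in a column the class does not have.
rows: List[Row] = [
    {"id": "1", "name": "Nina", "location": "Paris", "ingested_at": "2024-01-01"},
    {"id": "1", "name": "Nina", "location": "Paris", "ingested_at": "2024-01-02"},
]


def distinct(data: List[Row]) -> List[Row]:
    """Deduplicate on all columns that are physically present in the rows."""
    return [dict(items) for items in {tuple(sorted(r.items())) for r in data}]


def project(data: List[Row], cls: type) -> List[Row]:
    """Keep only the columns that correspond to fields of the given dataclass."""
    names = [f.name for f in fields(cls)]
    return [{n: r[n] for n in names} for r in data]


print(len(distinct(rows)))                   # hidden column keeps both rows
print(len(distinct(project(rows, Artist))))  # projecting first removes the duplicate
```

Viewed as Artist records the two rows look identical, yet deduplication still sees two different rows until the extra column is projected away.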

A defensive version of .as[T]

The simplest defensive conversion (meaning one that adjusts the schema to the provided domain class) adds a .select() transformation call:

case class Artist(id: String, name: String, location: String)

def toArtistsDefensive(input: DataFrame): Dataset[Artist] = {
  input
    .select("id", "name", "location")
    .as[Artist]
}

This is a very DRY-unfriendly implementation, as each modification of the Artist class requires searching for all related select instances and updating them. Fortunately, with a bit of reflection, it can be refactored into a generic solution. This generic transformation will trim the Dataset to contain only the expected columns.

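// Assumes `import spark.implicits._` is in scope to provide the Encoder for .as[T].
import org.apache.spark.sql.{DataFrame, Dataset, functions => F}
import scala.reflect.runtime.universe._

def toTDefensive[T <: Product: TypeTag](input: DataFrame): Dataset[T] = {
  // Names of the case class fields, read via reflection.
  val caseClassFields = typeOf[T].members
    .collect { case m: MethodSymbol if m.isCaseAccessor => m.name.toString }
    .toSeq

  // Reversed so the columns follow the declaration order of the fields.
  val columns = caseClassFields
    .map(F.col _)
    .reverse

  // Keep only the expected columns, then cast to the typed Dataset.
  input
    .select(columns: _*)
    .as[T]
}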

Unleashing Innovation: A Glimpse into Our Exciting Event Journey

Magdalena Majka — Tue, 04 Jun 2024 09:52:16 +0000

Whether you’ve joined us in the past or are planning to attend our upcoming events, there’s always something exciting on the horizon. Let’s take a look at where we’ve been and where we’re headed next, so that you can gain valuable insights from attending our speeches, participating in our workshops, or exploring our other content. This season, our focus is on Large Language Models (LLMs) and Apache Spark, offering you a wealth of knowledge and practical skills to enhance your expertise. There’s a lot you can gain from engaging with our content and events, so don’t miss out!

Past Events: Highlights and Memories – Top Learning Places to Keep in Your Calendar for Next Year

Big Data Europe, November 2023

We were honored to host two insightful workshops: “ChatGPT, LLMs, and LangChains” and “Apache Spark Performance Tuning”. The event was a phenomenal success, and it was great meeting everyone there. If you missed it or want to relive the experience, check out our exclusive videos here.

SMG Data Summit, January 23-24, 2024

This two-day event, hosted by Google and organized by SMG Swiss Marketplace Group, was a deep dive into the world of data analytics and innovation. We led two workshops: “ChatGPT, LLMs, and LangChains” and “Deep Dive into AI & ML for CxOs, Managers, and Business Leaders”. It was a fantastic opportunity to enhance our data expertise. More details can be found here.

Warsaw IT Days, April 5-6, 2024

Celebrating its 15th anniversary, this iconic event gathered over 10,000 IT and Data Science enthusiasts. Our speeches, “AI Chats – What Nobody Told You: The Conundrums of Business Integration” and “Optimising Apache Spark and SQL for Improved Performance,” were well-received. Discover more about this event here.

If you missed any of these events, there’s still time to sign up for our upcoming ones or look out for next editions. We also invite you to explore the wealth of content already available on our blog and YouTube channel. Dive into our extensive library of articles, videos, and tutorials to stay updated and inspired.

Conf42 LLM, April 11, 2024

Marcin Szymaniuk will be diving into the complexities of integrating AI like ChatGPT into business frameworks at Conf42 LLMs. Learn how to optimize your resources for successful AI adoption. Don’t miss out – details here.

Big Data Technology Warsaw Summit, April 24-25, 2024

Join us for technical presentations, interactive roundtables, and networking opportunities with over 600 attendees. We’ll be leading a roundtable on “What Product Owners and Managers Should Know About ML and LLMs: The Challenges of Business Integration”. Find out more here.

Data Analytics Meeting, May 17-18, 2024

Our keynote, “A Short History of Data – Where Are We Aiming with AI?” will explore the evolution and future of data. This conference supports student development through discussions and networking. More information is available here.

Infoshare, May 22-23, 2024

The biggest tech and startup event in CEE, bringing together thousands of enthusiasts. We’ll be discussing “Optimising Apache Spark and SQL”. Don’t miss out – sign up here.

Jfokus, May 28, 2024

Join us at the Jfokus Training Camp in Stockholm for a focused one-day workshop on “LLMs and LangChains”. This is a great chance for hands-on learning and networking. Secure your spot here.

Conf42 Machine Learning, May 30, 2024

Marcin will share insights on Apache Spark SQL at this prestigious event. Check out the details here.

Upcoming Events: Join Us and Stay Ahead

Voxxed Days Luxembourg, June 20-21, 2024

We’re excited to deliver an LLM workshop at this developer-focused event. Learn more and register here.

PyCon Estonia, September 5-6, 2024

Join us for a workshop on LLMs at one of the largest Python gatherings in the Nordics. More details can be found here.

SREDAY, September 19-20, 2024

We will be offering an LLM workshop at this in-person conference in London. Stay tuned for more updates here.

Looking Ahead

We have more exciting events coming soon and will be adding them to this article as they are confirmed. Stay tuned for updates and new opportunities to connect, learn, and innovate with us.

Trending Topics: Apache Spark and LLMs

Two key topics that garnered significant interest at our events were Apache Spark and LLMs. Due to this high demand, we have dedicated articles on these subjects that are added to the recommendations below.

Stay tuned for more updates and see you at our next event!

TantusData Recognised as a Clutch Global Leader for Spring 2024

Magdalena Majka — Mon, 27 May 2024 08:00:50 +0000

Clutch has announced its recognition of TantusData as a 2024 Spring Global Award winner for Qlik, Hadoop, Tableau, Big Data Compliance, Fraud, & Risk Management services on Clutch, the leading global marketplace of B2B service providers.

Honourees are selected based on their industry expertise and ability to deliver scores that are calculated based on the client feedback from thousands of reviews published on Clutch. TantusData is honoured to be recognised as a 2024 Spring Clutch Global Award winner. This award is a testament to the excellent client work we have delivered this year as recognised through the voice of our customers in their reviews on Clutch. We’re proud to be recognised as a leader on a global scale. Clutch Global Awards showcases the very best in the B2B services industry worldwide.

“We are incredibly proud to receive the 2024 Spring Clutch Global Award. This recognition is a direct reflection of our team’s hard work and dedication to delivering outstanding results for our clients. It’s an honour to see our efforts acknowledged on such a prestigious platform, and we are motivated to continue setting high standards in the industry.”
Marcin Szymaniuk, CEO of TantusData

Since joining Clutch, TantusData has been delighted to receive multiple recognitions from the platform. Previously, we were listed as a leading service provider on Clutch, being named one of Poland’s industry game-changers in big data analytics. Read more about this recognition here.

“It is a joy to witness the incredible success of leading companies worldwide on our platform, and an even greater joy to recognise these companies as Clutch Global honourees. Their dedication to delivering next-level services to clients has not only bolstered their own success but empowered numerous clients to thrive as well. In recognising this spring’s Clutch Global honourees, we aim to showcase industry leaders and encourage connections for Clutch users seeking tailored services to achieve their goals.”
Sonny Ganguly, Clutch CEO

View our recent work and reviews on our Clutch profile.

If you’re interested in hiring us for your next project, get in touch with us here.

Navigating Big Data Solution Adoption: A Managerial Guide

Magdalena Majka — Tue, 21 May 2024 13:45:29 +0000

A Strategic Overview for Decision Makers

Entering the realm of big data solutions marks a transformative step for any organisation, demanding a blend of strategic foresight, meticulous planning, and a deep dive into technological possibilities. This guide is crafted for the astute manager and decision maker poised at the brink of this significant transition, often navigating without extensive prior knowledge of big data technologies. You’re not just choosing a technological path; you’re setting the course for how data can redefine your organizational landscape. From preparation to engagement, this guide demystifies the journey, enabling you to chart a course that not only aligns with your organisational goals but propels them to new heights.

Laying the Groundwork for Transformation

Deciphering Your Big Data Ambitions

Understanding and articulating your needs and objectives is the cornerstone of a successful big data strategy. This clarity of purpose guides you through the myriad options and solutions available, ensuring that your choices resonate with the unique contours of your organizational goals. Dive into this analysis by considering two primary dimensions:

Addressing Pains, Problems, and Bottlenecks 🛠️

Your first beacon is the existing landscape of challenges that hinder your operational efficiency or strategic growth. Big data solutions offer a powerful lens to not only view these issues in high resolution but to navigate through them with precision. Whether it’s through enhancing decision-making capabilities, streamlining operations, or uncovering insights buried within your data, the aim here is to dismantle these barriers, paving the way for a smoother organizational journey.

Identify Specific Challenges: Begin with a granular analysis of operational pain points, inefficiencies, or data-related hurdles your organization faces.
Blueprint for Resolution: Map out how big data technologies could potentially offer solutions, from predictive analytics to advanced data management systems.

Optimising Potential Use Cases 💡

Beyond the immediate resolution of problems, big data opens a vista of opportunities to enhance, innovate, and lead. This proactive stance looks at not just where you are, but where you could be. It’s about leveraging data to its fullest potential, identifying areas where analytics, machine learning, and data-driven strategies can introduce efficiencies, innovations, and competitive advantages.

Innovation and Improvement: Pinpoint areas within your organization where data analytics could bring about transformative change or offer a competitive edge.
Predictive and Prescriptive Analytics: Explore how utilizing big data for forecasting and strategic advice could streamline operations, enhance customer experience, or optimize resource allocation.

By integrating these two approaches, you create a comprehensive picture of your big data aspirations, capturing both the remedial and the visionary. This dual lens not only addresses where you stand today but positions you to leap forward, turning potential into reality. Remember, the most effective big data solutions often serve a dual purpose: rectifying existing issues while unlocking new opportunities.

Enhance Your Understanding Through Our Complimentary Webinar

We recognise the complexity of navigating big data solutions and are committed to empowering decision-makers. Our complimentary webinars delve into the wide-ranging applications of big data and the essential considerations for decision-makers, without promoting our services. These informative sessions are tailored to your organization’s specific interests and needs, aiming to bolster your big data strategy. Contact us to arrange a webinar designed to empower your team with critical insights.

Exclusive Video Resources

In addition to our webinars, we offer concise, impactful video guides on crucial topics:

Mastering Machine Learning Projects: Explore the essentials of leading Machine Learning (ML) projects, designed for managers with minimal data backgrounds. This guide simplifies ML project management, covering team roles, goal setting, data quality, and more. Watch the video.
Understanding Data Quality: Dive into the significance of data quality, a critical factor in the success of any project. This Q&A video addresses common questions, highlighting the risks, benefits, and responsibilities associated with data quality. Explore the video.

Assessing Your Current Infrastructure

Understanding your current technological landscape is crucial for identifying the need for big data solutions. This involves a thorough inventory and assessment of your existing data systems, technologies, and infrastructure. Here’s how to approach this step effectively:

Conduct an Inventory: Engage with your IT department or technology team to catalogue all current data systems, software, and technologies in use. If your organization lacks a dedicated IT team, consider consulting with external IT professionals or technology auditors.
Identify Limitations: Work closely with your IT team to highlight any challenges or limitations in your current setup, including data processing capabilities, storage issues, or scalability concerns. This step is vital for understanding the gaps that big data solutions can fill.
Consult Various Departments: Different departments may utilize distinct systems or have unique insights into the limitations of the current infrastructure. Be sure to gather input from across the organisation for a holistic view.

Establishing a Budget Framework

Setting a realistic budget for big data solutions requires an understanding of both your financial capacity and the market costs of these technologies. Here’s how to navigate this:

Assess Financial Capabilities: Review your organization’s financial resources, considering both current budgets and potential future investments. Collaboration with your finance department is key to accurately determining your spending capacity.
Research Market Costs: Given the variability and complexity of big data solution pricing, gathering reliable cost information can be challenging. To navigate this, consider:
- Consulting with Industry Peers: Reach out to your professional network or industry groups to gain insights into the costs they have encountered.
- Leverage Expert Guides: To receive our comprehensive budget guide detailing price ranges for various big data projects, simply send us an email with the subject ‘Budget Guide’. This indispensable resource is tailored to assist you in setting realistic expectations aligned with current market standards. By requesting this guide, you’ll gain insight into budgeting for your big data initiatives and will be automatically subscribed to our newsletter, keeping you informed on the latest trends and insights in big data solutions. Please note that subscribing to our newsletter is part of this process, offering you ongoing value beyond the guide.

Understanding Your Data

A deep understanding of your organization’s data is fundamental to selecting the right big data solution. This involves:

Evaluating Data Characteristics: Work with your data management team or data scientists to assess the volume, variety, velocity, and veracity of the data your organization handles. If such expertise is not available in-house, consider engaging with data consultants.
Considering Compliance and Privacy: Ensure you are aware of any industry-specific data privacy or regulatory compliance requirements. This might involve consulting with your legal department or external legal advisors specializing in data protection laws.

Engaging Your Stakeholders

Incorporating stakeholder input is crucial for ensuring alignment and support for big data initiatives. This step should be integrated with identifying your needs and objectives to ensure a cohesive strategy:

Gather Diverse Insights: Engage with stakeholders from various departments to compile a comprehensive list of needs, expectations, and concerns. This can include executives, department heads, and end-users who will interact with the big data solutions.
Facilitate Collaborative Discussions: Organize workshops or meetings that bring together these stakeholders to discuss and refine the objectives and expectations for implementing big data solutions. This collaborative approach ensures that the project aligns with organizational goals and addresses key pain points effectively.

By following these detailed steps, you’ll be well-equipped to prepare for engaging with big data solution providers, ensuring that your organisation’s needs are accurately identified and that you have a solid foundation for making informed decisions.

Initiating Contact with Providers

Engaging with the right big data solution providers is a critical step towards implementing a system that aligns with your organization’s needs. Here’s how to navigate this process effectively:

Research and Shortlist Providers

To find providers that not only offer big data solutions but also understand your industry’s unique challenges, follow these steps:

Utilize Industry Forums and Reviews: Platforms like Clutch are invaluable for identifying reputable providers, offering a wealth of user reviews and ratings. This firsthand feedback can give you insights into the providers’ strengths and weaknesses.
Evaluate Expertise and Scalability: Look for providers that have demonstrated expertise in your industry. Those recognized as experts often contribute to the community by speaking at events, leading workshops, or publishing insightful case studies. This involvement can be a good indicator of their capability to scale solutions to meet your needs.
Explore Provider Portfolios: Delve into the websites of potential providers to gather detailed information about their services, client success stories, and case studies. This exploration can offer a clearer picture of their capabilities and the types of solutions they’ve successfully implemented in the past.
Network for Recommendations: Tap into your professional network for personal insights and recommendations. Firsthand accounts of experiences with big data solution providers can be invaluable in shortlisting candidates.

Reach Out

Once you have a shortlist of potential providers, reaching out effectively is key:

Prepare a Brief: Compile a concise overview of your organization, including its needs, objectives, and any specific challenges you hope to address with big data solutions. This brief should summarize the critical information providers need to understand your project.
Initiate Contact: Use the contact information on providers’ websites to share your brief. This initial outreach is your opportunity to make a strong first impression and lay the groundwork for productive discussions. It’s efficient to send this brief to multiple providers to broaden your options.

Schedule Initial Consultations

Engaging in consultations with several providers allows you to compare and contrast their approaches, expertise, and the solutions they offer:

Consult with Multiple Providers: Aim to schedule initial consultations with 3-5 providers. This range gives you a broad perspective while remaining manageable. Each provider may offer unique insights, benefits, or approaches that could be crucial for your project.
Communicate Your Availability: Be clear about when you’re available for these discussions. Providing a range of dates and times can help expedite scheduling.
Inquire About Scope and Involvement: During scheduling, ask providers to outline what will be covered during the initial consultation. Also, inquire about the estimated timeline for project completion and the level of involvement required from your team. Understanding these factors upfront can help you prepare for the consultations more effectively.

By thoroughly researching potential providers, preparing a clear and concise brief, and strategically scheduling initial consultations, you’re setting the stage for productive engagements that will lead to identifying the best big data solution for your organization. This process ensures that you have all the information needed to make an informed decision, aligning with your strategic objectives and operational needs.

Making Your Selection

After engaging with potential providers and receiving their proposals, the next step is a critical analysis to determine the best fit for your organization. This phase involves careful consideration of how each proposal aligns with your strategic needs, budgetary constraints, and long-term objectives. Here’s a structured approach to making your selection:

Compare Proposals Against Your Criteria

Align with Needs and Objectives: Systematically compare each proposal’s offerings against the specific needs and objectives you’ve identified for your big data project. This includes considering how well each solution addresses your current challenges and facilitates your strategic goals.
Budget Compatibility: Assess each proposal in the context of your budget framework. Consider not only the upfront costs but also any ongoing expenses associated with the solution, such as licensing fees, data storage costs, and additional services.
Strategic Goals Alignment: Evaluate how each provider’s solution aligns with your broader strategic goals. This could involve considerations of innovation potential, market competitiveness, and the ability to adapt to future challenges.

Evaluate Long-term Benefits and Scalability

Scalability: Ensure the proposed solution can grow and adapt as your organization’s needs evolve. This includes the ability to handle increasing volumes of data, integrate new data sources, and expand in functionality.
Support and Maintenance: Consider the level and quality of support and maintenance services offered. Reliable post-implementation support is crucial for resolving any issues that arise and ensuring the longevity of your big data solution.
Long-term Value: Beyond the immediate solution, assess the long-term benefits each provider offers. This could include ongoing updates, training opportunities for your team, and access to additional resources or community support.

Achieve Stakeholder Consensus

Engage Stakeholders: Present the analyzed proposals to your key stakeholders, highlighting how each aligns with the organization’s needs and objectives. This presentation should be detailed and include your recommendations based on the evaluation criteria.
Facilitate Discussion: Encourage open discussion among stakeholders to hear their perspectives, concerns, and preferences. This collaborative approach ensures that the decision-making process is inclusive and considers the insights and needs of different departments or teams.
Consensus Building: Work towards building a consensus on the provider that best matches your organization’s vision, objectives, and budget. This may involve negotiating compromises or revisiting specific criteria to ensure alignment with organizational priorities.

Finalise Your Selection

Once a consensus is achieved, proceed with finalizing your selection of the big data solution provider. This involves reaching out to the chosen provider to begin the negotiation of terms, finalizing the contract, and planning the implementation phase. Ensure that all legal and financial considerations are thoroughly reviewed and agreed upon before finalizing the agreement.

By following this structured approach, you ensure that your organization selects a big data solution provider that not only meets your current needs but also positions you for future success. This careful selection process paves the way for a fruitful partnership and a successful big data initiative.

The Consultation Process

Engaging with potential big data solution providers through consultations is a pivotal part of the selection process. Here’s what you should expect and how to navigate these meetings effectively:

What to Expect:

Initial Meeting: This first encounter is an opportunity to discuss your organization’s specific needs and objectives in detail and to receive an initial overview of the provider’s solutions. Approach this meeting as a chance to see how well the provider understands your challenges and whether they have the expertise to address them.
Follow-Up Meetings: These sessions often delve deeper into the technical aspects of the proposed solutions. Expect to engage in discussions about customizing solutions to fit your needs, presentations from technical specialists, and possibly even demos of the solutions. This phase is crucial for assessing the provider’s capability to tailor their offerings to your requirements.
Proposal Review: After these discussions, providers will present a formal proposal. This document should outline the solution in detail, including implementation timelines, costs, and the framework for how the solution will be integrated into your existing infrastructure.

Key Questions to Ask:

To ensure you gather all the necessary information to make an informed decision, consider asking the following questions:

Industry Experience: “Can you share examples of your work with similar companies or within our industry?”
Solution Alignment: “How do your solutions specifically address our needs and objectives?”
Case Studies and References: “Could you provide case studies or references from past projects similar to ours?”
Implementation Timeline: “What is the expected timeline for the full implementation of your solution?”
Pricing Structure: “Can you provide a detailed explanation of your pricing structure, including any potential additional costs we should anticipate?”
Data Security and Compliance: “How does your solution ensure data security and compliance with relevant regulations?”
Support and Maintenance: “What kind of post-implementation support and maintenance services do you offer?”

Preparing Your Data for Consultation

To effectively engage with big data solution providers, having a clear understanding of your data landscape is essential. Here’s how to prepare:

Data Inventory: Conduct a comprehensive review of the types of data your organization processes. This task is often best managed by your IT department or data management team. If such expertise is not readily available within your organization, consider hiring external data consultants to assist.
- Actionable Steps: Start by listing all data sources, types, and storage locations. Categorize data based on sensitivity, usage frequency, and format. This inventory will be crucial for providers to understand the scope of your data and how their solutions can be applied.
Identify Data Challenges: Clearly outline any issues your organization faces regarding data management, analysis, or security. This step is crucial for finding a solution that addresses these challenges effectively.
- Who to Consult: This information might be spread across various departments, such as IT for technical challenges, legal for compliance issues, and operational teams for workflow-related challenges. Consolidate this information to present a comprehensive view of the data challenges to the provider.
Growth Projections: Anticipate the future scale of your data needs to ensure the proposed solution can accommodate growth. This includes estimating increases in data volume, variety, and velocity.
- Collaboration is Key: Engage with strategic planning or business development teams to forecast future data growth based on business projections. This foresight ensures the solution remains viable as your organization grows.

By thoroughly preparing for the consultation process and clearly understanding your data landscape, you set the stage for productive discussions with potential providers. This preparation enables you to ask informed questions, accurately assess proposals, and ultimately choose a solution that best fits your organisation’s needs.

Conclusion

By following this guide and leveraging the provided resources, you’ll be well-equipped to engage with big data solution providers effectively. The path to a successful partnership and implementation is paved with preparation, open communication, and a deep understanding of your organizational goals. Our webinars and video resources are here to further enhance your knowledge and confidence in navigating the big data landscape.

The post Navigating Big Data Solution Adoption: A Managerial Guide appeared first on TantusData.

Monitoring Airflow jobs with TIG 2: data quality metrics

Amadeusz Kosik — Tue, 30 Apr 2024 13:31:53 +0000

In the first article on Monitoring Airflow jobs with TIG, “System Metrics”, we saw an example Airflow installation with a TIG stack set up to monitor it. To fully utilize this stack, we should enrich the raw system metrics with statistics on the processed data. Without this, the metrics would tell us whether the data pipelines are doing anything, but not whether they are working on the correct data.

What to look for?

What can realistically be monitored is a pretty deep topic without a one-size-fits-all answer. A safe starting point is to look at the size of the data, duplicates (or unique rows), null/missing column values and basic aggregates (count per some enumerated type or min/max values). The nice part is that this is not limited by any software: you can report any numeric value into an InfluxDB database.
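To make the starting point above concrete, here is a minimal sketch in plain Python. The function name, the column names and the sample rows are all hypothetical and exist only for illustration; in a real pipeline the same numbers would more likely come from a query or a Spark job.

```python
from collections import Counter
from typing import Any, Dict, List


def basic_quality_metrics(
    rows: List[Dict[str, Any]],
    enum_column: str,
    numeric_column: str,
) -> Dict[str, Any]:
    """Compute a handful of basic data-quality metrics for a batch of rows."""
    # Size of the data.
    row_count = len(rows)
    # Unique rows: two rows are duplicates when all their columns match.
    unique_rows = len({tuple(sorted(row.items())) for row in rows})
    # Rows with at least one null/missing value.
    null_rows = sum(1 for row in rows if any(v is None for v in row.values()))
    # Basic aggregates: count per enumerated type and min/max of a numeric column.
    per_category = Counter(
        row[enum_column] for row in rows if row[enum_column] is not None
    )
    numbers = [row[numeric_column] for row in rows if row[numeric_column] is not None]
    return {
        "row_count": row_count,
        "unique_rows": unique_rows,
        "null_rows": null_rows,
        "count_per_category": dict(per_category),
        "min_value": min(numbers) if numbers else None,
        "max_value": max(numbers) if numbers else None,
    }


# Hypothetical sample batch: one duplicated row and one row with a null value.
sample_rows = [
    {"id": 1, "country": "PL", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": None, "amount": 7.0},
]
metrics = basic_quality_metrics(sample_rows, "country", "amount")
```

Every value in the resulting dictionary is numeric (or a mapping of numbers), so each can be reported as a separate field or series.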

Equally important is not limiting the monitoring to the output of the whole pipeline only. Being able to check the data volume and basic traits on the input and in intermediate steps is crucial, as it enables one to check, identify and react to problems early on (and avoid painful backtracking and recomputing of the whole pipeline).

An example data metrics dashboard showing row count, unique row count, and null rows for three steps in the imaginary data pipeline: load, process, and export. A local demo that reproduces it is available on our GitHub.

Computing the metrics

Technically speaking, such monitoring requires two things in the pipeline: code (or a job) to compute the metric and a wrapper to send it to the metrics database. We do not cover the former here, as it may vary from a simple SQL query run via Hive / Impala to a side output of a Spark job.

Storing the data for graphs

The second part to be done in Airflow is sending the data to the database. At the time of writing this article, the built-in InfluxDB connector allows only querying the database. Please see our demo (especially the plugins directory) for an example implementation of an InfluxDB write. Alternatively, you can call the InfluxDB REST API directly or use a BashOperator to run the influx command.
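As a rough illustration of the REST API route, the sketch below formats one data point in InfluxDB line protocol and prepares (but does not send) the HTTP request. It is not the implementation from our demo. The helper names, the measurement, tag and database names, and the local URL are assumptions made for this example, and the `/write?db=...` endpoint is the InfluxDB 1.x form; InfluxDB 2.x uses `/api/v2/write` with an organisation, a bucket and a token instead. Only numeric fields are handled, and backslash escaping is left out for brevity.

```python
import urllib.parse
import urllib.request
from typing import Dict, Union

Number = Union[int, float]


def _escape_measurement(value: str) -> str:
    # Measurement names: escape commas and spaces.
    return value.replace(",", r"\,").replace(" ", r"\ ")


def _escape_key(value: str) -> str:
    # Tag keys, tag values and field keys: escape commas, equals signs and spaces.
    return value.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _format_field(value: Number) -> str:
    if isinstance(value, bool):
        raise TypeError("only numeric fields are supported in this sketch")
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix in line protocol
    return repr(float(value))


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, Number],
    timestamp_s: int,
) -> str:
    """Format a single point as: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_s}"


def build_write_request(base_url: str, database: str, line: str) -> urllib.request.Request:
    """Prepare a POST to the 1.x write endpoint; timestamps are in seconds."""
    query = urllib.parse.urlencode({"db": database, "precision": "s"})
    return urllib.request.Request(
        url=f"{base_url}/write?{query}",
        data=line.encode("utf-8"),
        method="POST",
    )


line = to_line_protocol(
    "data_quality",
    {"dag": "sales_pipeline", "step": "load"},
    {"row_count": 4, "null_rows": 1},
    1714483913,
)
request = build_write_request("http://localhost:8086", "airflow_metrics", line)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the write.
```

Tagging each point with the pipeline and step name keeps the metrics of different steps apart in the same measurement, which makes them easy to chart side by side.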

Merging both steps or not?

The compute and send steps may be squashed into a single bash step instead of being scheduled separately and stitched together via XComs. However, the more complicated or time-consuming the calculation, the more attractive the separated approach becomes. This is a decision for you to make; we provide an example of the former approach.
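The trade-off can be sketched with plain Python callables. Note that this swaps the bash step mentioned above for Python functions of the kind one might hand to a PythonOperator, and it does not import Airflow at all: `FakeTaskInstance` is a local stand-in for the task instance, and all names are hypothetical. In Airflow itself, the value returned by a Python callable is stored in XCom, and a downstream task can read it with `ti.xcom_pull(task_ids=...)`.

```python
from typing import Any, Callable, Dict, List

Metrics = Dict[str, int]


def compute_metrics(rows: List[Dict[str, Any]]) -> Metrics:
    """Step 1: compute the metrics. Run as its own task, the returned dict
    would be handed to the next task through XCom."""
    return {
        "row_count": len(rows),
        "null_rows": sum(1 for row in rows if None in row.values()),
    }


def send_metrics(ti: Any, write: Callable[[Metrics], None]) -> None:
    """Step 2: pull the upstream result from XCom and write it out."""
    metrics = ti.xcom_pull(task_ids="compute_metrics")
    write(metrics)


def compute_and_send_metrics(
    rows: List[Dict[str, Any]], write: Callable[[Metrics], None]
) -> None:
    """Merged variant: a single task, no XCom involved."""
    write(compute_metrics(rows))


class FakeTaskInstance:
    """Stand-in for Airflow's task instance, for this local demo only."""

    def __init__(self, xcoms: Dict[str, Any]) -> None:
        self._xcoms = xcoms

    def xcom_pull(self, task_ids: str) -> Any:
        return self._xcoms[task_ids]


rows = [{"id": 1, "country": "PL"}, {"id": 2, "country": None}]
sent: List[Metrics] = []  # collects what would be written to the metrics database

# Separated approach: two tasks stitched together via XCom.
ti = FakeTaskInstance({"compute_metrics": compute_metrics(rows)})
send_metrics(ti, sent.append)

# Merged approach: one task does both.
compute_and_send_metrics(rows, sent.append)
```

Both variants send the same metrics. With separate tasks, a failed write can be retried without recomputing the metric, which matters more as the computation gets longer.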

Summary

After the first step, the example stack has monitoring of the system, and an operator can see whether the system is working and is not overloaded or held back by some kind of bottleneck. This step adds basic monitoring of the data quality. Adding these metrics at multiple points in the data pipeline will also enable verification during the processing, in a centralized place (or, in this case, WebUI). Once again, a development/demo environment is available on our GitHub.

The post Monitoring Airflow jobs with TIG 2: data quality metrics appeared first on TantusData.