LLM Archives - TantusData

Speeding up chatbot with tools

TantusData — Wed, 23 Jul 2025 11:37:50 +0000

OVERVIEW

Large Language Models (LLMs) can now use plug-ins to access extra tools. But they often respond slowly when these tools are used. This hurts the user experience.

In this post, we’ll show a simple trick that speeds up LLM responses. It will help you build more practical and efficient LLM-based solutions.

GOALS

Speed up bot answers.

EXAMPLE USE CASE

Let’s say we’re using OpenAI’s ChatGPT via LangChain on our holiday rental website, where people can chat to ask about the site’s current holiday offers or about general geography. We leave the geography part solely to the LLM, but showing our deals is something we have to implement on our end. LangChain makes this possible through tools.

Let’s say our chatbot function already looks like this:

def get_chatbot_answer(user_message: str) -> str

and the chat has a tool function that looks like this:

def get_holiday_offers(place: str, month: str) -> str

While tools are very easy to use and applicable in this case, the time the LLM spends processing tool responses can be huge. Huge enough to discourage some users from using our chatbot.

SOLUTION

We can split our chatbot into two chatbots. One does the same as the original. The other, much lighter one decides whether we need the original at all or whether the user just wants to get holiday offers. In the latter case we can skip the first, slow chatbot and call the tool ourselves.

So, we first create the lightweight preprocessing chatbot, which returns three outputs (we can use structured output for that). It will look like this:

def chatbot_preprocessing(user_message: str) -> tuple[bool, str, str]

It returns a boolean saying whether the user only wants to see holiday offers, followed by the place and the month (we only care about those two if the boolean is true). Now we can change the flow of our chat a bit:

def get_chatbot_answer_with_preprocessing(user_message: str) -> str:
	only_show_offers, place, month = chatbot_preprocessing(user_message)
	if only_show_offers:
		return get_holiday_offers(place, month)
	else:
		return get_chatbot_answer(user_message)

This way, if the user only wants offers (which is probably more than half the time), the chatbot answers very quickly, because the tool result is returned directly instead of being processed by the LLM.
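The routing idea can be sketched end to end without any LLM dependency. This is a minimal sketch, not the production code: `OfferRequest`, `route_message` and the three `fake_*` functions are hypothetical names introduced here, and the fake preprocessing step stands in for a real structured-output LLM call. The sketch also assumes we want to fall back to the full chatbot whenever the place or month could not be extracted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OfferRequest:
    """Structured output of the preprocessing step (hypothetical schema)."""
    only_show_offers: bool
    place: str = ""
    month: str = ""


def route_message(
    user_message: str,
    preprocess: Callable[[str], OfferRequest],
    get_offers: Callable[[str, str], str],
    full_chatbot: Callable[[str], str],
) -> str:
    """Answer via the direct tool call when possible, else via the full chatbot."""
    request = preprocess(user_message)
    # Only take the fast path when both tool arguments were actually extracted.
    if request.only_show_offers and request.place and request.month:
        return get_offers(request.place, request.month)
    return full_chatbot(user_message)


# Stand-ins so the sketch runs without an LLM; replace them with real calls.
def fake_preprocess(message: str) -> OfferRequest:
    if "offers" in message.lower():
        return OfferRequest(True, place="Crete", month="July")
    return OfferRequest(False)


def fake_get_offers(place: str, month: str) -> str:
    return f"Offers for {place} in {month}"


def fake_full_chatbot(message: str) -> str:
    return "LLM answer"
```

Passing the three steps in as callables keeps the routing logic testable on its own, so the slow LLM calls can be swapped in later without touching it.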

CONCLUSION

It’s entirely possible to speed up chat conversations not only with faster hardware, but also with a few tricks. This particular trick helps a lot with simple chatbots whose job is to let website users quickly see the available products and services.

The post Speeding up chatbot with tools appeared first on TantusData.

What you need to know before deploying Open Source LLM

Bartek Sadlej — Sat, 19 Oct 2024 13:57:27 +0000

There are a few key questions which need to be thoroughly understood and answered before selecting a large language model to be used for building an application:

License – because you don’t want to end up in a legal trap
Expectations: accuracy, speed and cost tradeoffs
Understanding of benchmarks the model was evaluated on – so you don’t get surprised when evaluating the model with your users on your data
Deployment options – because building a PoC you run on your laptop is often far from production deployment.

License

This sounds easy; open source is open, as the name suggests. Well, not exactly. Ensure that the model you choose can be used as you want. For example, there is a statement in the Llama-2 license:

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

This means that if you start with Llama-2 or its fine-tuned successors and later, at some point in time, decide to switch to a different model, you are not allowed to use your historical data to train the new LLM. Or are you? If you were to modify one line of model code, it is no longer the original Llama Material and so on. In general, AI and code base product regulations are hard to assess and interpret, so it is probably safe to try finding a model with an Apache License or an even more permissive license first, if possible.

Define your expectation: accuracy, speed and cost tradeoffs.

It is tempting to dream big, especially for non-technical people who have seen the recent OpenAI Dev Day with the announcements of GPTs, Assistants and Google Gemini & Lumiere models. But in reality, meeting excessive expectations is challenging and often impossible. Going from 0% to 90% AI automation is difficult but doable; closing the gap between 90% and 100% is exceptionally demanding.

Think about GitHub Copilot. It won’t write a project for you, but its blazingly fast few-line completions, which usually require little adjusting, make engineers far more productive.

Ask the question: does my model need to figure out all the nitty-gritty details? Or can it leave some space for users to interact and fill in the gaps while still being a valuable product?

Maybe some parts of the pipeline can be postponed and implemented as batch jobs? The cost reductions might be significant in this case: an in-house LLM lets you process tasks offline in batches, whereas closed model providers have not always offered lower prices for batch requests.

The recently released small open-source LLMs, such as Llama-2 and Mistral, or their further fine-tuned versions, like Zephyr and OpenHermes-2.5, are a perfect match for such a scenario. If you cannot compromise on accuracy, maybe there is a way to algorithmically fix weak spots.

On the other hand, it might be valuable to provide users with a few different model outputs or allow them to iterate and guide the suggestions quickly, as with GitHub Copilot. GPT-4 is powerful, but calling it a few times would take minutes. Smaller models allow you to do such things. Recent features from Hugging Face and Nvidia can run Llama-v2-13b with an unbelievable speed of 1200 tokens per second.

Understand the benchmarks the model was evaluated on

When choosing the model, you will probably focus on its size, performance, and ‘the vibe’ – whether the model’s responses generally feel good. The performance is most often checked using the results of well-known benchmarks.

What are the weak spots of this approach?

First, the ML labs releasing a model do not always publish the training data, or even precise information about what the model was trained on. Often we only see that ‘the model was trained on a well-curated corpus of X tokens’. And because those benchmarks are so popular, some of their content may leak into the training set. Not necessarily the whole corpus, but, for example, an automatic web crawl can contain Reddit conversations or X/Twitter feeds where people discuss parts of the benchmark.

Secondly, keep in mind that, in general, it is hard to benchmark written text automatically.

To spot these weaknesses, it is crucial to understand how each of those benchmarks is created and what it measures.

Let’s see an example question from one of the most popular ones, the MMLU (Massive Multitask Language Understanding):

Question: Glucose is transported into the muscle cell:

Choices:
A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarboxylic acid transporters.

Correct answer: A

And let’s ask ChatGPT:

Good? The expected answer is just “A”, so even though the model gets it right, an automatic evaluation that looks for the bare letter would score a longer, conversational reply as a failure!

We won’t do a deep dive into how the evaluation is actually done; there is an excellent blog post on HuggingFace explaining it in detail. You should just know that it requires reading the raw next-token probabilities and using the model through code, which is different from interacting with it through chat in a web UI.

So, the key takeaway is that while those benchmarks provide us with a general ranking of models’ performance, one should pay close attention to how they are evaluated and whether this form of evaluation is meaningful for their use case.

For example, ChatGPT is a so-called Instruction-Finetuned model tuned to follow user instructions and interact with them. If you put a phrase:

Can you help me with that:
{arbitrary problem description}

It will very likely start its response with:

Certainly! {probably a good solution to your problem}

And if you were to check the token probabilities for options A, B, C, and D from the above-mentioned MMLU example, as is done in one implementation of MMLU, you would get C! Not because the model thinks the completion for “Correct answer” is C, but because it wants to start with “Certainly!”
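To make the scoring mechanism concrete, here is a minimal sketch of probability-based multiple-choice evaluation. The function name `pick_choice` is hypothetical, and the log-probabilities are made-up numbers for illustration only; they are not measured from any model. The sketch only shows how a bias towards starting with “Certainly!” could flip the selected letter.

```python
import math
from typing import Dict, Sequence


def pick_choice(
    next_token_logprobs: Dict[str, float],
    choices: Sequence[str] = ("A", "B", "C", "D"),
) -> str:
    """Return the answer letter with the highest next-token log-probability."""
    return max(choices, key=lambda c: next_token_logprobs.get(c, -math.inf))


# Made-up numbers for illustration only, not measured from any model.
base_model = {"A": -0.4, "B": -2.5, "C": -3.0, "D": -3.2}
# A chat-tuned model that "wants" to begin with "Certainly!" puts mass on "C".
chat_model = {"A": -1.9, "B": -3.1, "C": -0.6, "D": -3.4}

print(pick_choice(base_model))  # A
print(pick_choice(chat_model))  # C
```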

Deployment options

Last but not least, let’s talk about inference. When you have chosen and maybe even fine-tuned your model further, it’s time to answer the question of what exactly you want to deploy and where.

The ‘what exactly’ part is essential. To start with, you probably have an X-billion-parameter model in (b)float16. There are two options for improvement here: quantization and pruning.

Quantization converts some 16-bit weights into 8 or 4 bits so you can run the model on a smaller and cheaper GPU. Of course, by doing so, we lose some information and accuracy. It can be done automatically using some general formulas, or you can specify an evaluation dataset to quantize in a way that degrades your chosen metrics the least.

It is important to note that currently, on most hardware, quantization reduces memory usage but also reduces inference speed. Although the model weights are much smaller, some values must be cast back and forth. But it allows you to run inference on, or fine-tune, the model on cheaper hardware or just the available hardware, since it might be hard for new players to get access to A100 & H100 clusters.

Both ways are available in the HuggingFace library and can be easily applied. Here, you can find the blog post going through their pros and cons and inference speed/memory comparison.
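The core idea of quantization can be illustrated with a toy example. This is a minimal sketch of symmetric “absmax” 8-bit quantization in plain Python, not the scheme any particular library uses; the function names and the sample weights are assumptions made for illustration. It shows both the memory-saving integer representation and the cast back to floats that costs time and precision.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights: List[int], scale: float) -> List[float]:
    """Cast the integers back to approximate float values."""
    return [q * scale for q in q_weights]


weights = [0.504, -1.27, 0.033, 0.9]  # illustrative values
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of two, at the price of a small rounding error and an extra multiplication whenever the value is needed as a float again.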

Pruning, on the other hand, works by completely removing some weights from the model.

It is important to remember that under the hood the transformer model does matrix multiplication, so you cannot just remove all entries close to zero and expect the performance to improve, because doing so causes non-sequential memory accesses. A more gentle solution is needed. The PyTorch team has recently published two blog posts about accelerating generative AI, where they go into detail about the available options.
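A toy example shows why naive pruning does not speed anything up by itself. This is a minimal sketch of magnitude pruning on a small matrix in plain Python; the function names, the threshold and the weights are assumptions made for illustration. As long as the matrix is stored densely, the zeroed entries are still multiplied like any other value.

```python
from typing import List


def magnitude_prune(matrix: List[List[float]], threshold: float) -> List[List[float]]:
    """Zero out every weight whose absolute value is below the threshold."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in matrix]


def dense_multiply_count(matrix: List[List[float]]) -> int:
    """Multiplications in a dense matrix-vector product: one per stored entry."""
    return sum(len(row) for row in matrix)


weights = [[0.8, 0.01, -0.5], [0.02, -0.9, 0.003]]  # illustrative values
pruned = magnitude_prune(weights, threshold=0.05)
zeros = sum(w == 0.0 for row in pruned for w in row)

# Half of the weights are now zero, yet the dense product does the same work.
same_work = dense_multiply_count(pruned) == dense_multiply_count(weights)
```

Skipping the zeros requires a sparse storage format, and that is where the irregular memory accesses mentioned above come from.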

Where to deploy?

Real-time chat applications need a highly available data centre or on-premise deployment, but there are some cost-saving techniques if you have offline steps in your data pipeline.

Currently, everyone runs LLM models on either A10 or A100 / H100, but surprisingly, not so many people know that cards from the RTX family are also a good performance choice for such applications.

Unfortunately, NVIDIA knows that, and they put the following statement in their license.

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

But there are companies like vast.ai which offer RTX cards, albeit with lower reliability than, for example, AWS EC2 instances, and you can use them for offline data processing. The default availability filter on vast.ai is set to 90%, while the AWS EC2 Service Level Agreement commits to 99.99%.

The post What you need to know before deploying Open Source LLM appeared first on TantusData.

Unleashing Innovation: A Glimpse into Our Exciting Event Journey

Magdalena Majka — Tue, 04 Jun 2024 09:52:16 +0000

Whether you’ve joined us in the past or are planning to attend our upcoming events, there’s always something exciting on the horizon. Let’s take a look at where we’ve been and where we’re headed next, so that you can gain valuable insights from attending our speeches, participating in our workshops, or exploring our other content. This season, our focus is on Large Language Models (LLMs) and Apache Spark, offering you a wealth of knowledge and practical skills to enhance your expertise. There’s a lot you can gain from engaging with our content and events, so don’t miss out!

Past Events: Highlights and Memories – Top Learning Places to Keep in Your Calendar for Next Year

Big Data Europe, November 2023

We were honored to host two insightful workshops: “ChatGPT, LLMs, and LangChains” and “Apache Spark Performance Tuning”. The event was a phenomenal success, and it was great meeting everyone there. If you missed it or want to relive the experience, check out our exclusive videos here.

SMG Data Summit, January 23-24, 2024

This two-day event, hosted by Google and organized by SMG Swiss Marketplace Group, was a deep dive into the world of data analytics and innovation. We led two workshops: “ChatGPT, LLMs, and LangChains” and “Deep Dive into AI & ML for CxOs, Managers, and Business Leaders”. It was a fantastic opportunity to enhance our data expertise. More details can be found here.

Warsaw IT Days, April 5-6, 2024

Celebrating its 15th anniversary, this iconic event gathered over 10,000 IT and Data Science enthusiasts. Our speeches, “AI Chats – What Nobody Told You: The Conundrums of Business Integration” and “Optimising Apache Spark and SQL for Improved Performance,” were well-received. Discover more about this event here.

If you missed any of these events, there’s still time to sign up for our upcoming ones or look out for next editions. We also invite you to explore the wealth of content already available on our blog and YouTube channel. Dive into our extensive library of articles, videos, and tutorials to stay updated and inspired.

Conf42 LLM, April 11, 2024

Marcin Szymaniuk will be diving into the complexities of integrating AI like ChatGPT into business frameworks at Conf42 LLMs. Learn how to optimize your resources for successful AI adoption. Don’t miss out – details here.

Big Data Technology Warsaw Summit, April 24-25, 2024

Join us for technical presentations, interactive roundtables, and networking opportunities with over 600 attendees. We’ll be leading a roundtable on “What Product Owners and Managers Should Know About ML and LLMs: The Challenges of Business Integration”. Find out more here.

Data Analytics Meeting, May 17-18, 2024

Our keynote, “A Short History of Data – Where Are We Aiming with AI?” will explore the evolution and future of data. This conference supports student development through discussions and networking. More information is available here.

Infoshare, May 22-23, 2024

The biggest tech and startup event in CEE, bringing together thousands of enthusiasts. We’ll be discussing “Optimising Apache Spark and SQL”. Don’t miss out – sign up here.

Jfokus, May 28, 2024

Join us at the Jfokus Training Camp in Stockholm for a focused one-day workshop on “LLMs and LangChains”. This is a great chance for hands-on learning and networking. Secure your spot here.

Conf42 Machine Learning, May 30, 2024

Marcin will share insights on Apache Spark SQL at this prestigious event. Check out the details here.

Upcoming Events: Join Us and Stay Ahead

Voxxed Days Luxembourg, June 20-21, 2024

We’re excited to deliver an LLM workshop at this developer-focused event. Learn more and register here.

PyCon Estonia, September 5-6, 2024

Join us for a workshop on LLMs at one of the largest Python gatherings in the Nordics. More details can be found here.

SREDAY, September 19-20, 2024

We will be offering an LLM workshop at this in-person conference in London. Stay tuned for more updates here.

Looking Ahead

We have more exciting events coming soon and will be adding them to this article as they are confirmed. Stay tuned for updates and new opportunities to connect, learn, and innovate with us.

Trending Topics: Apache Spark and LLMs

Two key topics that garnered significant interest at our events were Apache Spark and LLMs. Due to this high demand, we have dedicated articles on these subjects that are added to the recommendations below.

Stay tuned for more updates and see you at our next event!

The post Unleashing Innovation: A Glimpse into Our Exciting Event Journey appeared first on TantusData.

LLMs can make your business soar or sink, so tread carefully.

Marcin Szymaniuk — Tue, 05 Dec 2023 12:28:30 +0000

What should a Business Leader know before investing in AI?

ChatGPT is everywhere. Chances are, you’re already using it for everyday tasks – its language skills are unmatched. Using it to fix spelling in your email is really easy, but let’s imagine transforming it into a frontline warrior – a chatbot that not only chats but also solves customer problems. We will use this business application of a chatbot as an example in this article.

Applying chatbots to more complex business processes is very tempting but can be full of pitfalls. There are many aspects to consider when building such an application. From the privacy of your data to the cost of implementation. And from the correctness of the responses provided by chat to the maintenance in the future. In this article, we will take all the challenges one by one, describe them and give a guideline on how to approach them pragmatically. This leads to building a successful project, providing a return on investment.

What is a Large Language Model? What is an LLM not?

Remember the last time you chatted with ChatGPT? The conversation was smooth, and it even knew who the US president was. But pose a niche question about, say, ‘how to change the invoice number in my system,’ and it might stumble. Even more, it might confidently tell you something off-base.

I bring that up because it’s essential to understand that ChatGPT or any other general LLM is not a general source of knowledge. Picture this: You wouldn’t ask a history professor about advanced rocket science, right? Similarly, out of the box, an LLM may not ace a deep dive into your specific business.

But it becomes a potent tool when you supply it with information about your domain and instruct it on what to do with that knowledge. A tool that might appear intelligent.

Getting correct answers

Imagine AI as a car and data as its fuel. Without the right fuel, it won’t take you where you want. Suppose you’ve set up your AI for customer support, and someone asks, ‘How do I get the invoice?’. Instead of the AI drawing a blank or guessing, it should know exactly where to fetch the answer – like a librarian who knows which shelf a book is on.

Data is the fuel for AI. For the chat to answer questions specific to your organisation, you need to provide it with the relevant information. Let’s say you want it to serve customer support purposes. Let’s assume the user asks the system, ‘How do I get the invoice?’

The most standard approach to this problem is building an application that searches for relevant information and provides it to the chat. In our case, the application seeks information about the invoicing process. All searching is done in vector databases, which are great tools for finding the appropriate information in text documents. No wonder they’re becoming more popular since the rise of LLMs. Once it finds the clue, it hands it over to our chat.
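For technically minded readers, the search-then-answer loop can be sketched in a few lines. This is a toy illustration, not a real vector database: `embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the two documents are invented examples. The mechanics are the same, though: turn text into vectors, find the closest document, and hand it to the chat as context.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: List[str]) -> str:
    """Return the document most similar to the question."""
    query = embed(question)
    return max(documents, key=lambda doc: cosine(query, embed(doc)))


# Invented example documents standing in for a company knowledge base.
documents = [
    "To get the invoice open Billing and click Download invoice.",
    "To reset your password use the Forgot password link.",
]
question = "How do I get the invoice?"
context = retrieve(question, documents)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```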

Let’s stop here and amplify the message: a general-purpose LLM does not know much about your business. Here’s the gist: While AI is great with words, it only knows the specifics of your business if you tell it. Feed it the correct details to transform it from a general chat buddy to a helpful assistant. Think of it as training a new employee.

In the simple case described above, most of the ‘magic’ is done by vector databases. But in some cases, just providing the information ‘on the fly’ is not good enough. For the LLM to perform better, you must fine-tune the vector embeddings or the LLM model itself. This is more advanced, so we will describe it in a separate article.

Privacy concerns

Imagine sending a personal letter and having someone else read it. That happens when you use ChatGPT or similar tools – the data goes to their home base. No worries if you’re chatting about the weather. But what if it’s confidential customer details? That’s where things get tricky.

Picture a customer sharing personal info on your support chat. Now, where does that data go? Outside your walls? And are you breaking any rules by letting it?

Confirming the legal aspects is the minimum you should do. But if you are not allowed to send the data to third parties, you need to consider a privately hosted LLM so your data never goes outside your data centre. The selection of open-source LLMs that can be privately hosted is broad. Broad selection means you have many options, but at the same time, you need expertise to make wise selections. Especially since the market is hot and keeping up with all the changes is challenging. The right decision requires careful consideration of aspects like the model’s performance on your hardware and the ability to scale up and down if needed.

A cherry on top? Going private means you aren’t tied down by vendor lock-in. You can move your infrastructure to your own data centre or cloud. It’s a critical point in the context of mitigating the risk of a vendor making drastic changes in the pricing or in the offered service itself.

Cost and defining the scope of the project

When building a house, you wouldn’t start with the roof or fancy decor, right? You’d plan the foundation and budget from day one. It’s the same with AI: you’ve got expenses like engineering time, hardware, and cloud, plus possible surprise fees if your supplier changes the rules.

When building an AI application based on Language models, it’s essential to control the project’s scope and expected costs from day one.

It’s easy to dream big. After all, you’ve got all the tools at your fingertips. But to deliver the business value, you need to analyse the problem you are solving and set realistic milestones for the AI project. And it’s essential to start with issues which are common and easy to solve. For example, when building a customer support application, it makes sense to begin by addressing issues which are repeated over and over again. You’d know them if you chat with your support team or skim through past tickets. If you approach the problem correctly, it’s likely that after two or three iterations with your product, you realise that you have solved 80% of the issues. Suddenly, you’ve got a solid ROI and can decide if you want to add those fancy trims or if your house is good enough.

Maintenance

Keep in mind that the more complex the application you are building, the higher not only the implementation cost but also the cost of running it in the future: the maintenance and the price you pay for hardware or API calls. The maintenance might include aspects such as deployment complexity, testing, reacting to changes introduced by third parties or feeding your application with newer data.

That’s another reason to think early about the scope and be flexible during the implementation. So, before diving in, sketch out your costs. Ask yourself: How much is each chatbot chat worth to you? How much would you pay to help one customer if it’s a support bot? What conversion rate justifies the effort and cost of maintaining a chatbot if it’s for shopping? Knowing your numbers helps you steer clear of unwanted surprises.

Summary

As outlined above, it’s critical to understand the challenges when selecting the scope of the LLM project. But with a clear understanding of potential roadblocks from the beginning, you’re paving the way to the ROI you expect. The spectrum of challenges in implementing an LLM to solve business problems in your organisation is vast. To recap:

Ensuring correct behaviour requires providing your LLM with correct, well-structured data.

Addressing privacy concerns means understanding the legal prerequisites. Remember that you have the option of hosting an LLM in your own data centre, which simplifies the legal aspects but introduces extra technical challenges.

Budgeting for a project means considering all costs: development, API charges, and maintenance. It’s important to verify the assumptions after every milestone of the project and make sure the scope and the cost are under control.

Thinking ahead is essential, so estimate maintenance efforts. Ensure these efforts align with the expected benefits of its integration into your organisation.

The post LLMs can make your business soar or sink, so tread carefully. appeared first on TantusData.

NeMo-Guardrails

Bartek Sadlej — Tue, 05 Dec 2023 12:10:04 +0000

Building a dedicated chatbot is both challenging and dangerous. At company X, the model should talk about X’s offer and, ideally, nothing else: this saves cost, keeps throughput free, and makes sure the bot does not insult anyone. It would also be nice to meet all of those requirements while not sacrificing the chatbot’s performance.

The field of LLM-powered bots is new and rapidly evolving, so many different solutions have emerged, but one of them caught our attention: Nvidia NeMo-Guardrails. Its core value is the ability to define rails to guide conversations while being able to connect an LLM to other services seamlessly and securely.

You can check out how to get started using the examples and user guide on its GitHub page, but since it is very new and, at the time of this writing, the current release is alpha 0.5, there are not many resources online on how to build more complex applications. At TantusData, we’ve been using it a lot recently and want to share a few practical tips.

Agenda:

How to make it work with a model of your choice
Multiple bot actions and responses per one user message and output formatting
Two chat histories: one for displaying to the user, different to guide the model
General tips

How to make it work with a model of your choice

We will use Mistral-7B-Instruct to illustrate that. The advantage of using this open-source model is that it comes with an official Docker image, which you can use to self-host it, and the API schema follows the one from OpenAI, so it is super easy to integrate. You can also use this Docker image with any other model from HuggingFace. If it is a gated one, such as Llama-2, remember to run Docker with -e HF_TOKEN=… to get access.

There are two things to cover here—connection to the model and prompting.

The connection consists of two parts: config and implementation. The bare-minimum implementation follows the LangChain LLM interface and goes into the `config.py` file, together with an additional line registering it in Guardrails:

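# config.py
from typing import Any, List, Optional

import openai  # written against the pre-1.0 openai package (openai.Completion)
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from nemoguardrails.llm.providers import register_llm_provider


class Mistral7BInstruct(LLM):
    model: str
    endpoint_url: str
    # also useful to define: temperature ~ 0.0, max_tokens ~ 2K, frequency_penalty ~ 1.

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        # Point the OpenAI client at the self-hosted, OpenAI-compatible endpoint.
        openai.api_key = None
        openai.api_base = self.endpoint_url

        response = openai.Completion.create(
            model=self.model,
            prompt=prompt,
            stop=stop,
            **kwargs,
        )

        return response.choices[0].text

    @property
    def _identifying_params(self):
        # LangChain expects a mapping here; an empty one is the bare minimum.
        return {}

    @property
    def _llm_type(self):
        ...  # LangChain expects a short string naming this LLM type


register_llm_provider("my_engine_name", Mistral7BInstruct)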

Then you need to specify the engine and parameters in the `config.yml` file.

models:
  - type: main
    engine: my_engine_name
    parameters:
      model: mistralai/Mistral-7B-Instruct-v0.1
      endpoint_url: ...

The next thing to cover is prompts.

For now, NeMo-Guardrails works best with `text-davinci-003` (a GPT-3.5 completion model from the same generation as the first ChatGPT). More recent OpenAI models expect different prompts to create structured output, whereas open-source models need stricter instructions on what to do; they won’t automatically spot the pattern in two examples and follow it.

The main challenge is generating the user intent given the current input and the definitions provided in `*.co` files. Some engines already have dedicated prompts; for any other engine, general prompts are used. The problem with the general ones is that they lack explicit instructions on what to do and, as we noticed, less powerful models usually go ahead and try to respond to the user input instead of following the intent pattern.

The solution for Mistral is to include an explicit instruction.

- task: generate_user_intent
  content: |-
    """
    {{ general_instruction }}
    You must write only user intent as shown in the example. Do not respond to the user. Do not write anything else.
    """
    ...

Multiple bot actions and responses per one user message and output formatting

There are situations when we want to execute more than one action or value extraction per round and combine all outputs into the final response. The reason not to just write a wrapper function that does everything at once is the ability to later easily filter or modify parts of the history, which gets automatically inserted into the model prompt. I will cover that in the next section.

Also, when you want to include line breaks for better formatting in the default definition, they will either not be visible or be displayed as a literal ‘\n’, and all bot messages from one round will simply get concatenated.

Now, let’s look at what an example flow might look like:

define bot answer_with_cited_document provide answer
    "Answer: $answer_with_cited_document \n\n"

define bot matches_in_db found matches
    "Found matches: $matches_in_db"

define flow answer with cited documents
    user ask question
    $answer_with_cited_document = ...
    bot $answer_with_cited_document provide answer
    $cited_documents = ...
    $matches_in_db = execute db_search(cited_documents=$cited_documents)
    bot $matches_in_db inform found matches

Here, the printed bot answer after concatenation will look as follows:

“Answer: $answer_with_cited_document \n\nFound matches: $matches_in_db”

The way to achieve better formatting might look like this:

define bot formatted_answer print answer
    "$formatted_answer"

define flow answer with cited documents
    user ask question
    $answer_with_cited_document = ...
    $cited_documents = ...
    $matches_in_db = execute db_search(cited_documents=$cited_documents)
    $formatted_answer = execute format_answer(ans=$answer_with_cited_document, docs=$matches_in_db)
    bot formatted_answer print answer
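The `format_answer` action itself is not shown in the flow, so here is a minimal sketch of what it might look like. Everything about its body is an assumption: in particular, the `title` and `score` fields are hypothetical and depend on what your `db_search` returns. The function would also still need to be registered with Guardrails as an action named `format_answer`.

```python
from typing import Dict, List


def format_answer(ans: str, docs: List[Dict]) -> str:
    """Build the final bot message with explicit line breaks."""
    lines = [f"Answer: {ans}", "", "Found matches:"]
    for doc in docs:
        # 'title' and 'score' are hypothetical fields of a db_search result.
        lines.append(f"- {doc['title']} (score: {doc['score']:.2f})")
    return "\n".join(lines)


message = format_answer(
    "Invoices are issued monthly.",
    [{"title": "Billing FAQ", "score": 0.91}],
)
```

Because the whole message is produced in one place, the line breaks end up inside a single string instead of being spread over several concatenated bot messages.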

Two chat histories: one for displaying to the user, different to guide the model

The most straightforward reason we must do something with history at some point is that LLMs have limited context. But apart from that, you should understand what gets inserted into the model’s prompt and whether you are wasting tokens unnecessarily.

By default, NeMo Guardrails inserts the action output into the prompt in this format:

execute db_search
# The result was /* Full result returned from action here */

If our db_search returns a massive JSON, we have a problem. In the long run, it will fill up the context, but even before that, it can distract the model from paying attention to relevant parts.

It depends on the particular use case, but suppose all you want is to display the results with, e.g., links and scores. Left unchanged, the search results will be inserted into the prompt twice: once after action execution and a second time as the final bot answer. If you use an additional action for output formatting, they will be inserted three times!

We can take advantage of filters to adjust that.

In general prompts, you can find templates like this:

# prompts_general.yml

{{ history | colang }}

This takes the whole history and renders it into the prompt in colang.

To filter or modify some events, one can add a custom filter in such a way:

# config.py
from copy import deepcopy
from typing import List

from nemoguardrails import LLMRails


def modify_actions(events: List[dict]) -> List[dict]:
    events = deepcopy(events)

    # filter formatting since we will see the exact same string as the final bot answer
    events = [
        event for event in events
        if not (event['type'] == 'InternalSystemActionFinished'
                and event['action_name'] == 'format_answer')
    ]

    for event in events:
        if event['type'] == 'InternalSystemActionFinished' and event['action_name'] == "your_action_name_here":
            # modify() is a placeholder for your own transformation of the result
            event['return_value'] = modify(event['return_value'])

    return events


def init(llm_rails: LLMRails):
    llm_rails.register_filter(modify_actions, "modify_actions")

And then use it in prompts like this:

# prompts_general.yml

{{ history | modify_actions | colang }}

We use deepcopy because in-place dictionary modifications such as my_dict['key'] = val change the object passed to the function. Without the copy, we would have to check in later chat rounds whether the value had already been modified.
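As one possible shape for the modification step, the sketch below trims a large JSON search result down to a few fields before it reaches the prompt. The event layout (type, action_name, return_value) follows the filter above. The assumption that return_value is a JSON string, the field names title and score, the MAX_RESULTS limit and both function names are made up for illustration.

```python
import json
from copy import deepcopy
from typing import List

MAX_RESULTS = 3  # assumption: how many hits are worth showing to the model


def shorten_search_result(return_value: str) -> str:
    """Keep only a few fields of the first few hits (field names are hypothetical)."""
    hits = json.loads(return_value)
    trimmed = [
        {"title": hit.get("title"), "score": hit.get("score")}
        for hit in hits[:MAX_RESULTS]
    ]
    return json.dumps(trimmed)


def shorten_db_search(events: List[dict]) -> List[dict]:
    # Work on a copy so the stored history keeps the full result.
    events = deepcopy(events)
    for event in events:
        if (
            event.get("type") == "InternalSystemActionFinished"
            and event.get("action_name") == "db_search"
        ):
            event["return_value"] = shorten_search_result(event["return_value"])
    return events


# A fake action result with ten bulky hits.
raw = json.dumps(
    [{"title": f"doc {i}", "link": f"/docs/{i}", "score": 0.9, "body": "x" * 500} for i in range(10)]
)
history = [{"type": "InternalSystemActionFinished", "action_name": "db_search", "return_value": raw}]
filtered = shorten_db_search(history)
print(len(history[0]["return_value"]), "->", len(filtered[0]["return_value"]))
```

Because of the deepcopy, the original history still holds the full result, so the user-facing output can keep links and scores while the prompt only sees the short version.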

Sometimes, it does make sense to clean up the whole history. For example, a user intends to start from the beginning and send a new request. Without history cleaning, previously entered information might produce incorrect prompts and cause irrelevant search results. To achieve that, we define the following flow in the colang file:

define user start new search
    "I'd like to start a new search"
    "May I look for something different"
    "I want to try other conditions"
    "Forget all I asked before"

After that, you can extend the modify_actions function with the following extract:

    history = []
    for event in events:
        history.append(event)
        if event['type'] == "UserIntent" and event['intent'] == "start new search":
            # Forget everything collected so far, including this request.
            history.clear()
    return history
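To see what this extract does in isolation, here is a small standalone check. The sample events are simplified assumptions (real NeMo Guardrails events carry more fields), and the function name reset_on_new_search is hypothetical.

```python
from typing import List


def reset_on_new_search(events: List[dict]) -> List[dict]:
    history = []
    for event in events:
        history.append(event)
        if event.get("type") == "UserIntent" and event.get("intent") == "start new search":
            # Drop everything seen so far, including the reset request itself.
            history.clear()
    return history


events = [
    {"type": "UserIntent", "intent": "ask question"},
    {"type": "InternalSystemActionFinished", "action_name": "db_search"},
    {"type": "UserIntent", "intent": "start new search"},
    {"type": "UserIntent", "intent": "ask question"},
]
kept = reset_on_new_search(events)
print([event["intent"] for event in kept])
```

Only the events after the last "start new search" intent survive, so earlier search results can no longer leak into the prompt.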

General tips

When working with NeMo Guardrails, you may also find these tips useful.

Use chat mode instead of server mode when developing. It makes errors easier to spot and highlights the output in verbose mode.
Take advantage of Python’s logging module. Guardrails prints a lot in verbose mode, and configuring different files as output for different modules makes reading much more convenient.
When using a custom LLM, explicitly log its inputs and outputs, as this is the most fragile part of Guardrails. If your model is not following the colang pattern for getting the user intent, you can’t move forward.

The post NeMo-Guardrails appeared first on TantusData.

What if the data is too large for the LLM context?

Bartek Sadlej — Thu, 28 Sep 2023 11:00:00 +0000

In the previous article, we covered extracting information from unstructured data. However, this is just the tip of the iceberg. Another problem can arise when you have long documents which don’t fit into the embedding model context length. The natural move in such a situation is splitting the documents into multiple parts. Another reason for using this technique is when the entire document does not create good enough embeddings. Last but not least, you might want to extract smaller chunks in order to lower the token usage.

The subject of a document or paragraph is usually stated at the beginning of the section and does not show up in the later parts. So when we just split the document into multiple parts, we are likely to end up with lots of chunks that lack contextual information, for example:

Subscription prices for 1 month:
20 USD / month


Subscription prices for 1 year:
200 USD / year

In the snippet above, we see the price, but we lack information about what the price is for (a TV subscription? a broadband subscription?).

When we query a vector database and provide the result to the LLM application, we will likely see that this document seems relevant to TV, broadband or mobile subscription requests. The reason is that we get a high cosine similarity score for any query related to the subscription price. So here we go:

from typing import List

from langchain.chains import RetrievalQA
from langchain.docstore.document import Document
from langchain.schema.retriever import BaseRetriever
from langchain.chat_models import ChatOpenAI


# The context-free example document shown above.
doc = Document(page_content="""
Subscription prices for 1 month:
20 USD / month


Subscription prices for 1 year:
200 USD / year
""")


class ConstRetriever(BaseRetriever):
    # Always returns the same document, regardless of the query.
    def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
        return [doc]


llm = ChatOpenAI(model_name="gpt-4")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)


offers = ["TV", "Internet", "Car", "Gym membership"]


for offer in offers:
    res = qa(f"What is the {offer} subscription price for one year?")['result']
    print(res)

The subscription price for one year is 200 USD.
The subscription price for one year is 200 USD.
The context does not provide information on the car subscription price for one year.
The context provided does not specify what the subscription prices are for, such as a gym membership. Therefore, I can’t provide the exact price for a gym membership subscription for one year.

All those queries have ~0.85-0.9 cosine similarity with the example document. This document is ‘close enough’ and gets provided as input to the LLM, which then has to decide how to answer the question. Moreover, the document content is not enough to say what the price is for, so the best you can expect is ‘I don’t know’, so that at least the model does not make up information it does not have in the first place. That answer is not satisfying either: we do have the information about the prices, and we would like the chat to answer with it. We just have to find a better way of providing it with the correct information.

How do we tackle this problem?

The most straightforward solution is to include additional context in each part after splitting. For example, we can add “Details for the TV offer:” if those prices come from such an offer. It helps with hallucination problems, but the similarity score may remain high for such documents even for unrelated queries, which can prevent the retriever from fetching the most relevant documents. The model will then answer that it does not have enough context to answer the question.
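A minimal sketch of this first approach, in plain Python and without any LangChain splitter: split on blank lines and prefix every chunk with the same context header. The function name split_with_context and the blank-line splitting rule are assumptions for illustration.

```python
from typing import List


def split_with_context(text: str, header: str) -> List[str]:
    """Split on blank lines and prefix every chunk with the same context header."""
    chunks = [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]
    return [f"{header}\n{chunk}" for chunk in chunks]


offer_text = """Subscription prices for 1 month:
20 USD / month

Subscription prices for 1 year:
200 USD / year"""

chunks = split_with_context(offer_text, "Details for the TV offer:")
for chunk in chunks:
    print(chunk)
    print("---")
```

Each chunk now says which offer it belongs to, so the model no longer has to guess what the price is for.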

Another solution is to include document metadata and use a feature called self-query.

Here, instead of including context information directly in the document text, we set it as an additional filtering index and use the LLM to produce a relevant query.

The difference is that even though the document has a high similarity score, it will not get fetched, and the retriever can provide genuinely relevant data sources. In other words, instead of relying on the vector store to provide relevant documents only by their embeddings’ similarity to the query, we add an extra index and provide the LLM with its description. The model can then decide whether to use it and with what arguments.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo


embeddings = OpenAIEmbeddings()


page_content="""
Subscription prices for 1 month:
20 USD / month


Subscription prices for 1 year:
200 USD / year
"""


docs = [
   Document(
       page_content=page_content,
       metadata={
           "product": "TV",
       },
   ),
]
vectorstore = Weaviate.from_documents(
   docs, embeddings, weaviate_url="http://127.0.0.1:8080"
)


metadata_field_info = [
   AttributeInfo(
       name="product",
       description="The name of the product for which the subscription prices are",
       type="string",
   ),
]
document_content_description = "Details for all the products offers"
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0., verbose=True)
retriever = SelfQueryRetriever.from_llm(
   llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
qa_with_self_query = RetrievalQA.from_llm(llm, retriever=retriever, return_source_documents=True)


for offer in ["TV", "Internet", "Car", "Gym membership"]:
    res = qa_with_self_query(f"What is the {offer} subscription price for one year?")
    print(f"n docs: {len(res['source_documents'])}, answer: {res['result']}")

This is the result produced by the code above:

query='TV subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='TV') limit=None
n docs: 1, answer: The TV subscription price for one year is 200 USD.
query='Internet subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Internet') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information …
query='Car subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Car') limit=None
n docs: 0, answer: I'm sorry, but I don't have enough information …
query='Gym membership subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Gym membership') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information for gym memberships. …
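Conceptually, the equality filter in the output above narrows the set of candidate documents by metadata, so a chunk with a high similarity score but the wrong product never reaches the model. The sketch below imitates that with plain dictionaries. It is not how LangChain or Weaviate implement filtering, and the function name filter_by_metadata is made up for illustration.

```python
from typing import Dict, List

documents: List[Dict] = [
    {"text": "Subscription prices for 1 year: 200 USD / year", "metadata": {"product": "TV"}},
]


def filter_by_metadata(docs: List[Dict], attribute: str, value: str) -> List[Dict]:
    """Equality filter, analogous in spirit to the 'eq' comparison in the output."""
    return [doc for doc in docs if doc["metadata"].get(attribute) == value]


for product in ["TV", "Internet", "Car", "Gym membership"]:
    candidates = filter_by_metadata(documents, "product", product)
    print(f"{product}: {len(candidates)} candidate document(s)")
```

Only the TV query keeps a candidate, which matches the "n docs" counts reported above.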

The drawback of this approach is that it is significantly more expensive because additional LLM calls are needed to construct the query. With LangChain’s OpenAI callback, we can easily monitor the API usage: the first solution costs ~0.00015 USD per question, whereas the second costs ~0.0015 USD, a 10x increase.
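To put these per-question figures into perspective, here is a back-of-the-envelope calculation. The two costs come from the measurements above; the monthly question volume is an assumption chosen purely for illustration.

```python
COST_PLAIN = 0.00015       # USD per question, plain retrieval (figure from the text)
COST_SELF_QUERY = 0.0015   # USD per question, self-query (figure from the text)
QUESTIONS_PER_MONTH = 100_000  # assumption, purely illustrative

ratio = COST_SELF_QUERY / COST_PLAIN
monthly_plain = COST_PLAIN * QUESTIONS_PER_MONTH
monthly_self_query = COST_SELF_QUERY * QUESTIONS_PER_MONTH

print(f"ratio: {ratio:.0f}x")
print(f"plain: {monthly_plain:.2f} USD, self-query: {monthly_self_query:.2f} USD")
```

At small volumes the difference is negligible in absolute terms; it starts to matter once the chatbot handles a large number of questions.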

We also have to keep in mind that creating metadata for split documents might not be trivial and may need human supervision.

All things considered, it’s not a surprise that an LLM will only be as good as the data you provide to it: the more detailed and relevant information you can provide, the higher the chance of getting a good response. Self-querying is a powerful technique which might be useful in the project you are working on. The exact decision on how to provide metadata and whether to use self-querying depends on the specific business problem to be solved.

The post What if the data is too large for the LLM context? appeared first on TantusData.

What Data Format is suitable for LLM?

Bartek Sadlej — Tue, 26 Sep 2023 09:23:17 +0000

Unpacking LLM: From Hype to Reality

Since ChatGPT and the recent release of Llama-v2 models, it is becoming increasingly popular to build context-aware LLM applications. One such use case is Question Answering over documents. Many focus on cool prototype examples where, after feeding lots of Wikipedia articles to the vector database, one can make sure that Joe Biden is indeed the president of the United States. Only a few focus on current limitations and unsolved problems, which make it challenging to arrive at production-ready applications.

At TantusData, we have been paying close attention to finding weak spots that need to be solved to provide desired functionality. In the upcoming articles, we will be presenting them. In this article, we will start with the challenges related to the data format.

The code examples use LangChain, currently the most popular library for creating LLM applications.

Data Format

Imagine that you are building a chatbot to answer users’ questions based on company offers.

The question we will be testing is:

“What are the prices for the internet subscription?”

Let’s assume for now that our database contains the document with relevant data:

TV Prices:


Subscription prices for 1 month:
L - 40 USD / month
M - 30 USD / month
S - 20 USD / month


Subscription prices for 1 year:
L - 400 USD / year
M - 300 USD / year
S - 200 USD / year


Internet Prices:


Subscription prices for 1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month


Subscription prices for 1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year


Phone prices:


Subscription prices for 1 month:
L - 18 USD / month
M - 12 USD / month
S - 6 USD / month


Subscription prices for 1 year:
L - 180 USD / year
M - 120 USD / year
S - 60 USD / year

And when we provide it as a context to the question, we get the desired answer:

from langchain.chains import RetrievalQA
….


doc = Document(page_content=...)


class ConstRetriever(BaseRetriever):
   def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
       return [doc]


llm = ChatOpenAI(model_name="gpt-3.5-turbo")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)
print(qa("What are the prices for the internet subscription?")['result'])

The code produces the following answer from the chat:

The prices for the internet subscription are as follows:

1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month

1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year

Great, it worked. So we are good to go? Well, not really. The hidden problem is that we usually don’t have the relevant documents in such a nice text format. Usually, the data comes from scraping web pages or parsing PDFs, and it might originally be displayed as a table. When you think about it, the reasons are often quite natural. The idea often comes from a business unit which would like to limit customer service efforts. Customer service works with these documents, and they are easy for humans to read. So, when we just mimic what a person does with the document, we might put ourselves in a tricky situation.

What we can do is pick one of the available loaders in LangChain, but we can end up with text that is not convenient for humans to read. Maybe it will be good enough for the model? Let’s see.

Let’s look at an example of a table in a PDF document.

Service	Period	Subscription	Price
TV	Month	S	20 USD
		M	30 USD
		L	400 USD
	Year	S	200 USD
		M	300 USD
		L	400 USD
Internet	Month	S	10 USD
		M	25 USD
		L	50 USD
	Year	S	100 USD
		M	250 USD
		L	500 USD
Phone	Month	S	6 USD
		M	12 USD
		L	18 USD
	Year	S	60 USD
		M	120 USD
		L	180 USD

Table 1

When we try to extract the text information from the table, the output depends on the PDF loader we selected:

UnstructuredPDFLoader:
Service Period month TV year month Internet year month Phone year Subscription Price 20 USD 30 USD 400 USD 200 USD 300 USD 400 USD 10 USD 25 USD 50 USD 100 USD 250 USD 500 USD 6 USD 12 USD 18 USD 60 USD 120 USD 180 USD

PDFMinerLoader:
Service Period Subscription Price TV Internet Phone month year month year month year S M L S M L S M L S M L S M L S M L 20 USD 30 USD 400 USD 200 USD 300 USD 400 USD 10 USD 25 USD 50 USD 100 USD 250 USD 500 USD 6 USD 12 USD 18 USD 60 USD 120 USD 180 USD

PDFPlumberLoader:
Service Period Subscription Price S 20 USD month M 30 USD L 400 USD TV S 200 USD year M 300 USD L 400 USD S 10 USD month M 25 USD L 50 USD Internet S 100 USD year M 250 USD L 500 USD S 6 USD month M 12 USD L 18 USD Phone S 60 USD year M 120 USD L 180 USD

PyPDFLoader:
Service Period Subscription Price TVmonthS 20 USD M 30 USD L 400 USD yearS 200 USD M 300 USD L 400 USD InternetmonthS 10 USD M 25 USD L 50 USD yearS 100 USD M 250 USD L 500 USD PhonemonthS 6 USD M 12 USD L 18 USD yearS 60 USD M 120 USD L 180 USD

Table 2

As you probably noticed, there are significant differences in the results.

It is probably not what we would expect. However, let’s check if the model can still get the correct answer:

loader	gpt-3.5-turbo	gpt-4
UnstructuredPDFLoader	The prices for the internet subscription are as follows: – 20 USD per month – 200 USD per year – 400 USD for 2 years	The text doesn’t provide specific prices for an internet subscription.
PDFMinerLoader	The prices for the internet subscription are as follows: – Small (S) package: $20 per month or $200 per year – Medium (M) package: $30 per month or $300 per year – Large (L) package: $40 per month or $400 per year	The prices for the internet subscription are: – Small (S) size: 200 USD per month / 100 USD per year – Medium (M) size: 300 USD per month / 250 USD per year – Large (L) size: 400 USD per month / 500 USD per year
PDFPlumberLoader	The prices for the internet subscription are as follows: – S: 100 USD per year or 6 USD per month – M: 250 USD per year or 12 USD per month – L: 500 USD per year or 18 USD per month	The prices for the internet subscription are: For the S plan: 100 USD per year or 6 USD per month For the M plan: 250 USD per year or 12 USD per month For the L plan: 500 USD per year or 18 USD per month
PyPDFLoader	The prices for the internet subscription are as follows: – For the S (Small) plan: – Monthly subscription: 10 USD – Yearly subscription: 100 USD – For the M (Medium) plan: – Monthly subscription: 25 USD – Yearly subscription: 250 USD – For the L (Large) plan: – Monthly subscription: 50 USD – Yearly subscription: 500 USD	The prices for the internet subscription are: For a month: – S: 10 USD – M: 25 USD – L: 50 USD For a year: – S: 100 USD – M: 250 USD – L: 500 USD

Table 3

As we can see, only one of the four loaders (PyPDFLoader) managed to parse the file in a way the chat could understand.

That is why it is often not straightforward to create a reliable data source for a question-answering model, and one should carefully investigate the format because it usually needs to be corrected. The problem is also easy to overlook because the failure is silent: the model still produces a confident-looking answer.
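One cheap way to make such silent failures visible is to compare the model's answer with values you already know to be correct. The sketch below only checks that every expected price appears somewhere in the answer; it does not verify that a price is attached to the right plan or period. The function name missing_prices is hypothetical, and the two answers are condensed from Table 3 (gpt-4 with PyPDFLoader and with PDFPlumberLoader).

```python
import re
from typing import List


def missing_prices(answer: str, expected_prices: List[int]) -> List[int]:
    """Return the expected prices that do not appear as numbers in the answer."""
    found = {int(number) for number in re.findall(r"\d+", answer)}
    return [price for price in expected_prices if price not in found]


# Internet prices from the source table.
expected_internet = [10, 25, 50, 100, 250, 500]

good_answer = (
    "For a month: S: 10 USD, M: 25 USD, L: 50 USD. "
    "For a year: S: 100 USD, M: 250 USD, L: 500 USD"
)
bad_answer = (
    "S: 100 USD per year or 6 USD per month, "
    "M: 250 USD per year or 12 USD per month, "
    "L: 500 USD per year or 18 USD per month"
)

print(missing_prices(good_answer, expected_internet))  # prints []
print(missing_prices(bad_answer, expected_internet))   # prints [10, 25, 50]
```

A check like this does not replace a review by domain experts, but it can flag a loader that scrambles the table before anyone sees the answers.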

In summary: what can we do about the situation described?

First of all, careful engineering and spotting that the problem exists is a must. If you expected some shortcuts, I’m sorry to disappoint you. It’s very easy to build a system that hallucinates very convincingly.
Plan to test the result with domain experts.
Very likely, the PDF is generated from some billing system database. Relying on that database directly will simplify the data extraction and make it much easier to follow updates.
Even if you rely on a database, you should still have data quality checks in place.
Last but not least, remember that Generative AI does not solve all possible computer science problems. In the case above, we are struggling with searching for documents in the first place. This problem still needs careful engineering: only when you provide some reasonable structure can you benefit from the LLM magic, that is, human-like interactions and understanding how the data fits into the question being asked.
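If the prices are available as structured rows, for example from a billing database, generating model-friendly text is straightforward. The sketch below renders rows into the plain-text layout that worked at the beginning of this article. The row layout (service, period, plan, price) and the function name render_offers are assumptions; a real billing schema will look different.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Hypothetical rows: (service, period, plan, price in USD)
rows = [
    ("Internet", "month", "L", 50),
    ("Internet", "month", "M", 25),
    ("Internet", "month", "S", 10),
    ("Internet", "year", "L", 500),
    ("Internet", "year", "M", 250),
    ("Internet", "year", "S", 100),
]


def render_offers(rows: Iterable[Tuple[str, str, str, int]]) -> str:
    # Group by service first, then by period, keeping the input order.
    grouped = defaultdict(lambda: defaultdict(list))
    for service, period, plan, price in rows:
        grouped[service][period].append((plan, price))

    lines = []
    for service, periods in grouped.items():
        lines.append(f"{service} prices:")
        for period, plans in periods.items():
            lines.append(f"Subscription prices for 1 {period}:")
            lines.extend(f"{plan} - {price} USD / {period}" for plan, price in plans)
    return "\n".join(lines)


text = render_offers(rows)
print(text)
```

Every price line now carries its service, period and plan explicitly, so nothing depends on how a PDF loader happens to order the table cells.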

The post What Data Format is suitable for LLM? appeared first on TantusData.