The technicalities Archives - TantusData

What you need to know before deploying Open Source LLM

Bartek Sadlej — Sat, 19 Oct 2024 13:57:27 +0000

There are a few key questions which need to be thoroughly understood and answered before selecting a large language model to be used for building an application:

License – because you don’t want to end up in a legal trap
Expectations: accuracy, speed and cost tradeoffs
Understanding of benchmarks the model was evaluated on – so you don’t get surprised when evaluating the model with your users on your data
Deployment options – because building a PoC you run on your laptop is often far from production deployment.

License

This sounds easy; open source is open, as the name suggests. Well, not exactly. Ensure that the model you choose can be used as you want. For example, there is a statement in the Llama-2 license:

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

This means that if you start with Llama-2 or its fine-tuned successors and later, at some point in time, decide to switch to a different model, you are not allowed to use your historical data to train the new LLM. Or are you? If you were to modify one line of model code, it is no longer the original Llama Material and so on. In general, AI and code base product regulations are hard to assess and interpret, so it is probably safe to try finding a model with an Apache License or an even more permissive license first, if possible.

Define your expectation: accuracy, speed and cost tradeoffs.

It is tempting to dream big, especially for non-technical people who have seen the recent OpenAI Dev Day announcements of GPTs and Assistants, or Google’s Gemini and Lumiere models. But in reality, meeting excessive expectations is challenging and often impossible. Going from 0% to 90% AI automation is difficult but doable; closing the gap between 90% and 100% is exceptionally demanding.

Think about GitHub Copilot. It won’t write a project for you, but its blazingly fast few-line completions, which usually require little adjusting, make engineers far more productive.

Ask the question: does my model need to figure out the nitty-gritty details? Or can it leave some space for users to interact and fill in the missing gaps, while still being a valuable product?

Maybe some parts of the pipeline can be postponed and implemented as batch jobs? The cost reductions might be significant in this case since closed model providers don’t offer lower prices for batch requests. In-house LLM allows you to process tasks offline in batches.

The recently released small open-source LLMs, such as Llama-2 and Mistral, or their further fine-tuned versions, like Zephyr and OpenHermes-2.5, are a perfect match for such a scenario. If you cannot compromise on accuracy, maybe there is a way to algorithmically fix weak spots.

On the other hand, it might be valuable to provide users with a few different model outputs or allow them to iterate and guide the suggestions quickly, such as with GitHub Copilot. GPT-4 is powerful, but calling it a few times in a row would take minutes. Smaller models allow you to do such things. Recent features from Hugging Face and Nvidia can run Llama-v2-13b at an unbelievable speed of 1200 tokens per second.

Understand the benchmarks the model was evaluated on

When choosing the model, you will probably focus on its size, performance, and ‘the vibe’ – whether the model’s responses generally feel good. The performance is most often checked using the results of well-known benchmarks.

What are the weak spots of this approach?

First, the ML lab releasing a model does not always publish the training data, or even precise information about what the model was trained on. It is often the case that we only see that ‘the model was trained on a well-curated corpus of X tokens’. And because those benchmarks are so popular, there is a possibility of some leakage into the training set. Not necessarily the whole corpus at once, but, for example, an automatic web crawl can contain conversations from Reddit or X/Twitter feeds about a particular task where people are discussing some parts of the benchmark.

Secondly, keep in mind that, in general, it is hard to benchmark written text automatically.

To uncover that, it is crucial to understand how each of those benchmarks is created and what it measures.

Let’s see an example question from one of the most popular ones, the MMLU (Massive Multitask Language Understanding):

Question: Glucose is transported into the muscle cell:

Choices:
A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarboxylic acid transporters.

Correct answer: A

And let’s ask ChatGPT:

Good? The expected answer is just “A”, so even though the model gets it right in its longer reply, an automatic evaluation that looks for the bare letter would score it as a failure!

We will not do a deep dive into how the evaluation is actually done here; there is an excellent blog post on HuggingFace explaining it in detail. You should just know that it requires reading the bare probabilities of the next tokens, which means using the model through code in a different way than you would interact with it through chat in some WebUI.

So, the key takeaway is that while those benchmarks provide us with a general ranking of models’ performance, one should pay close attention to how they are evaluated and whether this form of evaluation is meaningful for their use case.
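
To make the difference between the two scoring styles concrete, here is a minimal sketch. The log-probabilities are made-up numbers for illustration (they are not measured from any model), and the helper names `exact_match` and `pick_by_logprob` are hypothetical.

```python
import math
from typing import Mapping, Sequence


def exact_match(generated: str, gold: str) -> bool:
    """Naive scoring: the generated text must be exactly the gold letter."""
    return generated.strip() == gold


def pick_by_logprob(next_token_logprobs: Mapping[str, float],
                    options: Sequence[str] = ("A", "B", "C", "D")) -> str:
    """Benchmark-style scoring: compare only the option letters' next-token log-probabilities."""
    return max(options, key=lambda option: next_token_logprobs.get(option, -math.inf))


# A chat-style reply: correct in substance, but not the bare letter.
chat_reply = "Glucose enters the muscle cell via protein transporters called GLUT4."

# Hypothetical next-token log-probabilities after a prompt ending in "Correct answer:".
toy_logprobs = {"A": -0.4, "B": -2.1, "C": -3.0, "D": -3.5}

print(exact_match(chat_reply, "A"))   # False
print(pick_by_logprob(toy_logprobs))  # A
```

The same model can therefore fail one style of evaluation and pass the other, which is why it is worth checking which style a reported score is based on.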

For example, ChatGPT is a so-called Instruction-Finetuned model tuned to follow user instructions and interact with them. If you put a phrase:

Can you help me with that:
{arbitrary problem description}

It will very likely start its response with:

Certainly! {probably a good solution to your problem}

And if you were to check the token probabilities for options A, B, C, and D from the above-mentioned MMLU example, as it is done in one implementation of MMLU, you would get C! Not because the model thinks the completion for “Correct answer” is C, but because it wants to start with “Certainly!”.

Deployment options

Last but not least, let’s talk about inference. When you have chosen and maybe even fine-tuned your model further, it’s time to answer the question of what exactly you want to deploy and where.

The ‘what exactly’ part is essential. To start with, you probably have an X-billion-parameter model in (b)float16. There are two options for improvement here: quantization and pruning.

Quantization converts some 16-bit weights into 8 or 4 bits so you can run the model on a smaller and cheaper GPU. Of course, by doing so, we lose some information and accuracy. It can be done automatically using some general formulas, or you can specify an evaluation dataset to quantize in a way that reduces some metrics the least.
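
As a rough illustration of the "general formula" flavour, the sketch below applies absmax quantization to 8-bit integers. It uses plain Python floats and a made-up weight list; real libraries work per tensor or per block and use more refined schemes, so treat this as an assumption-laden toy rather than what any particular library does.

```python
from typing import List, Tuple


def quantize_absmax_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        # All-zero input: any scale works, so avoid dividing by zero.
        return [0 for _ in weights], 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.635, 0.9]  # made-up example weights
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

# The rounding step is where information is lost: at most half a step per weight.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The error is bounded by half of the quantization step, which is why a few outlier weights (which inflate the step) hurt accuracy the most.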

It is important to note that currently, on most hardware, quantization reduces memory usage but also reduces inference speed: although the model weights are much smaller, some values must be cast back and forth. Still, it allows you to run inference on, or fine-tune, the model on cheaper hardware or just on the hardware that is available, since it might be hard for new players to get access to A100 & H100 clusters.

Both ways are available in the HuggingFace library and can be easily applied. Here, you can find the blog post going through their pros and cons and inference speed/memory comparison.

Pruning, on the other hand, works by completely removing some weights from the model.

It is important to remember that the transformer model under the hood does matrix multiplication, so you cannot just remove all entries close to zero and expect the performance to improve, because doing so causes non-sequential memory accesses. A more gentle solution is needed. The PyTorch team has recently posted two blog posts about accelerating Generative AI, where they go into detail about available options.
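
One hardware-friendly alternative to removing arbitrary entries is a fixed sparsity pattern. The sketch below assumes a 2:4 pattern (two non-zero values kept in every group of four), picked here purely as an illustration and not as a summary of the PyTorch posts; the function name is hypothetical.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every consecutive group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within the group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(value if i in keep else 0.0 for i, value in enumerate(group))
    return pruned


row = [0.9, -0.01, 0.3, 0.02, -0.5, 0.4, 0.05, -0.6]  # made-up weights
print(prune_2_of_4(row))  # [0.9, 0.0, 0.3, 0.0, -0.5, 0.0, 0.0, -0.6]
```

Because the position of the zeros is predictable, the remaining values can be stored compactly and read sequentially, which avoids the irregular memory access problem.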

Where to deploy?

Though real-time chat applications need data centre or on-premise deployment with high availability, there are some cost-saving techniques if you have offline steps in your data pipeline.

Currently, everyone runs LLM models on either A10 or A100 / H100, but surprisingly, not so many people know that cards from the RTX family are also a good performance choice for such applications.

Unfortunately, NVIDIA knows that, and they put the following statements in their license.

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

But there are companies like vast.ai which offer RTX cards, although with lower reliability than, for example, AWS EC2 instances, and which you can use for offline data processing. The default availability filter there is set to 90%, while the AWS EC2 Service Level Agreement commits to 99.99%.
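
To get a feeling for the gap between the two numbers, the sketch below converts an availability percentage into hours of potential downtime. It reads both figures as plain availability percentages over a 30-day month, which is a simplification: a marketplace filter and an SLA commitment are not measured in the same way.

```python
HOURS_PER_30_DAY_MONTH = 30 * 24


def expected_downtime_hours(availability_percent: float,
                            period_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Hours per period during which the machine may be unavailable."""
    return period_hours * (1.0 - availability_percent / 100.0)


print(round(expected_downtime_hours(90.0), 3))   # 72.0
print(round(expected_downtime_hours(99.99), 3))  # 0.072
```

Three days of potential downtime per month is rarely acceptable for a chat application, but it can be perfectly fine for a batch job that is simply retried.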

The post What you need to know before deploying Open Source LLM appeared first on TantusData.

RDD in Apache Spark

Amadeusz Kosik — Fri, 28 Jun 2024 07:33:40 +0000

Do you want to see the number of partitions? Or the partition size in rows from within the job? Or maybe you just like some low-level hacking? The RDD API is still there and accessible via the .rdd method.

Why is RDD API still around?

Despite the deprecation of the RDD API for public use, RDDs are still used in the internals of the Apache Spark engine (at least in its open-source part – see our article on Photon). The API is also available via the .rdd method of Datasets and, therefore, of DataFrames as well. Keep in mind, though, that the number of actually useful operations not available from the Dataset API is really low. Currently, excluding some low-level or Spark internals hacking, it boils down to checking the partition count and the partition sizes – using the getNumPartitions method or the glom operator, respectively.

Check number of partitions

RDD API still keeps the getNumPartitions method for that use case:

println(inputData.rdd.getNumPartitions)

Check number of rows per partition

The glom function coalesces all rows in each partition into an array, so it can be used to check the number of rows per partition. In the case of wide rows, consider using select to limit the number of columns first, as RDDs are not optimized by most of Spark’s mechanisms.

inputData.rdd.glom().map(_.length).collect().foreach(println _)


Datasets and DataFrames

Amadeusz Kosik — Tue, 11 Jun 2024 06:41:52 +0000

With the deprecation of the public use of the good old RDD API, Spark users are left with two options: typed Datasets and untyped DataFrames (which are actually a specific case of Datasets). The API also allows users to freely cast one to another – e.g. using the .as[T] method to cast an untyped DataFrame to a Dataset[T]. The cast does not change the underlying data, though, and can lead to surprising results if one is not aware of that.

What does .as[T] do?

Let’s start by looking at the source (code):

Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:
When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
When U is a primitive type (i.e. String, Int, etc), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
Since: 1.6.0

The last note is crucial: casting to a Dataset does not change the underlying data. Any columns not present in the T (e.g., columns without a corresponding field in the case class) will not be discarded.

Why bother?

There are a few situations where having extra columns may be surprising and create problems with a job run (or even worse – silently introduce data quality issues):

running union or unionAll transformations on non-aligned data,
calling distinct (it will check for hidden columns’ uniqueness as well),
saving data (will include extra columns).
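
The distinct case can be illustrated with a plain-Python model of the behaviour (this is not Spark code). The rows, the extra `load_ts` column and the helper name are made up for the example.

```python
from typing import Dict, List, Tuple

ARTIST_FIELDS = ("id", "name", "location")  # fields of the domain class

# Rows as they exist in the underlying data: note the extra "load_ts" column.
rows: List[Dict[str, str]] = [
    {"id": "1", "name": "Nina", "location": "Paris", "load_ts": "2024-06-01"},
    {"id": "1", "name": "Nina", "location": "Paris", "load_ts": "2024-06-02"},
]


def distinct(data: List[Dict[str, str]], columns: Tuple[str, ...]) -> int:
    """Count unique rows when only the given columns are compared."""
    return len({tuple(row[c] for c in columns) for row in data})


all_columns = tuple(rows[0].keys())
print(distinct(rows, all_columns))    # 2
print(distinct(rows, ARTIST_FIELDS))  # 1
```

Through the typed view, both rows look like the same artist, yet a distinct over the underlying data still treats them as two different rows.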

A defensive version of .as[T]

The simple version of a defensive cast (meaning: one that adjusts the schema to the provided domain class) would be one with a .select() transformation call:

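import org.apache.spark.sql.{DataFrame, Dataset}

case class Artist(id: String, name: String, location: String)

def toArtistsDefensive(input: DataFrame): Dataset[Artist] = {
  import input.sparkSession.implicits._ // provides the Encoder[Artist] needed by .as
  input
    .select("id", "name", "location") // keep only the columns of the domain class
    .as[Artist]
}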

This is a very DRY-unfriendly implementation, as each modification of the Artist class requires searching for all related select instances and updating them. Fortunately, with a bit of reflection, it can be refactored into a generic solution. This generic transformation will trim the Dataset to contain only the expected columns.

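import org.apache.spark.sql.{DataFrame, Dataset, Encoders, functions => F}
import scala.reflect.runtime.universe._

def toTDefensive[T <: Product: TypeTag](input: DataFrame): Dataset[T] = {
  // Names of the case class fields, read via reflection.
  val caseClassFields = typeOf[T].members
    .collect { case m: MethodSymbol if m.isCaseAccessor => m.name.toString }
    .toSeq

  // members usually yields the accessors in reverse declaration order,
  // so reversing restores the field order of the case class.
  val columns = caseClassFields
    .map(F.col _)
    .reverse

  input
    .select(columns: _*)
    .as[T](Encoders.product[T]) // explicit encoder, so no implicits import is needed
}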


Monitoring Airflow jobs with TIG 2: data quality metrics

Amadeusz Kosik — Tue, 30 Apr 2024 13:31:53 +0000

In the first article on Monitoring Airflow jobs with TIG, “System Metrics”, we have seen an example of Airflow installation with a TIG stack set up to monitor it. To fully utilize this stack, we should enrich the raw system metrics with statistics on the processed data. Without this, the metrics would tell if the data pipelines are doing anything but not whether they are working on the correct data.

What to look for?

What can be realistically monitored is a pretty deep topic without a one-size-fits-all answer. A safe starting point is to look at the size of the data, duplicates (or unique rows), null/missing column values and basic aggregates (count per some enumerated type or min/max values). The nice part is that you are not limited by any software: you can report any numeric value into an InfluxDB database.
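
As a minimal sketch of such starting-point metrics, the function below computes the row count, the unique row count and the number of rows with a missing value. It works on an in-memory list of dictionaries with made-up data; in a real pipeline the same numbers would more likely come from a SQL query or a Spark job.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def data_metrics(rows: List[Row]) -> Dict[str, int]:
    """Row count, unique row count and number of rows with at least one null."""
    unique_rows = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    null_rows = sum(1 for row in rows if any(v is None for v in row.values()))
    return {
        "row_count": len(rows),
        "unique_row_count": len(unique_rows),
        "null_rows": null_rows,
    }


sample = [
    {"id": "1", "country": "PL"},
    {"id": "1", "country": "PL"},  # duplicate
    {"id": "2", "country": None},  # missing value
]
print(data_metrics(sample))  # {'row_count': 3, 'unique_row_count': 2, 'null_rows': 1}
```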

Equally important is not limiting the monitoring to the output of the whole pipeline only. Being able to check the data volume and basic traits on the input and in intermediate steps is crucial, as it enables one to check, identify and react to problems early on (and avoid painful backtracking and recomputing of the whole pipeline).

An example data metrics dashboard showing row count, unique row count, and null rows for three steps in the imaginary data pipeline: load, process, and export. A local demo is available to run on our GitHub.

Computing the metrics

Technically speaking, such monitoring requires two things in the pipeline: code (or a job) to compute the metric and a wrapper to send it to the metrics database. We do not cover the former here – it may vary from a simple SQL query run via Hive / Impala to a side output of a Spark job.

Storing the data for graphs

The second part to be done in Airflow is sending the data to the database. At the time of writing this article, the built-in InfluxDB connector allows only querying the database. Please see our demo (especially the plugins directory) for an example implementation of InfluxDB write. You can also use the REST API or BashOperator to call the influx command there.
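
Whichever route you pick, the payload sent to InfluxDB is a line in its line protocol. The sketch below only renders such a line and does not send it; the measurement, tag and field names are hypothetical, it handles integer fields only, and it is not the implementation from our demo.

```python
from typing import Dict, Optional


def _escape(text: str) -> str:
    """Escape the characters that are special in tag and field keys and tag values."""
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line_protocol(measurement: str, tags: Dict[str, str], fields: Dict[str, int],
                     timestamp_ns: Optional[int] = None) -> str:
    """Render one point as: measurement,tag=value field=1i[,field=2i] [timestamp]."""
    name = measurement.replace(",", r"\,").replace(" ", r"\ ")
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={v}i" for k, v in fields.items())
    line = f"{name}{tag_part} {field_part}"
    return line if timestamp_ns is None else f"{line} {timestamp_ns}"


print(to_line_protocol(
    "data_quality",                    # hypothetical measurement name
    {"dag": "demo", "step": "load"},  # hypothetical tags
    {"row_count": 3, "null_rows": 1},
))
# data_quality,dag=demo,step=load row_count=3i,null_rows=1i
```

Tagging each point with the pipeline step makes it possible to plot the same metric for the input, the intermediate steps and the output side by side.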

Merging both steps or not?

Both compute and send metrics steps may be squashed into a single bash step instead of scheduling them separately and stitching them via XComs. However, the more complicated or time-consuming the calculation may be, the better the separated approach would seem. This is a decision for you to make; we provide an example of the former approach.

Summary

After the first step, the example stack has monitoring of the system, and an operator can see whether the system is working and does not have an overload or some kind of bottleneck. This step adds a base monitoring of the data quality. Adding those on multiple points in the data pipeline will also enable verification during the processing – in a centralized place (or, in this case, WebUI). Once again, a development/demo environment is available on our GitHub.


Monitoring Airflow jobs with TIG 1: system metrics

Amadeusz Kosik — Tue, 16 Apr 2024 10:25:34 +0000

Like many server applications, Airflow can – and should – be monitored for metrics and logs. In this article, we will look into the former and the integration with the TIG stack. The goal is to pull example metrics into a time series database and visualize them in a web application. This article will focus on AF system metrics, like its performance, load or health. We will cover the topic of reporting the data metrics into the same database in an upcoming article, “Monitoring Airflow jobs with TIG 2: data quality metrics”. So, stay tuned.

Why?

The Airflow itself finally got a Cluster status dashboard built-in. Therefore, it is valid to question the need for a separate dashboard that requires additional effort and maintenance. Why bother, then? There are several possible reasons to introduce a separate monitoring stack:

Aggregated view: looking into one app is easier than browsing through 10 of them. TIG (or any other monitoring stack) can be used for monitoring multiple instances and apps and is easily integrated with custom ones.
Security: there is no need to give access to the AF itself (or any other application) to be able to see the metrics and pinpoint errors. It is aligned with the data mesh and any other data democratization approach.
Support friendly: your 1st line support has a starting point to check the status of the data processing and be able to call the right people in case of a problem.

The toolkit

Out of the box (but with the right pip packages) Airflow supports sending its internal metrics to a statsd server. We can leverage that and set up such a server via Telegraf to proxy the metrics further into a time series database: InfluxDB. As a convenient UI, Grafana can be used.
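
For orientation, a statsd metric is just a short text line, usually sent over UDP. The sketch below formats such a line by hand. The host and port defaults are assumptions, and in practice Airflow's statsd client produces and sends these lines for you.

```python
import socket


def statsd_line(prefix: str, name: str, value: float, metric_type: str) -> str:
    """Format one metric in the plain-text statsd format: <name>:<value>|<type>."""
    if metric_type not in {"c", "g", "ms"}:  # counter, gauge, timer
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{prefix}.{name}:{value}|{metric_type}"


def send_udp(line: str, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget delivery over UDP (assumed host and port of the statsd input)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


print(statsd_line("airflow", "executor.running_tasks", 4, "g"))
# airflow.executor.running_tasks:4|g
```

Telegraf's statsd input parses lines of this shape and forwards the values to InfluxDB.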

TIG stack configuration

Firstly, let’s configure the TIG stack to accept the metrics:

Install an InfluxDB instance. Nothing fancy here.
Install the Telegraf. In the configuration, look for the statsd input (which has to be enabled) and InfluxDB output. The latter will require at least authentication. Note that the Telegraf supports many more inputs and allows monitoring of, e.g. system resources (CPU, memory, disk space, etc.). This is a good idea to configure in a production environment.
Install Grafana and configure the data source to point it to the InfluxDB.

We have prepared a docker compose-based demo in the GitHub repository with a pre-configured environment. You can see there the example configuration for Telegraf (telegraf.conf), InfluxDB (influxdb.env) and Grafana (grafana-provisioning/datasources). Please note that this is only for dev/demo purposes, and real production environments need to be set up more securely (including the use of HTTPS and secure password handling).

Airflow

On the Airflow side, there are two significant points to be addressed.

Airflow requires the apache-airflow[statsd] package to have the statsd client available – you can install it via pip.
In the Airflow configuration file (usually: airflow.cfg), the [metrics] section must be configured to enable statsd, point it to the Telegraf instance and prefix the metrics (very useful in case of multiple Airflow instances).
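
The second point can be sketched by rendering the relevant section with Python's configparser. The option names are the ones Airflow uses in its [metrics] section; the host, port and prefix values are assumptions and need to match your own Telegraf setup.

```python
import configparser
import io

config = configparser.ConfigParser()
config["metrics"] = {
    "statsd_on": "True",
    "statsd_host": "telegraf",        # assumed hostname of the Telegraf instance
    "statsd_port": "8125",            # assumed port of Telegraf's statsd input
    "statsd_prefix": "airflow_prod",  # assumed prefix, one per Airflow instance
}

buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())  # the [metrics] section, ready to be merged into the config file
```

Giving each Airflow instance its own prefix keeps their metrics apart once they land in the same database.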

There are several metrics available to be sent in Airflow. You can see the complete list in the Airflow documentation here. There, you can also limit which ones are actually reported to the Telegraf. Be aware that even though most of the metrics are said to be reported in seconds, you need to validate them yourself (as in this example).

Dashboards

The last step is to configure Grafana and set up some dashboards. The UI offers a WYSIWYG editor that you can use to tailor it to your needs. The example available on the GitHub might serve as a starting point, as it shows:

state of the Airflow executors (queues and tasks being run at the moment) to see whether any processing is going on,
state of the pools (default and the custom ones) to check potential bottlenecks if you use multiple pools,
task times to identify unexpected stragglers (and compare instances’ run times and find any performance challenges early on).

An example dashboard, available on our GitHub. It contains metrics useful to monitor the load on the Airflow instance and identify potential overload or bottlenecks in the data pipeline processing.

Summary

In conclusion, for Airflow monitoring, you can use specialized metrics tools like TIG stack and aggregate all the metrics from multiple AF instances. The stack can accommodate AF system metrics and data from other applications, including your custom ones. An example of sending data quality metrics is what we will look into in the second part. There is a demo environment on our GitHub to see whether this works for you.


Databricks – Photon

Amadeusz Kosik — Tue, 02 Apr 2024 12:52:56 +0000

The Databricks platform offers two execution engines for the clients: the standard Apache Spark (available as an open-source application) and one with Photon enhancement that brings a performance improvement (as well as extra pricing). Have you ever wondered where this speedup comes from and how it affects designing Apache Spark jobs?

This article is based on the Berkeley paper on Photon written by Databricks people. That publication is the closest to the source, as the Photon engine is not an open-source project and its source code is not available to the public.

The general idea

In a nutshell, Photon replaces the standard (bundled) query engine available in Apache Spark, using the same API. For some operations, mostly CPU-heavy ones, the Catalyst optimizer may decide to send the job to Photon instead of using the default execution path, all for performance reasons. In other words, it is an alternative way to execute Spark DAG tasks, not skipping/reordering/reorganizing the data engine.

SIMD

SIMD stands for Single Instruction, Multiple Data and is one of the fundamental ideas behind job optimisations in Photon. On current architectures, running the same operation on multiple instances of data (values, rows, etc.) can be optimised via vectorisation, even within a single thread. Photon is said to utilise those optimisations.

C++ and Code generation

The Photon engine is implemented in C++ instead of the JVM native languages (Scala, Java or others). The source paper points to performance reasons, including ‘hitting performance ceilings’. The communication with the rest of the Apache Spark is implemented via JNI. Databricks’ internal benchmarks indicate that the performance hit due to moving data in and out of JVM is not noticeable.

Internal data format

The Photon engine uses a columnar data representation (same as, e.g. the Parquet data format) instead of row data (like the rest of Apache Spark). This is due to the SIMD optimizations: the kernel implementations work best on columnar data. The memory management (allocating, freeing, etc.) is still done via Apache Spark’s memory manager. The data is kept off-heap, so transferring it from Photon to Spark does not require copying the data.
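
The difference between the two layouts can be shown with a small sketch. It illustrates the memory layout only: Python does not vectorise these loops, and the data and function names are made up.

```python
from array import array

# Row layout: one record after another, with mixed field types side by side.
rows = [(1, 10.0), (2, 20.5), (3, 30.25)]  # (id, amount), made-up data

# Columnar layout: each field in its own contiguous, single-typed buffer.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
amounts = array("d", (r[1] for r in rows))  # 64-bit floats


def sum_amounts_rowwise(data) -> float:
    """Has to step over every record and pick one field out of it."""
    return sum(record[1] for record in data)


def sum_amounts_columnar(column: array) -> float:
    """Runs the same operation over one tightly packed buffer."""
    return sum(column)


print(sum_amounts_rowwise(rows), sum_amounts_columnar(amounts))  # 60.75 60.75
```

In the columnar case, all values of one field sit next to each other in memory, which is the shape a SIMD instruction needs in order to process several of them at once.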

When a shuffle operation is necessary, Photon writes a shuffle file and uses Spark API to execute the exchange. However, the data format is not compatible with vanilla Spark, and a Photon shuffle read must follow the Photon shuffle write.

When does it help?

Photon is meant to address CPU-heavy loads. This includes joins (especially hash joins) and aggregations. On the other hand, being a non-JVM implementation, Photon does not support UDFs or the RDD API. Exact benchmarks and precise speedups are mentioned in the source paper.

Sources

Source paper: Photon: A Fast Query Engine for Lakehouse Systems

I hope this helps. Moreover, if you know any other good sources, do let us know on social media so that everyone can see them.


Airflow — pools and mutexes.

Amadeusz Kosik — Tue, 19 Mar 2024 13:47:32 +0000

Although the ideal data pipeline is made of idempotent and independent tasks, there are some cases when setting up a mutex (a.k.a. part of the job that cannot be run concurrently) is necessary. Fortunately, Airflow supports such cases and offers a few tools varying by complexity to implement such a pipeline.

In this article, we will look at the following DAG in AF. The graph itself is relatively simple; the catch is that load_1 and load_2 operators cannot have concurrently running task instances. We will look at treating loads separately and looking at them as a group.

Therefore, there are three scenarios we will look into:

Only one instance of the operator can be running.
An instance of the operator cannot run before the older ones are successful.
Only one instance from a group of operators can be running.

The examples use the decorator (annotation) syntax of Airflow. The concept stays the same for the 1.0-compatible approach: instead of passing the parameters to the decorator, pass them to the constructor of any Operator class.

Mutex on an operator

The first scenario is that only one operator instance can be running at a time. If there are multiple runs ready to be scheduled, it does not matter which one goes first. One solution would be using the max_active_tis_per_dag option with the value of 1.

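from airflow.decorators import task

@task(max_active_tis_per_dag=1)  # at most one running instance of this task across all DAG runs
def load_1():
    ...  # the load logic goes here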

Dependency on past runs

The second example assumes that the nth batch cannot start before the (n-1)th one is completed successfully (or at least marked so in Airflow). For this use case, AF offers the depends_on_past flag. In this case, you have to be careful and pay some attention to the state of the latest runs. One failed, upstream-failed, or waiting task can halt all future DAG runs.

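from airflow.decorators import task

@task(depends_on_past=True)  # wait until this task has succeeded in the previous DAG run
def load_1():
    ...  # the load logic goes here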

Pools – mutex across multiple operators

The most complex approach is required if we need to group multiple operators to make them share a mutex. One way to do it is to put them into one custom pool and limit it to accommodate only one task simultaneously – either by setting a pool with 1 slot or assigning a high number of required slots to each operator.

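from airflow.decorators import task
from airflow.models import Pool

# A pool with a single slot; newer Airflow releases may require extra arguments here.
load_pool = Pool.create_or_update_pool(
   name="load_pool",
   slots=1,
   description="Pool for data load tasks."
)

@task(pool=load_pool.pool)  # both tasks compete for the single slot of load_pool
def load_1():
    ...

@task(pool=load_pool.pool)
def load_2():
    ...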
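
The effect of a one-slot pool can be pictured with a plain-Python analogy in which a semaphore plays the role of the pool. This is an illustration of the idea only, not how the Airflow scheduler is implemented.

```python
import threading
import time

pool_slots = threading.Semaphore(1)  # stands in for a pool with 1 slot
state_lock = threading.Lock()
running = 0
max_running = 0
finished = []


def load(name: str) -> None:
    """A stand-in for a load task that must hold a pool slot while it runs."""
    global running, max_running
    with pool_slots:  # wait for a free slot
        with state_lock:
            running += 1
            max_running = max(max_running, running)
        time.sleep(0.01)  # pretend to load data
        with state_lock:
            running -= 1
            finished.append(name)


threads = [threading.Thread(target=load, args=(n,)) for n in ("load_1", "load_2")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(max_running)  # 1
```

Both tasks finish, but never at the same time, which is exactly the mutex behaviour the shared pool gives to a group of operators.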

Source

The source code for all examples and the docker environment to run them is available on GitHub.


Passing information between DAGs in Airflow.

Amadeusz Kosik — Thu, 08 Feb 2024 13:34:54 +0000

There are data pipelines where you must pass some values between tasks – not complete datasets, but a few kilobytes of data. This can be managed even within the Airflow itself. As always, multiple options are available – let’s review some of them.

In this article, we are looking at sharing data between DAGs, which are connected via run dependencies. Let’s assume that each DAG needs to be run daily, and the first DAG generates some important data for the second DAG.

XCom

XCom would be the first and the recommended approach. It works well with out-of-the-box features like ExternalTaskSensor:


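from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="xcom-sink",
    schedule_interval="@daily",
    start_date=datetime(2023, 7, 1),
    catchup=True,
) as xcom_sink:
    # Wait for the source task of the same logical date, then read its XCom value.
    ExternalTaskSensor(
        task_id="wait-for-dependency",
        external_dag_id="xcom-source",
        external_task_id="update-hive-table-events-triggers"
    ) >> BashOperator(
        task_id="show-xcom",
        bash_command="echo {{ ti.xcom_pull(dag_id='xcom-source', task_ids='update-hive-table-events-triggers') }}"
    )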

XCom vs run id

The XCom is identified by the DAG ID, task ID and run ID. If you want to run a single DAG with a custom run ID, you have to ensure there is an XCom value for that run ID already created. This complicates issuing manual runs.
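
The consequence can be shown with a toy model in which a dictionary keyed by (DAG ID, task ID, run ID) stands in for the XCom storage. The run IDs and the stored value are made-up examples.

```python
from typing import Dict, Optional, Tuple

XComKey = Tuple[str, str, str]  # (dag_id, task_id, run_id)

# A toy stand-in for the XCom table.
xcom_store: Dict[XComKey, str] = {
    ("xcom-source", "update-hive-table-events-triggers", "scheduled__2023-07-01"): "events_v42",
}


def xcom_lookup(dag_id: str, task_id: str, run_id: str) -> Optional[str]:
    """Return the stored value for the exact (dag_id, task_id, run_id) triple, if any."""
    return xcom_store.get((dag_id, task_id, run_id))


task = "update-hive-table-events-triggers"
print(xcom_lookup("xcom-source", task, "scheduled__2023-07-01"))  # events_v42
print(xcom_lookup("xcom-source", task, "manual__my-custom-run"))  # None
```

A manual run with a custom run ID finds nothing unless a value has been stored for that very run ID first.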

When not to XCom?

Rather than discussing cases that are tailored to use XCom, let’s focus on examples that are not well supported. Basically, XCom does not work well with Datasets, and you might get some quirky results here: https://github.com/apache/airflow/discussions/33069.

With datasets, you need to refer to the last past value of XCom, effectively losing all benefits of tight coupling. Using that parameter, you will need to deal with race conditions: if the not-latest sink DAG is restarted, it will receive an incorrect value from the source. Moreover, let’s consider a design with multiple sources and a single sink:

Suppose the events dataset gets updated twice and the users’ dataset only once. In that case, the events-with-users will receive only the latest value of the former one. It is up to you to decide whether this behaviour is expected or unacceptable.

Variables

In rare cases, you can use Variables from Airflow to solve the issue. Variables are a built-in mechanism in Airflow that provides a global state (or configuration) for all DAGs to read and write. You can use them to implement the Registry code pattern:

Put a hashmap of the date of run -> metadata into a variable
Use the execution date (named logical_date in the newer versions) as the hashmap key.
Use PythonOperator to update the variable. Read can be done by either Python code or templating.
Use sensors or datasets to schedule DAGs in the correct order.
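
The steps above can be sketched as follows. A plain dictionary stands in for Airflow's Variable storage so that the example runs anywhere; in a DAG, the reads and writes would go through Variable.get and Variable.set instead. The variable name and the metadata are hypothetical.

```python
import json
from typing import Dict, Optional

# Stand-in for Airflow's Variable storage: variable name -> serialized string.
variable_store: Dict[str, str] = {}
REGISTRY_KEY = "events_registry"  # hypothetical variable name


def register(logical_date: str, metadata: dict) -> None:
    """Add or overwrite the metadata for one run date inside the JSON map."""
    registry = json.loads(variable_store.get(REGISTRY_KEY, "{}"))
    registry[logical_date] = metadata
    variable_store[REGISTRY_KEY] = json.dumps(registry, sort_keys=True)


def lookup(logical_date: str) -> Optional[dict]:
    """Read the metadata for one run date; None if that run never registered."""
    return json.loads(variable_store.get(REGISTRY_KEY, "{}")).get(logical_date)


register("2023-07-01", {"path": "/data/events/2023-07-01"})  # made-up metadata
register("2023-07-02", {"path": "/data/events/2023-07-02"})
print(lookup("2023-07-01"))  # {'path': '/data/events/2023-07-01'}
print(lookup("2023-07-03"))  # None
```

Note that the read-modify-write in register is not atomic, so two runs updating the registry at the same time can overwrite each other's entries.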

Beware!

Before you go with the variable route, please keep in mind that compared to XComs, variables are way more costly to maintain. You might need to keep track of the variables’ sizes, implement error handling in your PythonOperators and be mindful of any unsolicited changes in the variables’ values.

External system?

There is always an option of using an external data service to synchronise, similarly to using built-in variables. This may come in the form of data paths on HDFS / S3 / other storage, batch load dates in the database or such. However, this solution creates an inferior design, as you would end up with disadvantages of the variables approach and new implicit dependencies between DAGs and external services.

Summary

Approach	XCom + dataset	XCom + sensor	Variables	External
Pros	no tight coupling between DAGs	supports 1:1 task run relationships	works with both sensors and datasets	works even with 3rd party systems
Cons	race conditions; no guarantee for 1:1 run relationships	a bit tighter coupling; for a manual run, the exec_date must be supplied by hand	handles only data sharing, does not handle orchestration; requires way more effort than XCom	requires even more effort; hidden dependencies outside AF; complicated architecture

Comparison

Source

The source code for both XCom and Variable examples and the docker environment to run them is available on GitHub.


NeMo-Guardrails

Bartek Sadlej — Tue, 05 Dec 2023 12:10:04 +0000

Building a dedicated chatbot is both challenging and dangerous. At company X, the model should talk about X’s offer and, ideally, nothing else to save cost, not block throughput, and be sure not to insult anyone. It would also be nice to meet all of those requirements while not sacrificing the chatbot’s performance.

The field of LLM-powered bots is new and rapidly evolving, so many different solutions have emerged, but one of them caught our attention: Nvidia NeMo-Guardrails. Its core value is the ability to define rails to guide conversations while being able to connect an LLM to other services seamlessly and securely.

You can check out how to get started using the examples and user guide on its GitHub page, but since it is very new and, at the time of this writing, the current release is alpha 0.5, there are not many resources online on how to build more complex applications. At TantusData, we’ve been using it a lot recently and want to share a few practical tips.

Agenda:

How to make it work with a model of your choice
Multiple bot actions and responses per one user message and output formatting
Two chat histories: one for displaying to the user, different to guide the model
General tips

How to make it work with a model of your choice

We will use Mistral-7B-Instruct to illustrate that. The advantage of using this open-source model is that it comes with an official Docker image, which you can use to self-host it, and the API schema follows the one from OpenAI, so it is super easy to integrate. You can also use this Docker image with any other model from HuggingFace. If it is a gated one, such as Llama-2, remember to run Docker with -e HF_TOKEN=… to get access.

There are two things to cover here—connection to the model and prompting.

The connection consists of two parts: config and implementation. The bare minimum implementation follows the LangChain LLM interface and should be put in the `config.py` file, with an additional line registering it in Guardrails:

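# config.py
from typing import Any, List, Optional

import openai  # this example uses the pre-1.0 openai client (openai.Completion)
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from nemoguardrails.llm.providers import register_llm_provider


class Mistral7BInstruct(LLM):
    model: str
    endpoint_url: str
    # also useful to define: temperature ~ 0.0, max_tokens ~ 2K, frequency_penalty ~ 1.

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        # no API key is used for the self-hosted endpoint; point the client at it
        openai.api_key = None
        openai.api_base = self.endpoint_url

        response = openai.Completion.create(
            model=self.model,
            prompt=prompt,
            stop=stop,
            **kwargs,
        )

        return response.choices[0].text

    @property
    def _identifying_params(self):
        # parameters that identify this LLM instance
        return {"model": self.model, "endpoint_url": self.endpoint_url}

    @property
    def _llm_type(self):
        # must be a string label; the label itself is arbitrary
        return "mistral-7b-instruct"


register_llm_provider("my_engine_name", Mistral7BInstruct)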

Then you need to specify the engine and parameters in the `config.yml` file.

models:
  - type: main
    engine: my_engine_name
    parameters:
      model: mistralai/Mistral-7B-Instruct-v0.1
      endpoint_url: ...

The next thing to cover is prompts.

At the moment, NeMo-Guardrails works best with `text-davinci-003` (a GPT-3.5-series completion model). More recent OpenAI models expect different prompts to create structured output, whereas open-source models need stricter instructions on what to do; they won’t automatically spot the pattern in two examples and follow it.

The main challenge is generating the user intent given the current input and the definitions provided in `*.co` files. Prompts are already implemented for some engines, and there are general ones that are used if the engine is not explicitly supported. The problem with the general ones is that they lack an explicit instruction on what to do and, as we noticed, less powerful models usually try to respond to the user input instead of following the intent pattern.

The solution for Mistral is to include an explicit instruction.

- task: generate_user_intent
  content: |-
    """
    {{ general_instruction }}
    You must write only user intent as shown in the example. Do not respond to the user. Do not write anything else.
    """
    ...

Multiple bot actions and responses per one user message and output formatting

There are situations when we want to execute more than one action or value extraction per round and combine all outputs into the final response. The reason not to just write a wrapper function which will do everything at once is the ability to later easily filter or modify some parts from history, which gets automatically inserted into the model prompt. I will cover that in the next section.

Also, when you want to include line breaks for better formatting, the default definition will not help: a line break will either not be visible or be displayed as a literal ‘\n’, and all bot messages from one round will simply get concatenated.

Now, let’s look at what an example flow might look like:

define bot answer_with_cited_document provide answer
    "Answer: $answer_with_cited_document \n\n"

define bot matches_in_db found matches
    "Found matches: $matches_in_db"

define flow answer with cited documents
    user ask question
    $answer_with_cited_document = ...
    bot $answer_with_cited_document provide answer
    $cited_documents = ...
    $matches_in_db = execute db_search(cited_documents=$cited_documents)
    bot $matches_in_db inform found matches

Here, the printed bot answer after concatenation will look as follows:

“Answer: $answer_with_cited_document \n\nFound matches: $matches_in_db”

The way to achieve better formatting might look like this:

define bot formatted_answer print answer
    "$formatted_answer"

define flow answer with cited documents
    user ask question
    $answer_with_cited_document = ...
    $cited_documents = ...
    $matches_in_db = execute db_search(cited_documents=$cited_documents)
    $formatted_answer = execute format_answer(ans=$answer_with_cited_document, docs=$matches_in_db)
    bot formatted_answer print answer
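As a minimal sketch of what the `format_answer` action could look like: the shape of `docs` (a list of dicts with `title`, `url` and `score` keys) is an assumption made for illustration, and registering the function as a Guardrails action is not shown here.

```python
from typing import Dict, List


def format_answer(ans: str, docs: List[Dict]) -> str:
    """Combine the model answer and the DB matches into one message.

    Assumes each match is a dict with 'title', 'url' and 'score' keys;
    adjust this to whatever your db_search action really returns.
    """
    lines = [f"Answer: {ans}", "", "Found matches:"]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"{i}. {doc['title']} ({doc['url']}), score: {doc['score']:.2f}")
    if not docs:
        lines.append("(none)")
    # Real line breaks are produced here, in Python, not in the colang template.
    return "\n".join(lines)


example = format_answer(
    "The TV subscription costs 200 USD per year.",
    [{"title": "TV offer", "url": "https://example.com/tv", "score": 0.912}],
)
print(example)
```

Because the whole message is built in one place, the bot definition only has to print a single variable, and the formatting stays under your control.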

Two chat histories: one for displaying to the user, different to guide the model

The most straightforward reason to do something with history at some point is that LLMs have limited context. But apart from that, you should understand what gets inserted into the model’s prompt and whether you are wasting tokens unnecessarily.

By default, NeMo Guardrails inserts the action output into the prompt in the following format:

execute db_search
# The result was /* Full result returned from action here */

If our db_search returns a massive JSON, we have a problem. In the long run, it will fill up the context, but even before that, it can distract the model from paying attention to relevant parts.

It depends on the particular use case, but suppose all you want to do is display the results with, e.g., links and scores. Left unchanged, the search results will be inserted into the prompt twice: once after the action execution and a second time as the final bot answer. If you use an additional action for output formatting, they will be inserted even three times!

We can take advantage of filters to adjust that.

In general prompts, you can find templates like this:

# prompts_general.yml

{{ history | colang }}

This takes the whole history and renders it into the prompt as colang.

To filter or modify some events, you can add a custom filter like this:

# config.py
from copy import deepcopy
from typing import List

from nemoguardrails import LLMRails


def shorten(return_value):
    # placeholder: put your own logic for shrinking the action output here
    return return_value


def modify_actions(events: List[dict]) -> List[dict]:
    events = deepcopy(events)

    # filter formatting since we will see the exact same string as the final bot answer
    events = [
        event for event in events
        if not (event['type'] == 'InternalSystemActionFinished' and event['action_name'] == 'format_answer')
    ]

    for event in events:
        if event['type'] == 'InternalSystemActionFinished' and event['action_name'] == "your_action_name_here":
            event['return_value'] = shorten(event['return_value'])

    return events


def init(llm_rails: LLMRails):
    llm_rails.register_filter(modify_actions, "modify_actions")

And then use it in prompts like this:

# prompts_general.yml

{{ history | modify_actions | colang }}

We use deepcopy because in-place dictionary modifications such as my_dict['key'] = val change the object passed to the function. Without the copy, we would have to check in later chat rounds whether the value has already been modified.

Sometimes, it does make sense to clean up the whole history, for example when a user wants to start from the beginning and send a new request. Without history cleaning, previously entered information might produce incorrect prompts and cause irrelevant search results. To achieve that, we define the following user intent in the colang file:
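The effect of the copy can be seen in a small standalone sketch. The event dictionaries mimic the keys used in the filter above; the `trim_results` helper and its `copy_first` switch are hypothetical and exist only to compare both behaviours side by side.

```python
from copy import deepcopy
from typing import List


def trim_results(events: List[dict], copy_first: bool) -> List[dict]:
    """Keep only the first search hit in every db_search result."""
    if copy_first:
        events = deepcopy(events)
    for event in events:
        if event["type"] == "InternalSystemActionFinished" and event["action_name"] == "db_search":
            event["return_value"] = event["return_value"][:1]
    return events


def make_history() -> List[dict]:
    return [{
        "type": "InternalSystemActionFinished",
        "action_name": "db_search",
        "return_value": ["hit A", "hit B", "hit C"],
    }]


# Without a copy, the stored history itself is changed by the filter.
history = make_history()
trim_results(history, copy_first=False)
mutated_len = len(history[0]["return_value"])

# With deepcopy, the stored history stays intact; only the prompt view is trimmed.
history = make_history()
prompt_view = trim_results(history, copy_first=True)
kept_len = len(history[0]["return_value"])
view_len = len(prompt_view[0]["return_value"])

print(mutated_len, kept_len, view_len)  # 1 3 1
```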

define user start new search
    "I'd like to start a new search"
    "May I look for something different"
    "I want to try another conditions"
    "Forget all I asked before"

After that, you can extend the modify_actions function with the following snippet:

    history = []
    for event in events:
        history.append(event)
        if event['type'] == "UserIntent" and event['intent'] == "start new search":
            # drop everything seen so far, including this intent event
            history.clear()
    return history
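The extract above can be tried in isolation. The sketch below wraps the same loop in a standalone function (the name `reset_on_new_search` and the sample events are made up for illustration) and runs it on a short event list.

```python
from typing import List


def reset_on_new_search(events: List[dict]) -> List[dict]:
    """Return only the events that follow the latest 'start new search' intent."""
    history: List[dict] = []
    for event in events:
        history.append(event)
        if event["type"] == "UserIntent" and event.get("intent") == "start new search":
            history.clear()
    return history


events = [
    {"type": "UserIntent", "intent": "ask question"},
    {"type": "InternalSystemActionFinished", "action_name": "db_search", "return_value": "old"},
    {"type": "UserIntent", "intent": "start new search"},
    {"type": "UserIntent", "intent": "ask question"},
]
print(reset_on_new_search(events))
# [{'type': 'UserIntent', 'intent': 'ask question'}]
```

Note that the reset intent itself is dropped as well, so the model prompt starts again from the first message after it.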

General tips

Also, when working with NeMo Guardrails, you may find these tips useful.

Use chat mode instead of server mode when developing. It makes errors easier to spot and highlights output in verbose mode.
Take advantage of Python’s logging module. Guardrails prints a lot in verbose mode, and configuring different output files for different modules makes reading much more convenient.
When using a custom LLM, explicitly log its inputs and outputs, as this is the most fragile part of Guardrails. If your model does not follow the colang pattern for getting the user intent, you can’t move forward.
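A minimal sketch of the two logging tips, using only the standard library. The logger names `nemoguardrails` and `my_app.llm` are assumptions (library loggers are usually named after their package, so check what your installed version really uses), and the lambda is a stand-in for a real model call.

```python
import logging
import os
import tempfile


def add_file_log(logger_name: str, path: str, level: int = logging.DEBUG) -> logging.Logger:
    """Send everything logged under `logger_name` to its own file."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
# Assumed logger names: one file for the library output, one for raw LLM traffic.
add_file_log("nemoguardrails", os.path.join(log_dir, "guardrails.log"))
llm_log = add_file_log("my_app.llm", os.path.join(log_dir, "llm_io.log"))


def logged_call(llm_call, prompt: str) -> str:
    """Wrap any prompt -> completion callable and log both sides."""
    llm_log.debug("PROMPT:\n%s", prompt)
    completion = llm_call(prompt)
    llm_log.debug("COMPLETION:\n%s", completion)
    return completion


# Stand-in for a real model call, just to show the wrapper in action.
result = logged_call(lambda p: "user ask question", "What does the TV offer cost?")
```

With the prompt and the completion stored next to each other, it is much quicker to see whether the model produced an intent or started answering the user directly.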

The post NeMo-Guardrails appeared first on TantusData.

What if the data is too large for the LLM context?

Bartek Sadlej — Thu, 28 Sep 2023 11:00:00 +0000

In the previous article, we covered extracting information from unstructured data. However, this is just the tip of the iceberg. Another problem can arise when you have long documents which don’t fit into the embedding model context length. The natural move in such a situation is splitting the documents into multiple parts. Another reason for using this technique is when the entire document does not create good enough embeddings. Last but not least, you might want to extract smaller chunks in order to lower the token usage.

The subject of the document or paragraph is usually at the beginning of the section. It does not show up in the latter parts of the document, so it is likely that when we just split the document into multiple parts, we end up with lots of documents that lack contextual information, for example:

Subscription prices for 1 month:
20 USD / month


Subscription prices for 1 year:
200 USD / year

In the snippet above, we see the price, but we lack information about what the price is for (a TV subscription? a broadband subscription?).

When we query a vector database and provide the result to the LLM application, we will likely see that this document seems relevant to TV, broadband or mobile subscription requests. The reason is that we get a high cosine similarity score for any query related to the subscription price. So here we go:

from typing import List

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
from langchain.schema.retriever import BaseRetriever

# the context-free chunk from the snippet above
doc = Document(
    page_content=(
        "Subscription prices for 1 month:\n20 USD / month\n\n"
        "Subscription prices for 1 year:\n200 USD / year"
    )
)


class ConstRetriever(BaseRetriever):
    # always returns the same document, regardless of the query
    def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
        return [doc]


llm = ChatOpenAI(model_name="gpt-4")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)

offers = ["TV", "Internet", "Car", "Gym membership"]

for offer in offers:
    res = qa(f"What is the {offer} subscription price for one year?")['result']
    print(res)

The subscription price for one year is 200 USD.
The subscription price for one year is 200 USD.
The context does not provide information on the car subscription price for one year.
The context provided does not specify what the subscription prices are for, such as a gym membership. Therefore, I can’t provide the exact price for a gym membership subscription for one year.

All those queries have ~0.85-0.9 cosine similarity with the example document. This document is ‘close enough’ and gets provided as input to the LLM, which then has to decide how to answer the question. Moreover, the document content is not enough to say what the price is for, so the best you can expect is ‘I don’t know’, which at least does not make up information the model does not have in the first place. That answer is not satisfying either: we do have the information about the prices, and we would like the chat to answer with it. We just have to find a better way of providing it with the correct information.

How do we tackle this problem?

After splitting, the most straightforward solution is to include additional context for each part. For example, we can add “Details for the TV offer:” if those prices come from such an offer. It helps with solving hallucination problems, but the similarity score may remain high for such documents, which can prevent the retriever from fetching the most relevant documents. In that case, the model will answer that it does not have enough context to answer the question.
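A minimal sketch of this first approach, assuming the document is split on blank lines and that a single context line applies to every chunk. The `split_with_context` helper is hypothetical; real splitters usually also enforce a maximum chunk size.

```python
from typing import List


def split_with_context(text: str, context: str) -> List[str]:
    """Split on blank lines and prefix every chunk with the same context line."""
    chunks = [part.strip() for part in text.split("\n\n") if part.strip()]
    return [f"{context}\n{chunk}" for chunk in chunks]


document = """Subscription prices for 1 month:
20 USD / month

Subscription prices for 1 year:
200 USD / year"""

chunks = split_with_context(document, "Details for the TV offer:")
for chunk in chunks:
    print(chunk)
    print("---")
```

Each chunk now says which offer it belongs to, so the LLM can tell that the 200 USD price is not about broadband, even if the retriever still returns the chunk for a broadband question.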

Another solution is to include document metadata and use a feature called self-query.

Here, instead of including context information directly in the document text, we set it as an additional filtering index and use the LLM to produce a relevant query.

The difference is that even though the document has a high similarity score, it will not get fetched, and the retriever can provide genuinely relevant data sources. In other words, instead of relying on the vector store to provide relevant documents only by their embeddings’ similarity to the query, we add an extra index and provide the LLM with its description. The model can then decide whether to use it and with what arguments.

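from langchain.chains import RetrievalQA
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.vectorstores import Weaviate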


embeddings = OpenAIEmbeddings()


page_content="""
Subscription prices for 1 month:
20 USD / month


Subscription prices for 1 year:
200 USD / year
"""


docs = [
   Document(
       page_content=page_content,
       metadata={
           "product": "TV",
       },
   ),
]
vectorstore = Weaviate.from_documents(
   docs, embeddings, weaviate_url="http://127.0.0.1:8080"
)


metadata_field_info = [
   AttributeInfo(
       name="product",
       description="The name of the product for which the subscription prices are",
       type="string",
   ),
]
document_content_description = "Details for all the products offers"
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0., verbose=True)
retriever = SelfQueryRetriever.from_llm(
   llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
qa_with_self_query = RetrievalQA.from_llm(llm, retriever=retriever, return_source_documents=True)


for offer in ["TV", "Internet", "Car", "Gym membership"]:
    res = qa_with_self_query(f"What is the {offer} subscription price for one year?")
    print(f"n docs: {len(res['source_documents'])}, answer: {res['result']}")

This is the result produced by the code above:

query='TV subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='TV') limit=None
n docs: 1, answer: The TV subscription price for one year is 200 USD.
query='Internet subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Internet') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information …
query='Car subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Car') limit=None
n docs: 0, answer: I'm sorry, but I don't have enough information …
query='Gym membership subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Gym membership') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information for gym memberships. …
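To see why the filter changes what gets fetched, here is a toy sketch with no LLM and no vector store. Keyword overlap stands in for embedding similarity, and the `product` filter is passed by hand, whereas in the self-query setup it is the LLM that produces it. All names are made up for illustration.

```python
from typing import Dict, List, Optional

Doc = Dict[str, object]

docs: List[Doc] = [
    {"text": "Subscription prices for 1 year: 200 USD / year", "product": "TV"},
]


def retrieve(query_terms: List[str], product: Optional[str]) -> List[Doc]:
    """Toy retriever: optional exact-match metadata filter, then keyword overlap."""
    candidates = docs if product is None else [d for d in docs if d["product"] == product]
    return [
        d for d in candidates
        if any(term.lower() in str(d["text"]).lower() for term in query_terms)
    ]


terms = ["subscription", "price"]
# Similarity alone: the TV document comes back for any subscription question.
print(len(retrieve(terms, product=None)))        # 1
# With the metadata filter, an Internet question fetches nothing.
print(len(retrieve(terms, product="Internet")))  # 0
print(len(retrieve(terms, product="TV")))        # 1
```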

The drawback of this approach is that it is significantly more expensive, because additional LLM calls are needed to provide this functionality. With LangChain’s OpenAI callback, we can easily monitor the API usage: the first solution costs roughly $0.00015 per question, whereas the second costs roughly $0.0015, a 10x increase.

We also have to keep in mind that creating metadata for split documents might not be trivial and may need human supervision.

All things considered, it’s no surprise that an LLM will only be as good as the data you provide to it: the more detailed and relevant information you can provide, the higher the chance of getting a good response. Self-querying is a powerful technique which might be useful in the project you are working on. The exact decision on how to provide metadata and whether to use self-querying depends on the specific business problem to be solved.
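A quick back-of-the-envelope sketch of what that difference means at scale. The per-question costs are the ones quoted above; the volume of 100,000 questions per month is an arbitrary figure chosen for illustration, not a measurement.

```python
from decimal import Decimal

# Per-question costs quoted in the text (USD).
plain_retrieval = Decimal("0.00015")
self_query = Decimal("0.0015")

ratio = self_query / plain_retrieval


def monthly_cost(cost_per_question: Decimal, questions_per_month: int) -> Decimal:
    """Total monthly spend for a given question volume."""
    return cost_per_question * questions_per_month


# Hypothetical volume, only to make the gap tangible.
volume = 100_000
print(f"ratio: {ratio:.0f}x")
print(f"plain retrieval: {monthly_cost(plain_retrieval, volume):.2f} USD / month")
print(f"self-query:      {monthly_cost(self_query, volume):.2f} USD / month")
```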

The post What if the data is too large for the LLM context? appeared first on TantusData.