The technicalities 5 December 2023

NeMo-Guardrails

Author
Bartek Sadlej

Data Engineer


Building a dedicated chatbot is both challenging and risky. At company X, the model should talk about X's offer and, ideally, nothing else: to save cost, to keep throughput up, and to make sure it never insults anyone. It would also be nice to meet all of those requirements without sacrificing the chatbot's performance.

The field of LLM-powered bots is new and rapidly evolving, so many different solutions have emerged, but one of them caught our attention: Nvidia NeMo-Guardrails. Its core value is the ability to define rails to guide conversations while being able to connect an LLM to other services seamlessly and securely. 

You can check out how to get started using the examples and user guide on its GitHub page, but since it is very new and, at the time of this writing, the current release is alpha 0.5, there are not many resources online on how to build more complex applications. At TantusData, we’ve been using it a lot recently and want to share a few practical tips.

Agenda:

  • How to make it work with a model of your choice
  • Multiple bot actions and responses per user message, and output formatting
  • Two chat histories: one displayed to the user, another to guide the model
  • General tips

How to make it work with a model of your choice

We will use Mistral-7B-Instruct to illustrate this. The advantage of this open-source model is that it comes with an official Docker image you can use to self-host it, and its API schema follows OpenAI's, so it is very easy to integrate. You can also use this Docker image with any other model from HuggingFace; if it is a gated one, such as Llama-2, remember to run Docker with -e HF_TOKEN=… to get access.
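Since the endpoint speaks the OpenAI completions schema, you can sanity-check the deployment before wiring it into Guardrails. A minimal sketch using only the standard library; the endpoint URL is a placeholder for wherever you deployed the container:

```python
import json
import urllib.request


def build_completion_request(endpoint_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a request against an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{endpoint_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000/v1",  # placeholder: your self-hosted endpoint
    "mistralai/Mistral-7B-Instruct-v0.1",
    "[INST] What is NeMo-Guardrails? [/INST]",
)
# send with urllib.request.urlopen(req) once the container is running
```

If the response comes back with a `choices[0].text` field, the endpoint is compatible with the integration shown below.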

There are two things to cover here: the connection to the model and prompting.

The connection consists of two parts: config and implementation. The bare minimum implementation follows the LangChain LLM interface. It should be placed in the `config.py` file, with an additional line registering it with Guardrails:

# config.py
from typing import Any, List, Optional

import openai
from langchain.llms.base import LLM
from langchain.callbacks.manager import CallbackManagerForLLMRun
from nemoguardrails.llm.providers import register_llm_provider


class Mistral7BInstruct(LLM):
    model: str
    endpoint_url: str
    # also useful to define: temperature ~ 0.0, max_tokens ~ 2K, frequency_penalty ~ 1.0

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        # the self-hosted endpoint does not validate the key
        openai.api_key = "EMPTY"
        openai.api_base = self.endpoint_url

        response = openai.Completion.create(
            model=self.model,
            prompt=prompt,
            stop=stop,
            **kwargs,
        )
        return response.choices[0].text

    @property
    def _identifying_params(self):
        ...

    @property
    def _llm_type(self) -> str:
        return "mistral_7b_instruct"


register_llm_provider("my_engine_name", Mistral7BInstruct)

Then you need to specify the engine and parameters in the `config.yml` file:

models:
  - type: main
    engine: my_engine_name
    parameters:
      model: mistralai/Mistral-7B-Instruct-v0.1
      endpoint_url: ...

The next thing to cover is prompts.

As of now, NeMo-Guardrails works best with `text-davinci-003` (an older GPT-3.5 completion model). More recent OpenAI models expect different prompts to produce structured output, whereas open-source models need stricter instructions on what to do: they won't automatically spot the pattern from two examples and follow it.

The main challenge is generating the user intent given the current input and the definitions provided in `*.co` files. Prompts are already implemented for some engines, and there are general ones used when an engine has no explicit implementation. The problem is that they lack explicit instruction on what to do; as we noticed, less powerful models, instead of following the intent pattern, tend to go ahead and respond to the user input directly.

The solution for Mistral is to include an explicit instruction in the prompt override:

- task: generate_user_intent
  content: |-
    """
    {{ general_instruction }}
    You must write only user intent as shown in the example. Do not respond to the user. Do not write anything else.
    """
    ...

Multiple bot actions and responses per user message, and output formatting

There are situations when we want to execute more than one action or value extraction per round and combine all outputs into the final response. The reason not to simply write one wrapper function that does everything at once is that separate actions make it easy to later filter or modify parts of the history, which gets automatically inserted into the model prompt. I will cover that in the next section.

Also, with the default message definitions, line breaks you add for better formatting will either not be visible or be displayed as a literal '\n', and all bot messages from one round will simply get concatenated.

Now, let's look at what an example flow might look like:

define bot answer_with_cited_document provide answer
  "Answer: $answer_with_cited_document \n\n"

define bot matches_in_db found matches
  "Found matches: $matches_in_db"

define flow answer with cited documents
  user ask question
  $answer_with_cited_document = ...
  bot $answer_with_cited_document provide answer
  $cited_documents = ...
  $matches_in_db = execute db_search(cited_documents=$cited_documents)
  bot $matches_in_db inform found matches

Here, the printed bot answer after concatenation will look as follows:

“Answer: $answer_with_cited_document \n\nFound matches: $matches_in_db”

The way to achieve better formatting might look like this:

define bot formatted_answer print answer
  "$formatted_answer"

define flow answer with cited documents
  user ask question
  $answer_with_cited_document = ...
  $cited_documents = ...
  $matches_in_db = execute db_search(cited_documents=$cited_documents)
  $formatted_answer = execute format_answer(ans=$answer_with_cited_document, docs=$matches_in_db)
  bot formatted_answer print answer
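The format_answer action itself can be an ordinary Python function; in config.py you would register it with llm_rails.register_action(format_answer, "format_answer"). Here is a sketch, assuming db_search returns a list of dicts with url and score fields (the field names are illustrative, not the library's):

```python
from typing import List


def format_answer(ans: str, docs: List[dict]) -> str:
    """Combine the model's answer and the DB matches into one display string."""
    lines = [f"Answer: {ans}", ""]
    for doc in docs:
        # assumed shape of a db_search result; adjust to your own schema
        lines.append(f"- {doc['url']} (score: {doc['score']:.2f})")
    return "\n".join(lines)


formatted = format_answer(
    "Guardrails lets you define conversational rails.",
    [{"url": "https://example.com/doc1", "score": 0.91}],
)
```

Because the function returns a single already-formatted string, the line breaks survive exactly as written instead of being mangled by message concatenation.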

Two chat histories: one displayed to the user, another to guide the model

The most immediate reason we must do something with history at some point is that LLMs have limited context. But beyond that, one should understand what gets inserted into the model's prompt and whether tokens are being wasted unnecessarily.

By default, NeMo-Guardrails inserts the action output into the prompt in the following format:

execute db_search
# The result was /* Full result returned from action here */

If our db_search returns a massive JSON payload, we have a problem. In the long run it will fill up the context, but even before that, it can distract the model from paying attention to the relevant parts.

It depends on the particular use case, but suppose all you want to do is display the results with, e.g., links and scores. Left unchanged, the search results will be inserted into the prompt twice: once after action execution and a second time as the final bot answer. And if you use an additional action for output formatting, even three times!
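For instance, if links and scores are all we want to show, a db_search result can be cut down to just those fields before it ever reaches the prompt. A minimal sketch (the result schema here is made up for illustration):

```python
def trim_search_result(result: dict, max_matches: int = 3) -> dict:
    """Keep only the fields relevant for the conversation, drop the rest."""
    return {
        "matches": [
            {"title": m["title"], "score": m["score"]}  # assumed field names
            for m in result.get("matches", [])[:max_matches]
        ]
    }


raw = {
    "matches": [
        {"title": "Doc A", "score": 0.9, "body": "very long text ..." * 100},
        {"title": "Doc B", "score": 0.7, "body": "more long text ..." * 100},
    ],
    "debug": {"latency_ms": 42, "shards": [1, 2, 3]},
}
trimmed = trim_search_result(raw)
```

A helper like this is exactly the kind of logic that belongs in the per-event modification step described next.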

We can take advantage of filters to adjust that.

In general prompts, you can find templates like this:

# prompts_general.yml

{{ history | colang }}

This filter chain takes the whole history and renders it into the prompt in Colang.

To filter or modify some events, you can add a custom filter like this:

# config.py
from copy import deepcopy
from typing import List

from nemoguardrails import LLMRails


def modify_actions(events: List[dict]) -> List[dict]:
    events = deepcopy(events)

    # filter the formatting action, since the exact same string
    # will appear again as the final bot answer
    events = [
        event for event in events
        if not (event['type'] == 'InternalSystemActionFinished'
                and event['action_name'] == 'format_answer')
    ]

    for event in events:
        if (event['type'] == 'InternalSystemActionFinished'
                and event['action_name'] == "your_action_name_here"):
            # replace modify(...) with your own trimming or summarization logic
            event['return_value'] = modify(event['return_value'])

    return events


def init(llm_rails: LLMRails):
    llm_rails.register_filter(modify_actions, "modify_actions")

And then use it in prompts like this:

# prompts_general.yml

{{ history | modify_actions | colang }}

We use deepcopy because Python dictionaries are passed by reference: a modification like my_dict['key'] = val mutates the variable passed to the function. Without the copy, in later chat rounds we would have to check whether a value had already been modified.
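A tiny self-contained illustration of the aliasing problem: without deepcopy, modifying an event mutates the caller's list as well.

```python
from copy import deepcopy

# in-place modification: the caller's list is mutated too
events = [{"type": "InternalSystemActionFinished", "return_value": {"big": "payload"}}]
aliased = events
aliased[0]["return_value"] = "trimmed"
assert events[0]["return_value"] == "trimmed"  # original changed!

# with deepcopy: the original events stay intact
events = [{"type": "InternalSystemActionFinished", "return_value": {"big": "payload"}}]
copied = deepcopy(events)
copied[0]["return_value"] = "trimmed"
# events[0]["return_value"] is still {"big": "payload"}
```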

Sometimes it does make sense to clean up the whole history, for example when the user wants to start from scratch with a new request. Without history cleaning, previously entered information might leak into the prompt and cause irrelevant search results. To achieve that, we define the following flow in the Colang file:

define user start new search
    "I'd like to start a new search"
    "May I look for something different"
    "I want to try other conditions"
    "Forget all I asked before"

After that, you can extend the modify_actions function with the following snippet:

    history = []
    for event in events:
        history.append(event)
        if event['type'] == "UserIntent" and event['intent'] == "start new search":
            history.clear()
    return history

General tips

When working with NeMo-Guardrails, you may also find these tips useful:

  • Use chat mode instead of server mode when developing. It makes errors easier to spot and highlights the output in verbose mode.
  • Take advantage of Python's logging module. Guardrails prints a lot in verbose mode, and configuring different output files for different modules makes reading much more convenient.
  • When using a custom LLM, explicitly log its inputs and outputs, as this is the most fragile part of the setup. If your model does not follow the Colang pattern for generating the user intent, you can't move forward.
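As a sketch of the second tip, Python's logging module can route Guardrails' verbose output to its own file while your application logs go elsewhere. The file and logger names below are arbitrary choices:

```python
import logging


def setup_logging(guardrails_log: str = "guardrails.log",
                  app_log: str = "app.log") -> None:
    # everything emitted under the nemoguardrails namespace goes to its own file
    gr_logger = logging.getLogger("nemoguardrails")
    gr_logger.setLevel(logging.DEBUG)
    gr_logger.addHandler(logging.FileHandler(guardrails_log, delay=True))
    gr_logger.propagate = False  # keep Guardrails noise out of the root logger

    # our own modules log to a separate file ("my_chatbot" is a hypothetical name)
    app_logger = logging.getLogger("my_chatbot")
    app_logger.setLevel(logging.INFO)
    app_logger.addHandler(logging.FileHandler(app_log, delay=True))


setup_logging()
```

With propagate disabled, Guardrails' debug output no longer drowns out your own logs on the console.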