The technicalities 8 min read 4 March 2024

What you need to know before deploying Open Source LLM

Bartek Sadlej

Data Engineer


There are a few key questions which need to be thoroughly understood and answered before selecting a large language model to be used for building an application:

  • License – because you don’t want to end up in a legal trap
  • Expectations: accuracy, speed and cost tradeoffs
  • Understanding of benchmarks the model was evaluated on – so you don’t get surprised when evaluating the model with your users on your data
  • Deployment options – because building a PoC you run on your laptop is often far from production deployment.


This sounds easy; open source is open, as the name suggests. Well, not exactly. Ensure that the model you choose can be used as you want. For example, there is a statement in the Llama-2 license:

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

This means that if you start with Llama-2 or its fine-tuned successors and later, at some point in time, decide to switch to a different model, you are not allowed to use your historical data to train the new LLM. Or are you? If you were to modify one line of model code, it is no longer the original Llama Material and so on. In general, AI and code base product regulations are hard to assess and interpret, so it is probably safe to try finding a model with an Apache License or an even more permissive license first, if possible.

Define your expectation: accuracy, speed and cost tradeoffs.

It is tempting to dream big, especially for non-technical people who have seen the recent OpenAI Dev Day with the announcements of GPTs, Assistants and Google Gemini & Lumiere models. But in reality, meeting excessive expectations is challenging and often impossible. Going from 0% to 90% AI automation is difficult but doable; closing the gap between 90% and 100% is exceptionally demanding.

Think about Github Copilot. It won’t write a project for you, but its blazingly fast few-line completions, which usually require little adjusting, make engineers far more productive. 

Ask the question: does my model need to figure out the nitty-gritty details? Or leave some space for users to interact and fill the missing gaps while creating a valuable product.

Maybe some parts of the pipeline can be postponed and implemented as batch jobs? The cost reductions might be significant in this case since closed model providers don’t offer lower prices for batch requests. In-house LLM allows you to process tasks offline in batches.

The recently released small Open Source LLM, such as Llama-2 and Mistral, or their further fine-tuned versions, like Zephyr and OpenHermes-2.5, are a perfect match for such a scenario. If you cannot compromise on accuracy, maybe there is a way to algorithmically fix weak spots. 

On the other hand, it might be valuable to provide users with a few different model outputs or allow them to iterate and guide the suggestions quickly, such as with GithubCopilot. GPT-4 is powerful, but it would take minutes to call it a few times. Smaller models allow you to do such things. Recent features from Hugging Face and Nvidia can run Llama-v2-13b with an unbelievable speed of 1200 tokens per second. 

Understand the benchmarks the model was evaluated on

When choosing the model, you will probably focus on its size, performance, and ‘the vibe’ – whether the model’s responses generally feel good. The performance is most often checked using the results of well-known benchmarks.

What are the weak spots of this approach?

First, the ML Labs releasing the model does not always publish the training data or even more precise information about what the model was trained on. It is often the case that we only see that ‘the model was trained on a well-curated corpus of X tokens’. And because those benchmarks are so popular, there is a possibility of some leakages into the training set. Not immediately the whole corpus, but for example, an automatic web crawl can contain conversations from Reddit or X/Twitter feeds about a particular task where people are discussing some parts of the benchmark.

Secondly, keep in mind that, in general, it is hard to benchmark written text automatically.  

To uncover that, it is crucial to understand how each of those benchmarks is created and what it measures. 

Let’s see an example question from one of the most popular ones, the MMLU (Massive Multitask Language Understanding):

Question: Glucose is transported into the muscle cell:

A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarboxylic acid transporters.

Correct answer: A

And let’s ask ChatGPT:

Good? The answer is just “A”, so even though the model gets it, automatic evaluation would score it as a failure!

Without doing a deep dive into how the evaluation is actually done, there is an excellent blog on HuggingFace explaining it in detail; you should just know that it requires taking bare following tokens’ probabilities and using the model through code in a different way than you would interact with it through chat on some WebUI.

So, the key takeaway is that while those benchmarks provide us with a general ranking of models’ performance, one should pay close attention to how they are evaluated and whether this form of evaluation is meaningful for their use case. 

For example, ChatGPT is a so-called Instruction-Finetuned model tuned to follow user instructions and interact with them. If you put a phrase:

Can you help me with that:
{arbitrary problem description}

It will very likely start it’s response with:

Certainly! {probably a good solution to your problem}

And if you were to check tokens probabilities for options A, B, C, and D from the above-mentioned MMLU example, as it is done in one implementation of MMLU, you would get C! But not because the model thinks the completion for the

Correct answer

Is C, but because it wants to start with Certainly!

Deployment options

Last but not least, let’s talk about inference. When you have chosen and maybe even fine-tuned your model further, it’s time to answer the question of what exactly you want to deploy and where.

The ‘what exactly’ part is essential. To start with, you probably have X billion parameters model in (b)float16. There are two options for improvement here: quantization and pruning.

Quantization converts some 16-bit weights into 8 or 4 bits so you can run the model on a smaller and cheaper GPU. Of course, by doing so, we lose some information and accuracy. It can be done automatically using some general formulas, or you can specify an evaluation dataset to quantize in a way that reduces some metrics the least.

It is important to note that currently, on most hardware, quantization reduces memory usage but reduces inference speed. Although the model weights are much smaller, some values must be cast back and forth. But it allows you to inference/fine-tune the model on cheaper hardware or just the available hardware since it might be hard for new players to get access to A100 & H100 clusters.

Both ways are available in the HuggingFace library and can be easily applied. Here, you can find the blog post going through their pros and cons and inference speed/memory comparison.

Pruning, on the other hand, works by completely removing some weights from the model.

It is important to remember that the transformer model under the hood does matrix multiplication, so you can just remove all entries close to zero and expect the performance to improve because it will cause some non-sequential memory accesses. A more gentle solution is needed. The PyTorch team has recently posted 2 blog posts about accelerating Generativ-AI, where they go into detail about available options.

Where to deploy?

Though for real-time chat applications, data centre deployment or on-premise, with high availability, there are some cost-saving techniques if you have offline steps in your data pipeline.

Currently, everyone runs LLM models on either A10 or A100 / H100, but surprisingly, not so many people know that cards from the RTX family are also a good performance choice for such applications.

Unfortunately, NVIDIA knows that, and they put the following statements in their license.

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

But there are companies like which offer RTX cards but with lower reliability than, for example, AWS ec2 instances, which you can use for offline data processing. The default filter for availability here is set to 90%, while on the AWS EC2 Service Level Agreement, commitment is 99.99%.