The technicalities 7 min read 28 September 2023

What if the data is too large for the LLM context?

Bartek Sadlej

Data Engineer


In the previous article, we covered extracting information from unstructured data. However, this is just the tip of the iceberg. Another problem arises when you have long documents that do not fit into the embedding model's context length. The natural move in such a situation is to split the documents into multiple parts. Another reason for using this technique is that embedding the entire document at once may not produce good enough embeddings. Last but not least, you might want to retrieve smaller chunks in order to lower the token usage.

The subject of a document or paragraph is usually stated at the beginning of the section and does not show up in the later parts. So when we simply split the document into multiple parts, we are likely to end up with many chunks that lack contextual information. For example:

Subscription prices for 1 month:
20 USD / month

Subscription prices for 1 year:
200 USD / year

In the snippet above, we see the price, but we lack information about what the price is for (a TV subscription? a broadband subscription?).
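To see how this loss of context happens, here is a minimal sketch of a deliberately naive splitter. The `split_on_blank_lines` helper and the "TV offer" heading are assumptions made for illustration; they are not part of the original example.

```python
from typing import List


def split_on_blank_lines(text: str) -> List[str]:
    """Split a document into chunks at blank lines (a deliberately naive splitter)."""
    chunks = [part.strip() for part in text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


# Hypothetical source document: the product name appears only in the heading.
document = (
    "TV offer\n"
    "\n"
    "Subscription prices for 1 month:\n"
    "20 USD / month\n"
    "\n"
    "Subscription prices for 1 year:\n"
    "200 USD / year\n"
)

chunks = split_on_blank_lines(document)
for chunk in chunks:
    # Only the heading chunk still says which product the prices belong to.
    print(repr(chunk), "| mentions TV:", "TV" in chunk)
```

Only the first chunk mentions the product; the chunks that actually carry the prices no longer say what they are prices for.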

When we query a vector database and provide the result to the LLM application, we will likely see that this document seems relevant to TV, broadband or mobile subscription requests. The reason is that we get a high cosine similarity score for any query related to the subscription price. So here we go:

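from typing import List

from langchain.chains import RetrievalQA
from langchain.docstore.document import Document
from langchain.schema.retriever import BaseRetriever
from langchain.chat_models import ChatOpenAI

# The example chunk from above: prices without the product they belong to
doc = Document(
   page_content="Subscription prices for 1 month:\n20 USD / month\n\n"
                "Subscription prices for 1 year:\n200 USD / year"
)

class ConstRetriever(BaseRetriever):
   """Retriever that always returns the same document, whatever the query."""
   def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
       return [doc]

llm = ChatOpenAI(model_name="gpt-4")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)

offers = ["TV", "Internet", "Car", "Gym membership"]

for offer in offers:
   res = qa(f"What is the {offer} subscription price for one year?")['result']
   print(res)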

  • The subscription price for one year is 200 USD.
  • The subscription price for one year is 200 USD.
  • The context does not provide information on the car subscription price for one year.
  • The context provided does not specify what the subscription prices are for, such as a gym membership. Therefore, I can’t provide the exact price for a gym membership subscription for one year.

All those queries have a cosine similarity of ~0.85-0.9 with the example document. The document is ‘close enough’, so it gets passed to the LLM, which then has to decide how to answer the question. If you think about it, the document content is not enough to say what the price is for, so the best we can expect is an ‘I don’t know’, which at least means the model does not make up information it never had. That answer is not satisfying either: we do have the information about the prices, and we would like the chat to answer with it. We just have to find a better way of providing the model with the correct information.
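For reference, cosine similarity can be computed as in the sketch below. The four-dimensional vectors are made-up toy values, not real embeddings; they only illustrate why a chunk that carries no product signal scores the same against queries about different products.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration). Dimensions, loosely:
# [subscription, price, "TV", "gym"]
chunk = [1.0, 1.0, 0.0, 0.0]      # talks about subscription prices only
query_tv = [1.0, 1.0, 0.5, 0.0]   # "TV subscription price"
query_gym = [1.0, 1.0, 0.0, 0.5]  # "gym membership subscription price"

print("TV query: ", cosine_similarity(chunk, query_tv))
print("Gym query:", cosine_similarity(chunk, query_gym))
```

Both queries get the same high score, because the chunk matches the "subscription price" part of each query and says nothing about the product.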

How do we tackle this problem?

The most straightforward solution is to add context to each part after splitting. For example, we can prepend “Details for the TV offer:” if those prices come from such an offer. This helps with hallucinations, but the similarity score between such documents and unrelated queries may remain high, which can prevent the retriever from fetching the most relevant documents. In that case, the model will answer that it does not have enough context to answer the question.
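A minimal sketch of this first approach could look as follows. The `add_context` helper is a hypothetical name introduced here, and the chunk texts reuse the subscription example.

```python
from typing import List


def add_context(chunks: List[str], context: str) -> List[str]:
    """Prepend the same context line to every chunk of a split document."""
    return [f"{context}\n{chunk}" for chunk in chunks]


chunks = [
    "Subscription prices for 1 month:\n20 USD / month",
    "Subscription prices for 1 year:\n200 USD / year",
]

contextualized = add_context(chunks, "Details for the TV offer:")
for chunk in contextualized:
    print(chunk)
    print("---")
```

Every chunk now states which offer it belongs to, so the LLM can tell a TV price from a gym price. The retriever, however, still ranks chunks purely by embedding similarity.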

Another solution is to include document metadata and use a feature called self-query.

Here, instead of including context information directly in the document text, we store it as an additional filtering index and use the LLM to produce a relevant query.

The difference is that an irrelevant document will not get fetched even if it has a high similarity score, so the retriever can provide genuinely relevant data sources. In other words, instead of relying on the vector store to select relevant documents only by their embeddings’ similarity to the query, we add an extra index and provide the LLM with its description. The model can then decide whether to use it and with what arguments.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

embeddings = OpenAIEmbeddings()

# Same text as before, but the product name now lives in the metadata
docs = [
    Document(
        page_content="""Subscription prices for 1 month:
20 USD / month

Subscription prices for 1 year:
200 USD / year""",
        metadata={
           "product": "TV",
        },
    ),
]
vectorstore = Weaviate.from_documents(
   docs, embeddings, weaviate_url=""
)

# Description of the metadata fields the LLM may filter on
metadata_field_info = [
    AttributeInfo(
       name="product",
       description="The name of the product for which the subscription prices are",
       type="string",
    ),
]
document_content_description = "Details for all the products offers"
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0., verbose=True)
retriever = SelfQueryRetriever.from_llm(
   llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
qa_with_self_query = RetrievalQA.from_llm(llm, retriever=retriever, return_source_documents=True)

for offer in ["TV", "Internet", "Car", "Gym membership"]:
    res = qa_with_self_query(f"What is the {offer} subscription price for one year?")
    print(f"n docs: {len(res['source_documents'])}, answer: {res['result']}")

This is the result produced by the code above:

query='TV subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='TV') limit=None
n docs: 1, answer: The TV subscription price for one year is 200 USD.
query='Internet subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Internet') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information
query='Car subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Car') limit=None
n docs: 0, answer: I'm sorry, but I don't have enough information
query='Gym membership subscription price' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='product', value='Gym membership') limit=None
n docs: 0, answer: I'm sorry, but I don't have access to specific pricing information for gym memberships.
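To make the filtering step concrete, here is a library-free sketch of what an equality filter on the `product` attribute does before any similarity ranking takes place. The `Doc` class and the `filter_eq` helper are hypothetical stand-ins, not LangChain or Weaviate APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for a stored chunk: text plus metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_eq(docs: List[Doc], attribute: str, value: str) -> List[Doc]:
    """Keep only documents whose metadata attribute equals the given value."""
    return [d for d in docs if d.metadata.get(attribute) == value]


docs = [
    Doc(
        text="Subscription prices for 1 year:\n200 USD / year",
        metadata={"product": "TV"},
    ),
]

for offer in ["TV", "Internet", "Car", "Gym membership"]:
    matches = filter_eq(docs, "product", offer)
    # Similarity search would only run over `matches`, not over all docs.
    print(f"{offer}: n docs: {len(matches)}")
```

Only the TV question keeps the document; for the other offers the candidate set is empty, however similar the text is to the query.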

The drawback of this approach is that it is significantly more expensive, because additional LLM calls are needed to construct the query and filter. With LangChain's OpenAI callback, we can easily monitor the API usage: the first solution costs ~0.00015 USD per question, whereas the second costs ~0.0015 USD, a tenfold increase.
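As a quick back-of-the-envelope check, the per-question figures above can be scaled to a workload. The question volume below is a hypothetical number chosen only for illustration.

```python
# Approximate per-question costs quoted in the text (USD)
COST_PLAIN_USD = 0.00015
COST_SELF_QUERY_USD = 0.0015


def total_cost(cost_per_question: float, n_questions: int) -> float:
    """Total API cost for a given number of questions."""
    return cost_per_question * n_questions


ratio = COST_SELF_QUERY_USD / COST_PLAIN_USD
n_questions = 100_000  # hypothetical workload, not a figure from the article

print(f"cost ratio: x{ratio:.0f}")
print(f"plain retrieval: {total_cost(COST_PLAIN_USD, n_questions):.2f} USD")
print(f"self-query:      {total_cost(COST_SELF_QUERY_USD, n_questions):.2f} USD")
```

The absolute amounts stay small per question, but the gap grows linearly with the number of questions.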

We also have to keep in mind that creating metadata for split documents might not be trivial and may need human supervision.

All things considered, it’s no surprise that an LLM will only be as good as the data you provide to it: the more detailed and relevant the information, the higher the chance of getting a good response. Self-querying is a powerful technique which might be useful in the project you are working on. The exact decision on how to provide metadata and whether to use self-querying depends on the specific business problem to be solved.

