LLM 9 min read 26 September 2023

What Data Format Is Suitable for an LLM?

Author
Bartek Sadlej

Data Engineer


Unpacking LLM: From Hype to Reality

Since the release of ChatGPT and, more recently, the Llama 2 models, it has become increasingly popular to build context-aware LLM applications. One such use case is Question Answering over documents. Many focus on cool prototype examples where, after feeding lots of Wikipedia articles to the vector database, one can confirm that Joe Biden is indeed the president of the United States. Only a few focus on the current limitations and unsolved problems that make it challenging to arrive at production-ready applications.

At TantusData, we have been paying close attention to finding the weak spots that need to be solved to provide the desired functionality. We will present them in the upcoming articles. In this article, we start with the challenges related to the data format.

The code examples currently use the most popular library for creating LLM applications: LangChain.

Data Format

Imagine that you are building a chatbot to answer users’ questions based on company offers.

The question we will be testing is: 

“What are the prices for the internet subscription?” 

Let’s assume for now that our database contains the document with relevant data:

TV Prices:


Subscription prices for 1 month:
L - 40 USD / month
M - 30 USD / month
S - 20 USD / month


Subscription prices for 1 year:
L - 400 USD / year
M - 300 USD / year
S - 200 USD / year


Internet Prices:


Subscription prices for 1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month


Subscription prices for 1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year


Phone prices:


Subscription prices for 1 month:
L - 18 USD / month
M - 12 USD / month
S - 6 USD / month


Subscription prices for 1 year:
L - 180 USD / year
M - 120 USD / year
S - 60 USD / year

And when we provide it as a context to the question, we get the desired answer:

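from typing import List

# Import paths as in the 2023 LangChain releases; newer versions may have moved them.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.schema import BaseRetriever, Document

# The document with the price list shown above
doc = Document(page_content=...)


class ConstRetriever(BaseRetriever):
    """Retriever that always returns the same document, whatever the query."""

    def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
        return [doc]


llm = ChatOpenAI(model_name="gpt-3.5-turbo")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)
print(qa("What are the prices for the internet subscription?")['result'])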

The code produces the following answer from the chat model:

The prices for the internet subscription are as follows:

1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month

1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year

Great, it worked. So are we good to go? Well, not really. The hidden problem is that we usually don’t have the relevant documents in such a nice text format. Usually, the data comes from scraping web pages or parsing PDFs, and it might originally be displayed as a table. When you think about it, the reasons are quite natural. The idea often comes from a business unit that would like to reduce customer service effort. Customer service works with these documents, which are easy for humans to read. So, when we just mimic what a person does with the document, we might put ourselves in a tricky situation.

What we can do is pick one of the available loaders in LangChain, but we can end up with text that is not so convenient for humans to read. Maybe it will be good enough for the model? Let’s see.

Let’s look at an example of a table in a PDF document.

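| Service  | Period | Subscription | Price   |
|----------|--------|--------------|---------|
| TV       | Month  | S            | 20 USD  |
|          |        | M            | 30 USD  |
|          |        | L            | 400 USD |
|          | Year   | S            | 200 USD |
|          |        | M            | 300 USD |
|          |        | L            | 400 USD |
| Internet | Month  | S            | 10 USD  |
|          |        | M            | 25 USD  |
|          |        | L            | 50 USD  |
|          | Year   | S            | 100 USD |
|          |        | M            | 250 USD |
|          |        | L            | 500 USD |
| Phone    | Month  | S            | 6 USD   |
|          |        | M            | 12 USD  |
|          |        | L            | 18 USD  |
|          | Year   | S            | 60 USD  |
|          |        | M            | 120 USD |
|          |        | L            | 180 USD |

Table 1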

When we try to extract the text information from the table, the output depends on the pdf loader we selected:

UnstructuredPDFLoader:

Service Period
month
TV
year
month
Internet
year
month
Phone
year
Subscription
Price
20 USD
30 USD
400 USD
200 USD
300 USD
400 USD
10 USD
25 USD
50 USD
100 USD
250 USD
500 USD
6 USD
12 USD
18 USD
60 USD
120 USD
180 USD

PDFMinerLoader:

Service
Period
Subscription
Price
TV
Internet
Phone
month
year
month
year
month
year
S
M
L
S
M
L
S
M
L
S
M
L
S
M
L
S
M
L
20 USD
30 USD
400 USD
200 USD
300 USD
400 USD
10 USD
25 USD
50 USD
100 USD
250 USD
500 USD
6 USD
12 USD
18 USD
60 USD
120 USD
180 USD

PDFPlumberLoader:

Service Period
Subscription Price
S 20 USD
month M 30 USD
L 400 USD
TV
S 200 USD
year M 300 USD
L 400 USD
S 10 USD
month M 25 USD
L 50 USD
Internet
S 100 USD
year M 250 USD
L 500 USD
S 6 USD
month M 12 USD
L 18 USD
Phone
S 60 USD
year M 120 USD
L 180 USD

PyPDFLoader:

Service Period
Subscription Price
TVmonthS 20 USD
M 30 USD
L 400 USD
yearS 200 USD
M 300 USD
L 400 USD
InternetmonthS 10 USD
M 25 USD
L 50 USD
yearS 100 USD
M 250 USD
L 500 USD
PhonemonthS 6 USD
M 12 USD
L 18 USD
yearS 60 USD
M 120 USD
L 180 USD

Table 2

As you probably noticed, there are significant differences in the results.

It is probably not what we would expect. However, let’s check if the model can still get the correct answer:

UnstructuredPDFLoader

gpt-3.5-turbo:
The prices for the internet subscription are as follows:
- 20 USD per month
- 200 USD per year
- 400 USD for 2 years

gpt-4:
The text doesn’t provide specific prices for an internet subscription.

PDFMinerLoader

gpt-3.5-turbo:
The prices for the internet subscription are as follows:
- Small (S) package: $20 per month or $200 per year
- Medium (M) package: $30 per month or $300 per year
- Large (L) package: $40 per month or $400 per year

gpt-4:
The prices for the internet subscription are:
- Small (S) size: 200 USD per month / 100 USD per year
- Medium (M) size: 300 USD per month / 250 USD per year
- Large (L) size: 400 USD per month / 500 USD per year

PDFPlumberLoader

gpt-3.5-turbo:
The prices for the internet subscription are as follows:
- S: 100 USD per year or 6 USD per month
- M: 250 USD per year or 12 USD per month
- L: 500 USD per year or 18 USD per month

gpt-4:
The prices for the internet subscription are:
For the S plan: 100 USD per year or 6 USD per month
For the M plan: 250 USD per year or 12 USD per month
For the L plan: 500 USD per year or 18 USD per month

PyPDFLoader

gpt-3.5-turbo:
The prices for the internet subscription are as follows:
- For the S (Small) plan:
  - Monthly subscription: 10 USD
  - Yearly subscription: 100 USD
- For the M (Medium) plan:
  - Monthly subscription: 25 USD
  - Yearly subscription: 250 USD
- For the L (Large) plan:
  - Monthly subscription: 50 USD
  - Yearly subscription: 500 USD

gpt-4:
The prices for the internet subscription are:
For a month:
- S: 10 USD
- M: 25 USD
- L: 50 USD
For a year:
- S: 100 USD
- M: 250 USD
- L: 500 USD

Table 3

As we can see, only one of the four loaders, PyPDFLoader, parsed the file in a way that the models could understand.
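The manual comparison of model answers against the expected prices can be automated. The following is a minimal sketch, not the procedure used for this article: it pulls every integer out of an answer and compares the resulting set with the expected internet prices. The names `prices_match` and `EXPECTED_INTERNET_PRICES` are hypothetical, and the check assumes that answers mention prices only as plain integers.

```python
import re

# Ground truth for the test question: the Internet rows of the price table.
EXPECTED_INTERNET_PRICES = frozenset({10, 25, 50, 100, 250, 500})


def prices_match(answer: str, expected: frozenset = EXPECTED_INTERNET_PRICES) -> bool:
    """Return True when the set of integers found in the answer equals the expected prices."""
    found = {int(number) for number in re.findall(r"\d+", answer)}
    return found == expected


# An answer with the correct prices and one that mixes in phone prices.
correct_answer = "For a month: S: 10 USD, M: 25 USD, L: 50 USD. For a year: S: 100 USD, M: 250 USD, L: 500 USD"
wrong_answer = "S: 100 USD per year or 6 USD per month, M: 250 USD per year or 12 USD per month"

print(prices_match(correct_answer))
print(prices_match(wrong_answer))
```

This check only looks at which numbers appear, not at whether each price is attached to the right plan and period, so it is a first filter rather than a full evaluation.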

That is why it is often not straightforward to create a reliable data source for a question-answering model. One should carefully investigate the extracted format, because it usually needs to be corrected. The problem is also easy to overlook, because the failure is silent: the model still returns a confident, well-formed answer.
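One way to correct the extracted format is to turn it back into explicit rows before handing it to the model. The sketch below assumes text shaped like the PyPDFLoader output shown earlier, where the service and period are fused onto the first row of each group and omitted afterwards. The function name `parse_rows`, the hard-coded service names, and the plan labels are assumptions made for this illustration; other loaders or documents would need different rules.

```python
import re
from typing import List, Optional, Tuple

# Optional service, optional period, plan label, price, e.g. "InternetmonthS 10 USD" or "M 25 USD".
ROW = re.compile(r"^(TV|Internet|Phone)?(month|year)?([SML]) (\d+) USD$")

Row = Tuple[Optional[str], Optional[str], str, int]


def parse_rows(text: str) -> List[Row]:
    """Rebuild (service, period, plan, price) rows, carrying the last seen service and period forward."""
    rows: List[Row] = []
    service: Optional[str] = None
    period: Optional[str] = None
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header line or anything that is not a price row
        service = match.group(1) or service
        period = match.group(2) or period
        rows.append((service, period, match.group(3), int(match.group(4))))
    return rows


sample = "\n".join([
    "Service Period",
    "Subscription Price",
    "TVmonthS 20 USD",
    "M 30 USD",
    "yearS 200 USD",
    "InternetmonthS 10 USD",
    "M 25 USD",
])

for row in parse_rows(sample):
    print(row)
```

Each parsed row carries its own service and period, so it can be rendered as an unambiguous sentence or line of text for the model.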

In summary: what can we do about the situation described?

  • First of all, careful engineering and spotting that the problem exists is a must – if you expected some shortcuts, I’m sorry to disappoint you. It’s very easy to build a system that hallucinates very convincingly.
  • Plan to test the result with domain experts.
  • Very likely, the PDF is generated from some billing system database, and relying on that database directly will simplify the data extraction and make it much easier to follow updates.
  • Even if you rely on a database, you still should have data quality checks in place.
  • Last but not least – remember that Generative AI does not solve all possible computer science problems. In the case above, we are struggling with searching for documents in the first place. This problem still needs careful engineering – only when you provide some reasonable structure can you benefit from the LLM magic – human-like interactions and understanding how the data fits into the question being asked.
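
The suggestion to rely on the underlying database, combined with a data quality check, can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the `PriceRow` record and the `render_prices` function are hypothetical, and a real billing system would have its own schema. The idea is to generate the clean text format from the beginning of the article directly from structured rows, which are assumed to arrive grouped by service and period.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PriceRow:
    service: str    # e.g. "Internet"
    period: str     # "month" or "year"
    plan: str       # e.g. "S", "M", "L"
    price_usd: int


def render_prices(rows: Iterable[PriceRow]) -> str:
    """Render grouped rows as plain text, adding a heading whenever the service or period changes."""
    lines: List[str] = []
    last_service: Optional[str] = None
    last_group: Optional[Tuple[str, str]] = None
    for row in rows:
        # Minimal data quality check: reject prices that cannot be right.
        if row.price_usd <= 0:
            raise ValueError(f"Suspicious price in {row}")
        if row.service != last_service:
            lines.append(f"{row.service} Prices:")
            last_service = row.service
        if (row.service, row.period) != last_group:
            lines.append(f"Subscription prices for 1 {row.period}:")
            last_group = (row.service, row.period)
        lines.append(f"{row.plan} - {row.price_usd} USD / {row.period}")
    return "\n".join(lines)


internet_rows = [
    PriceRow("Internet", "month", "L", 50),
    PriceRow("Internet", "month", "M", 25),
    PriceRow("Internet", "year", "L", 500),
]
print(render_prices(internet_rows))
```

Because the text is generated rather than extracted, a price change in the database flows into the model's context without re-parsing any PDF.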