What Data Format Is Suitable for an LLM?
Unpacking LLM: From Hype to Reality
Since the release of ChatGPT and, more recently, the Llama-v2 models, building context-aware LLM applications has become increasingly popular. One such use case is question answering over documents. Many articles focus on cool prototype examples where, after feeding lots of Wikipedia articles into a vector database, one can confirm that Joe Biden is indeed the president of the United States. Only a few focus on the current limitations and unsolved problems that make it challenging to arrive at production-ready applications.
At TantusData, we have been paying close attention to the weak spots that need to be addressed to deliver the desired functionality. We will present them in the upcoming articles. In this article, we start with the challenges related to the data format.
The code examples currently use the most popular library for creating LLM applications: LangChain.
Data Format
Imagine that you are building a chatbot to answer users’ questions based on company offers.
The question we will be testing is:
“What are the prices for the internet subscription?”
Let’s assume for now that our database contains the document with relevant data:
TV Prices:
Subscription prices for 1 month:
L - 40 USD / month
M - 30 USD / month
S - 20 USD / month
Subscription prices for 1 year:
L - 400 USD / year
M - 300 USD / year
S - 200 USD / year
Internet Prices:
Subscription prices for 1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month
Subscription prices for 1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year
Phone prices:
Subscription prices for 1 month:
L - 18 USD / month
M - 12 USD / month
S - 6 USD / month
Subscription prices for 1 year:
L - 180 USD / year
M - 120 USD / year
S - 60 USD / year
And when we provide it as a context to the question, we get the desired answer:
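from typing import List

# Import paths match older LangChain releases and may differ in newer ones.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.schema import BaseRetriever, Document

doc = Document(page_content=...)  # the pricing document shown above

class ConstRetriever(BaseRetriever):
    """Always returns the same document, regardless of the query."""
    def _get_relevant_documents(self, *args, **kwargs) -> List[Document]:
        return [doc]

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
retriever = ConstRetriever()
qa = RetrievalQA.from_llm(llm, retriever=retriever)
print(qa("What are the prices for the internet subscription?")["result"])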
The code produces the following answer from the model:
The prices for the internet subscription are as follows:
1 month:
L - 50 USD / month
M - 25 USD / month
S - 10 USD / month
1 year:
L - 500 USD / year
M - 250 USD / year
S - 100 USD / year
Great, it worked. So are we good to go? Well, not really. The hidden problem is that the relevant documents rarely come in such a clean text format. Usually, the data comes from scraping web pages or parsing PDFs, and it might originally be displayed as a table. When you think about it, the reasons are quite natural. The idea for such a chatbot often comes from a business unit that would like to reduce customer service effort. Customer service works with these documents, and they are easy for humans to read. So, when we simply mimic what a person does with the document, we might put ourselves in a tricky situation.
What we can do is pick one of the available loaders in LangChain, but we may end up with text that is not convenient for humans to read. Maybe it will be good enough for the model? Let’s see.
Let’s look at an example of a table in a PDF document:
Service | Period | Subscription | Price |
TV | Month | S | 20 USD |
 | | M | 30 USD |
 | | L | 400 USD |
 | Year | S | 200 USD |
 | | M | 300 USD |
 | | L | 400 USD |
Internet | Month | S | 10 USD |
 | | M | 25 USD |
 | | L | 50 USD |
 | Year | S | 100 USD |
 | | M | 250 USD |
 | | L | 500 USD |
Phone | Month | S | 6 USD |
 | | M | 12 USD |
 | | L | 18 USD |
 | Year | S | 60 USD |
 | | M | 120 USD |
 | | L | 180 USD |
When we try to extract the text from the table, the output depends on the PDF loader we select:
Loader | Extracted text |
UnstructuredPDFLoader | Service Period month TV year month Internet year month Phone year Subscription Price 20 USD 30 USD 400 USD 200 USD 300 USD 400 USD 10 USD 25 USD 50 USD 100 USD 250 USD 500 USD 6 USD 12 USD 18 USD 60 USD 120 USD 180 USD |
PDFMinerLoader | Service Period Subscription Price TV Internet Phone month year month year month year S M L S M L S M L S M L S M L S M L 20 USD 30 USD 400 USD 200 USD 300 USD 400 USD 10 USD 25 USD 50 USD 100 USD 250 USD 500 USD 6 USD 12 USD 18 USD 60 USD 120 USD 180 USD |
PDFPlumberLoader | Service Period Subscription Price S 20 USD month M 30 USD L 400 USD TV S 200 USD year M 300 USD L 400 USD S 10 USD month M 25 USD L 50 USD Internet S 100 USD year M 250 USD L 500 USD S 6 USD month M 12 USD L 18 USD Phone S 60 USD year M 120 USD L 180 USD |
PyPDFLoader | Service Period Subscription Price TVmonthS 20 USD M 30 USD L 400 USD yearS 200 USD M 300 USD L 400 USD InternetmonthS 10 USD M 25 USD L 50 USD yearS 100 USD M 250 USD L 500 USD PhonemonthS 6 USD M 12 USD L 18 USD yearS 60 USD M 120 USD L 180 USD |
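A comparison like the one above can be produced with a small helper. The sketch below is not the code used for this article; it only assumes that each loader is constructed from a file path and exposes a `load()` method returning objects with a `page_content` attribute, which is the interface LangChain document loaders follow. The function names are made up for illustration.

```python
from typing import Any, Callable, Dict


def extract_text(loader_factory: Callable[[str], Any], path: str) -> str:
    """Load `path` with one loader and join the page texts into a single string."""
    documents = loader_factory(path).load()
    return " ".join(doc.page_content for doc in documents)


def compare_loaders(loaders: Dict[str, Callable[[str], Any]], path: str) -> Dict[str, str]:
    """Run every loader on the same file and collect the extracted text by name."""
    return {name: extract_text(factory, path) for name, factory in loaders.items()}
```

With LangChain installed, `loaders` would map names such as "PyPDFLoader" to the corresponding loader classes (their import location depends on the LangChain version), and `path` would point to the PDF containing the table.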
As you probably noticed, the four loaders produce significantly different text, and it is probably not what we would expect. However, let’s check whether the model can still get the correct answer:
loader | gpt-3.5-turbo | gpt-4 |
UnstructuredPDFLoader | The prices for the internet subscription are as follows: – 20 USD per month – 200 USD per year – 400 USD for 2 years | The text doesn’t provide specific prices for an internet subscription. |
PDFMinerLoader | The prices for the internet subscription are as follows: – Small (S) package: $20 per month or $200 per year – Medium (M) package: $30 per month or $300 per year – Large (L) package: $40 per month or $400 per year | The prices for the internet subscription are: – Small (S) size: 200 USD per month / 100 USD per year – Medium (M) size: 300 USD per month / 250 USD per year – Large (L) size: 400 USD per month / 500 USD per year |
PDFPlumberLoader | The prices for the internet subscription are as follows: – S: 100 USD per year or 6 USD per month – M: 250 USD per year or 12 USD per month – L: 500 USD per year or 18 USD per month | The prices for the internet subscription are: For the S plan: 100 USD per year or 6 USD per month For the M plan: 250 USD per year or 12 USD per month For the L plan: 500 USD per year or 18 USD per month |
PyPDFLoader | The prices for the internet subscription are as follows: – For the S (Small) plan: – Monthly subscription: 10 USD – Yearly subscription: 100 USD – For the M (Medium) plan: – Monthly subscription: 25 USD – Yearly subscription: 250 USD – For the L (Large) plan: – Monthly subscription: 50 USD – Yearly subscription: 500 USD | The prices for the internet subscription are: For a month: – S: 10 USD – M: 25 USD – L: 50 USD For a year: – S: 100 USD – M: 250 USD – L: 500 USD |
As we can see, only one of the four loaders, PyPDFLoader, produced text from which both models could recover the correct prices.
That is why creating a reliable data source for a question-answering model is often not straightforward: one should carefully inspect the extracted format, because it usually needs to be corrected. The problem is also easy to overlook, because the pipeline fails silently: the model still returns a confident, well-formed answer.
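One way to make such failures less silent is to sanity-check the extracted text before it reaches the model. The sketch below is a rough heuristic tailored to this example table, not a general solution: it assumes we know the service names, the period names, the plan letters, and that 18 rows are expected. It reads the text in order and checks that every price comes with a plan letter and with a service and period seen earlier. The two sample strings are the PyPDFLoader and PDFPlumberLoader outputs from the table above.

```python
import re
from typing import List, NamedTuple, Optional

# Optional service and period labels, then a plan letter directly followed by a price.
ROW = re.compile(r"(TV|Internet|Phone)?\s*(month|year)?\s*([SML])\s+(\d+) USD")


class PriceRow(NamedTuple):
    service: Optional[str]
    period: Optional[str]
    plan: str
    price: int


def parse_rows(text: str) -> List[PriceRow]:
    """Read rows in text order, carrying the last seen service and period forward."""
    rows, service, period = [], None, None
    for match in ROW.finditer(text):
        service = match.group(1) or service
        period = match.group(2) or period
        rows.append(PriceRow(service, period, match.group(3), int(match.group(4))))
    return rows


def looks_well_ordered(text: str, expected_rows: int = 18) -> bool:
    """True if every expected row was found with a service and a period attached."""
    rows = parse_rows(text)
    return len(rows) == expected_rows and all(r.service and r.period for r in rows)


pypdf_text = (
    "Service Period Subscription Price TVmonthS 20 USD M 30 USD L 400 USD "
    "yearS 200 USD M 300 USD L 400 USD InternetmonthS 10 USD M 25 USD L 50 USD "
    "yearS 100 USD M 250 USD L 500 USD PhonemonthS 6 USD M 12 USD L 18 USD "
    "yearS 60 USD M 120 USD L 180 USD"
)
pdfplumber_text = (
    "Service Period Subscription Price S 20 USD month M 30 USD L 400 USD TV "
    "S 200 USD year M 300 USD L 400 USD S 10 USD month M 25 USD L 50 USD Internet "
    "S 100 USD year M 250 USD L 500 USD S 6 USD month M 12 USD L 18 USD Phone "
    "S 60 USD year M 120 USD L 180 USD"
)

print(looks_well_ordered(pypdf_text))       # True
print(looks_well_ordered(pdfplumber_text))  # False: the first price has no service yet
```

Passing this check does not prove that the values are right; it only shows that labels and prices appear in an order from which rows can be reconstructed. A failure, however, is a strong hint that the model will receive scrambled context.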
In summary – What can we do about the situation described?
- First of all, careful engineering and spotting that the problem exists is a must – if you expected some shortcuts, I’m sorry to disappoint you. It’s very easy to build a system that hallucinates very convincingly.
- Plan to test the result with domain experts.
- Very likely, the PDF is generated from some billing system database. Relying on that database directly will simplify the data extraction and make it much easier to follow updates.
- Even if you rely on a database, you still should have data quality checks in place.
- Last but not least – remember that Generative AI does not solve all possible computer science problems. In the case above, we are struggling with searching for documents in the first place. This problem still needs careful engineering – only when you provide some reasonable structure can you benefit from the LLM magic – human-like interactions and understanding how the data fits into the question being asked.
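The points about relying on a database and keeping data quality checks in place can be sketched as follows. This is an illustration under assumptions, not a description of a real billing system: the row layout `(service, period, plan, price)` is hypothetical, the rows are assumed to arrive sorted by service and period, and the rule that a yearly price must exceed the monthly price of the same plan is an assumed business rule. The rendered text follows the plain-text layout that worked well at the beginning of the article.

```python
from typing import Dict, List, Tuple

# Hypothetical database rows: (service, period, plan, price in USD).
Row = Tuple[str, str, str, int]

ROWS: List[Row] = [
    ("Internet", "month", "L", 50),
    ("Internet", "month", "M", 25),
    ("Internet", "month", "S", 10),
    ("Internet", "year", "L", 500),
    ("Internet", "year", "M", 250),
    ("Internet", "year", "S", 100),
]


def check_rows(rows: List[Row]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    prices: Dict[Tuple[str, str, str], int] = {}
    for service, period, plan, price in rows:
        key = (service, period, plan)
        if key in prices:
            problems.append(f"duplicate row: {service} {period} {plan}")
        if price <= 0:
            problems.append(f"non-positive price: {service} {period} {plan}")
        prices[key] = price
    # Assumed rule: a yearly plan costs more than one month of the same plan.
    for (service, period, plan), price in prices.items():
        monthly = prices.get((service, "month", plan))
        if period == "year" and monthly is not None and price <= monthly:
            problems.append(f"yearly price not above monthly: {service} {plan}")
    return problems


def render(rows: List[Row]) -> str:
    """Render rows as plain text with one heading per service and per period."""
    lines, last_service, last_period = [], None, None
    for service, period, plan, price in rows:
        if service != last_service:
            lines.append(f"{service} Prices:")
            last_service, last_period = service, None
        if period != last_period:
            lines.append(f"Subscription prices for 1 {period}:")
            last_period = period
        lines.append(f"{plan} - {price} USD / {period}")
    return "\n".join(lines)


if not check_rows(ROWS):
    print(render(ROWS))
```

A check of this kind would also surface the TV L plan from the PDF table above, which is listed at 400 USD for both the month and the year, so that a domain expert can confirm or correct it before the text is indexed.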