Proposed approaches for building LLMs that deliver precise citations

RAG applications have been extremely popular over the past few years. I was fortunate to join a project building one, and I’d like to share some experiences and findings (especially regarding LangChain) that I hope will be useful to anyone interested in building their own. I will mainly discuss citations and some solutions that can help reduce hallucinations.

To help with this, I want to emphasize the importance of having a robust evaluation system, as it enables you to implement and test different approaches quickly and to see whether they actually help.

It is also helpful to set an acceptable accuracy threshold for the LLM. Since we cannot expect the model to always return answers with the correct sources, a threshold saves time when trying many different changes, especially because each change can affect the others.

Tech stack: Next.js, LangChain, MongoDB Atlas Vector Search
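To make this concrete, here is a minimal sketch of the kind of evaluation script I mean, written in TypeScript to match the stack above. The questions, expected source files, and the `retrieveSources` helper are all hypothetical placeholders; substitute your own test cases and retrieval pipeline.

```typescript
// Minimal evaluation sketch: check how often the expected source document
// appears among the retrieved citations, and compare against a threshold.
// `retrieveSources` is a hypothetical helper wrapping your own retrieval chain.

type EvalCase = { question: string; expectedSource: string };

const cases: EvalCase[] = [
  { question: "Which programs cover dental care?", expectedSource: "benefits-dental.pdf" },
  { question: "How do I apply for housing assistance?", expectedSource: "housing-guide.pdf" },
];

const ACCURACY_THRESHOLD = 0.8; // the "acceptable" bar agreed on up front

export async function runEval(retrieveSources: (q: string) => Promise<string[]>) {
  let hits = 0;
  for (const c of cases) {
    const sources = await retrieveSources(c.question);
    if (sources.includes(c.expectedSource)) hits += 1;
  }
  const accuracy = hits / cases.length;
  console.log(`Retrieval accuracy: ${(accuracy * 100).toFixed(1)}%`);
  console.log(accuracy >= ACCURACY_THRESHOLD ? "PASS" : "FAIL");
}
```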

There are 3 main issues I have seen:

  1. The model completely fails to retrieve relevant files
  2. The model retrieves irrelevant files
  3. The model retrieves the correct files but not the correct subsection

Let’s explore each issue in detail and discuss potential solutions.

Issue 1: The model completely fails to retrieve relevant files

  • Exact Word Matching Limitation: the model struggles to retrieve relevant documents when the query doesn’t contain the exact words present in the target content.
  • When a keyword appears in multiple files, it introduces noise and hinders the retrieval of the most relevant document.
  • The model may focus on less important keywords in the query, leading to the retrieval of suboptimal files.

Solutions:

  • Edit the prompt to tell the assistant not to just search for the exact words of the question in the context documents, but also to look for words with similar meanings or words that are a subcategory of a broader concept

→ From my experience, prompt editing is highly sensitive: even small changes can significantly impact the results. I’d highly recommend making incremental adjustments, modifying only 1–2 sentences at a time.
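For illustration, the relevant part of a system prompt might look something like the snippet below. The wording is my own rough sketch, not a recommended template, and it will need iteration against your evaluation set.

```typescript
// One possible phrasing of the retrieval instruction in the system prompt.
// Small wording changes here can noticeably shift results, so adjust gradually.
const SYSTEM_PROMPT = `You are an assistant that answers questions using only the provided documents.
When matching the question against the documents, do not rely on exact word matches alone:
also consider synonyms, words with similar meanings, and terms that are subcategories
of a broader concept mentioned in the question.
If the documents do not contain the answer, say that you could not find it.`;
```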

  • Query Rewriting: leverage the LLM to rephrase the user’s query (for example, add an instruction to the prompt to use words that are more closely related to benefit programs)

→ LangChain provides default prompt handling capabilities, which can be a good starting point; consider customizing the prompt according to your unique requirements.
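A minimal query-rewriting step in LangChain.js might look like the sketch below. The model name, prompt wording, and the existing `retriever` are assumptions; adapt them to your own setup.

```typescript
// Sketch: rewrite the user's query with an LLM before hitting the vector store.
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const rewritePrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Rewrite the user's question so that it uses terminology commonly found in benefit program documents. Return only the rewritten question.",
  ],
  ["human", "{question}"],
]);

const rewriteChain = rewritePrompt
  .pipe(new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 }))
  .pipe(new StringOutputParser());

// The rewritten query is passed to the retriever instead of the raw user input.
const rewritten = await rewriteChain.invoke({
  question: "Can I get help paying for my kid's glasses?",
});
const docs = await retriever.invoke(rewritten); // `retriever` is assumed to exist already
```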

  • Generate multiple questions from different perspectives.

→ LangChain has MultiQueryRetriever for exactly this.
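A sketch of what that looks like in LangChain.js, assuming you already have a `vectorStore` (e.g. MongoDB Atlas Vector Search) set up:

```typescript
// Sketch: MultiQueryRetriever generates several rephrasings of the question,
// queries the vector store with each, and returns the deduplicated results.
import { ChatOpenAI } from "@langchain/openai";
import { MultiQueryRetriever } from "langchain/retrievers/multi_query";

const multiQueryRetriever = MultiQueryRetriever.fromLLM({
  llm: new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 }),
  retriever: vectorStore.asRetriever({ k: 4 }), // `vectorStore` is assumed to exist
});

const docs = await multiQueryRetriever.invoke(
  "Which benefit programs cover dental care?"
);
```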

Issue 2: The model retrieves irrelevant files

  • LangChain returns, in order of relevance, however many documents we ask it to retrieve for the LLM to use; the default is 4 documents. The best-matching documents are returned, but so far we do not know which of the provided documents the model is actually referencing when answering. We simply returned all of the top documents as citations.

Solutions:

  • Try tuning the number of documents returned from the vector store (in our setup, the maximum LangChain would return was 9 documents).

→ From my experience, 6–8 documents gave the best overall results. However, returning more documents can introduce noise, so I’d suggest combining document retrieval with tool calling or other methods that can help precisely define which documents the model should use.
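For reference, the document count is just the `k` option on the retriever. Here is a sketch with MongoDB Atlas Vector Search; the database, collection, and index names are placeholders for whatever you have configured in Atlas.

```typescript
// Sketch: build the vector store and control how many documents it returns.
import { MongoClient } from "mongodb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MongoDBAtlasVectorSearch } from "@langchain/mongodb";

const client = new MongoClient(process.env.MONGODB_ATLAS_URI!);
const collection = client.db("docs").collection("chunks"); // placeholder names

const vectorStore = new MongoDBAtlasVectorSearch(new OpenAIEmbeddings(), {
  collection,
  indexName: "vector_index", // placeholder: your Atlas vector search index
});

// The default is 4 documents; 6-8 worked best for me, but more means more noise.
const retriever = vectorStore.asRetriever({ k: 6 });
const docs = await retriever.invoke("Which benefit programs cover dental care?");
```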

  • Try tool calling from LangChain to see which documents the model is actually using (a combined sketch follows this list)
  • Try different vector store configurations (e.g. maximal marginal relevance in LangChain) to see if the results improve
  • Test different prompts that restrict the LLM from returning or answering with information that cannot be found in the returned citations
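Below is a sketch that combines these three ideas: an MMR-configured retriever, a prompt that restricts the model to the retrieved context, and a structured (tool-call style) output so the answer points at the exact documents it used. It assumes `@langchain/openai`, `zod`, the `vectorStore` from the earlier snippet, and a vector store that supports MMR search; all names are illustrative.

```typescript
// Sketch: MMR retrieval + restrictive prompt + structured "cited answer" output.
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";

// 1. Maximal marginal relevance: fetch a wider candidate set, keep a diverse top-k.
const retriever = vectorStore.asRetriever({
  k: 6,
  searchType: "mmr",
  searchKwargs: { fetchK: 20 },
});

// 2. Structured output (tool calling under the hood): the model must report
//    which numbered documents it relied on.
const citedAnswer = z.object({
  answer: z.string().describe("Answer based only on the given documents."),
  citations: z.array(z.number()).describe("IDs of the documents that support the answer."),
});

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 })
  .withStructuredOutput(citedAnswer);

// 3. Restrictive prompt: no answering outside the retrieved context.
const prompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer using only the numbered documents below. If they do not contain the answer, say so.\n\n{context}",
  ],
  ["human", "{question}"],
]);

const question = "Which benefit programs cover dental care?";
const docs = await retriever.invoke(question);
const context = docs.map((d, i) => `[${i}] ${d.pageContent}`).join("\n\n");

const result = await prompt.pipe(llm).invoke({ context, question });
// result.citations tells us exactly which of the retrieved documents were used,
// so only those need to be surfaced to the user as citations.
```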

Issue 3: The model retrieves the correct files but not the correct subsection

Solutions:

This post captures my journey so far into the fascinating world of building RAG applications. I’m looking forward to what we can all achieve with these tools and with AI in the future.
