The Essential Guide to Effectively Summarizing Massive Documents, Part 1 | by Vinayak Sengupta | Sep, 2024


RAG is a well-discussed and widely implemented solution for document summarization using GenAI technologies. However, like any new technology or solution, it is prone to edge-case challenges, especially in today's enterprise environment. Two main concerns are context length coupled with per-prompt cost, and the previously mentioned 'Lost in the Middle' context problem. Let's dive a bit deeper to understand these challenges.

Note: I will be performing the exercises in Python using the LangChain, scikit-learn, NumPy and Matplotlib libraries for quick iteration.

Today, with automated workflows enabled by GenAI, analyzing large documents has become an industry expectation. People want to quickly pull relevant information from medical reports or financial audits simply by prompting the LLM. But there is a caveat: enterprise documents are not like the documents or datasets we deal with in academia. They are considerably larger, and the pertinent information can be present pretty much anywhere in them. Hence, methods like data cleaning and filtering are often not viable, since domain knowledge about these documents is not always available.

In addition, even the latest Large Language Models (LLMs) like OpenAI's GPT-4o, with a context window of 128K tokens, cannot simply consume these documents in one shot; and even if they could, the quality of the response would not meet the standard, especially for the cost it would incur. To showcase this, let's take a real-world example: summarizing the Employee Handbook of GitLab, which can be downloaded here. The document is available free of charge under the MIT license on their GitHub repository.

1 We start by loading the document and initializing our LLM; to keep this exercise relevant, I will make use of GPT-4o.

from langchain_community.document_loaders import PyPDFLoader

# Load PDFs
pdf_paths = ["/content/gitlab_handbook.pdf"]
documents = []

for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")

2 Then we divide the document into smaller chunks (this is needed for embedding; I will explain why in later steps).

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split documents into chunks
splits = text_splitter.split_documents(documents)

3 Now, let's calculate how many tokens make up this document. To do this, we iterate through each chunk and sum up its token count.

total_tokens = 0

for chunk in splits:
    text = chunk.page_content  # `page_content` is where the text is stored
    num_tokens = llm.get_num_tokens(text)  # Get the token count for each chunk
    total_tokens += num_tokens

print(f"Total number of tokens in the book: {total_tokens}")

# Total number of tokens in the book: 254006

As we can see, the document contains 254,006 tokens, while GPT-4o's context window is limited to 128,000 tokens. The document cannot be sent through the LLM's API in one go. On top of that, at this model's pricing of $0.005 per 1K input tokens, a single request to OpenAI for this document would cost $1.27! That may not sound horrible in isolation, but it adds up quickly in an enterprise setting with multiple users and daily interactions across many such large documents, especially for startups where many GenAI solutions are being built.
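For a quick sanity check, here is that back-of-the-envelope calculation in code. The $0.005 per 1K input tokens figure is simply the pricing quoted above, so treat the result as a rough estimate rather than an exact bill.

# Rough input-cost estimate for sending the whole document in one request
input_cost_per_1k_tokens = 0.005  # GPT-4o input pricing quoted above, in USD

estimated_cost = (total_tokens / 1000) * input_cost_per_1k_tokens
print(f"Estimated input cost for a single request: ${estimated_cost:.2f}")

# Estimated input cost for a single request: $1.27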

Another challenge faced by LLMs is the 'Lost in the Middle' context problem, discussed in detail in this paper. Both the research and my own experience with RAG systems handling multiple documents show that LLMs are not very robust at extracting information from long context inputs. Model performance degrades considerably when the relevant information sits somewhere in the middle of the context, and improves when it appears at the beginning or the end of the provided context. Document re-ranking has become a subject of increasingly heavy discussion and research to tackle this specific issue, and I will explore a few of those methods in another post. For now, let us get back to the solution we are exploring, which utilizes K-Means clustering.

Okay, I admit I sneaked a technical concept into the last section; allow me to explain it (for those who may not be familiar with the method, I've got you).

First the basics

To understand K-means clustering, we should first know what clustering is. Consider this: we have a messy desk with pens, pencils, and notes all scattered together. To clean up, we would group like items together: all pens in one group, pencils in another, and notes in a third, creating three separate groups (not promoting segregation). Clustering is the same process applied to a collection of data (in our case, the different chunks of document text): similar pieces of information are grouped together, creating a clear separation of concerns and making it easier for our RAG system to pick and choose information effectively and efficiently instead of having to greedily trawl through all of it.

K, Means?

K-means is a specific method to perform clustering (there are other methods, but let's not information-dump). Let me explain how it works in 5 simple steps (a small NumPy sketch of these steps follows the list):

  1. Picking the number of groups (K): How many groups we want the data to be divided into
  2. Selecting group centers: Initially, a center value for each of the K-groups is randomly selected
  3. Group assignment: Each data point is then assigned to the group whose center it is closest to. Example: items closest to center 1 go to group 1, items closest to center 2 go to group 2…and so on up to the Kth group.
  4. Adjusting the centers: After all the data points have been pigeonholed, we calculate the average position of the items in each group, and these averages become the new centers (because the initial ones were selected at random).
  5. Rinse and repeat: With the new centers, the data point assignments are updated again for the K groups. This is repeated until items within a group are as close together as possible (measured by Euclidean distance) and as far as possible from items in other groups, ergo optimal segregation.
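To make those five steps concrete, here is a tiny NumPy sketch of the same loop. The data points and the value of K are made up purely for illustration; in practice you would simply use scikit-learn's KMeans.

import numpy as np

# Toy 2-D data: two obvious natural groups (made up for illustration)
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

K = 2  # Step 1: pick the number of groups
rng = np.random.default_rng(42)
centers = points[rng.choice(len(points), size=K, replace=False)]  # Step 2: random initial centers

for _ in range(10):  # Step 5: rinse and repeat
    # Step 3: assign each point to its nearest center (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 4: move each center to the average position of its assigned points
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centers, centers):  # stop once the centers settle
        break
    centers = new_centers

print(labels)  # e.g. [0 0 0 1 1 1] — two clean groups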

While this may be quite simplified, a more detailed and technical explanation of the algorithm (for my fellow nerds) can be found here.
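As a preview of how this applies to our handbook problem, here is a minimal sketch of the idea: embed the chunks, cluster the embeddings with scikit-learn's KMeans, and keep only the chunk closest to each cluster center as a representative sample to summarize. The embedding model and the choice of K below are assumptions made for illustration, not fixed recommendations.

import numpy as np
from sklearn.cluster import KMeans
from langchain_openai import OpenAIEmbeddings

# Embed every chunk (the embedding model here is an assumption for illustration)
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = np.array(embedding_model.embed_documents([chunk.page_content for chunk in splits]))

# Cluster the chunk embeddings; K is an assumed value to tune for your document
k = 10
kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto").fit(vectors)

# Keep the chunk closest to each cluster center as that cluster's representative
closest_indices = []
for center in kmeans.cluster_centers_:
    distances = np.linalg.norm(vectors - center, axis=1)
    closest_indices.append(int(distances.argmin()))

representative_chunks = [splits[i] for i in sorted(set(closest_indices))]
print(f"Selected {len(representative_chunks)} representative chunks out of {len(splits)}")

The intuition is that a handful of representative chunks can stand in for the whole document, keeping the token count (and therefore the cost) of each request far below the one-shot numbers we saw earlier.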


