
Optimizing Text Input for RAG Models Using Chunking and Text Splitting Strategies

In the field of generative AI, Retrieval-Augmented Generation (RAG) models overcome the limitation of relying solely on the data and patterns used to train a model: they retrieve relevant information from institutional data sources and use it to augment the model's responses. For developers and technical architects, the challenge is feeding a large corpus of institutional data into this retrieval step. Doing so requires breaking the input text into manageable pieces, a process known as chunking or text splitting.

In this blog post, we will explore different strategies for text chunking and splitting. We’ll provide practical examples and coding tips to help improve the performance of RAG models, focusing on effective data retrieval, processing efficiency, and the relevance and quality of the generated responses.

Why is Chunking Important?

Chunking strategies improve information retrieval by ensuring that the context retrieved for each query is both complete and targeted. By splitting text effectively, RAG models can generate responses that are more accurate and contextually appropriate. A solution designed around efficient chunking can also reduce compute costs and improve overall throughput.

To get the best performance from RAG models, it's important to chunk text effectively. Below, we look at several chunking strategies that help these models produce more accurate results.

Strategies for Text Chunking and Splitting

1. Fixed-Length Chunking

Fixed-length chunking is a simple text segmentation method in which text is divided into equal-sized chunks based on a predetermined number of characters or words. It is useful for processing large volumes of text into uniform blocks. Its limitation is that it ignores the semantic integrity of the text, splitting sentences and ideas mid-stream. Fixed-length chunking is therefore best suited to scenarios where a uniform chunk size matters more than context.

Use Case:

  • Diagnostic Trouble Codes (Automotive): In automotive diagnostics, service manuals and trouble code descriptions can benefit from fixed-length chunking.
  • Genetic Sequences (Life Sciences): The analysis of genetic sequences, where data is inherently uniform and structured into specific sequences and lengths, is well-suited for fixed-length chunking.
  • Data Collection from Structured Surveys (Research): In research fields, especially when dealing with structured survey responses or standardized data entries, fixed-length chunking can be effectively employed to uniformly segment and analyze response data. 

Python Code Sample:


def split_text_into_fixed_chunks(input_text, chunk_size):
    # Slice the text into consecutive chunks of chunk_size characters
    return [input_text[i:i + chunk_size] for i in range(0, len(input_text), chunk_size)]
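
An illustrative call, using an arbitrary sample string and chunk size:

text = "Retrieval-Augmented Generation models retrieve relevant context before generating a response."
print(split_text_into_fixed_chunks(text, 30))
# e.g. ['Retrieval-Augmented Generation', ' models retrieve relevant cont', ...]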

 

2. Sentence-Aware Chunking

The sentence-aware chunking method segments text into chunks along sentence boundaries while respecting a maximum length for each chunk. This approach ensures that each segment contains only full sentences, preserving the grammatical integrity and contextual flow of the original text. The technique is useful for texts where fully understanding each sentence is important.

Use Case:

  • Academic Articles and Textbooks (Education): In academic settings, especially when dealing with dense scholarly articles or comprehensive textbooks, sentence-aware chunking helps in breaking down complex concepts into manageable units. 
  • Medical Guidelines and Protocols (Healthcare): Medical guidelines and protocols often contain critical step-by-step instructions; sentence-aware chunking keeps each instructional step whole within a chunk, reducing the risk of misinterpretation and enhancing clarity for healthcare providers.

Python Code Sample:


import nltk
from nltk.tokenize import sent_tokenize

# Install nltk with "pip install --user -U nltk"
nltk.download('punkt')  # Punkt tokenizer models used by sent_tokenize

def sentence_aware_chunking(input_text, max_len):
    sentences = sent_tokenize(input_text)
    chunks = []
    current_chunk = []
    current_len = 0
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed max_len
        if current_chunk and current_len + len(sentence) > max_len:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_len = len(sentence)
        else:
            current_chunk.append(sentence)
            current_len += len(sentence)
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
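
An illustrative call with an arbitrary sample text and a 60-character limit:

sample = "Chunking matters. It keeps context intact. Long documents need careful splitting."
print(sentence_aware_chunking(sample, 60))
# Sentences are grouped until adding another would exceed 60 characters.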

 

3. Semantic Chunking

Semantic chunking leverages natural language processing (NLP) tools to segment text along semantic boundaries, using the meaning and context of the text itself. It identifies shifts in topic or theme, ensuring that each chunk represents a coherent idea or narrative thread.

Use Case:

  • Medical Research Papers (Healthcare): Semantic chunking helps separate the sections of medical research papers that discuss different findings or theoretical points, making the extraction of specific information more efficient for researchers and practitioners.
  • Market Analysis Reports (Retail): By splitting text into semantically consistent chunks, applications can retrieve the most relevant data, facilitating more targeted decision-making based on segmented analysis of consumer behavior, product performance, and the competitive landscape.

Python Code Sample:


from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')

def semantic_chunking(input_text, threshold_value):
    doc = nlp(input_text)
    chunks = []
    current_chunk = []
    for sent in doc.sents:
        current_chunk.append(sent.text)
        # Close the chunk once it grows past the threshold
        if len(' '.join(current_chunk)) > threshold_value:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
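
The sample above approximates semantic boundaries with a simple length threshold. A sketch closer to the topic-shift behavior described earlier can be built with sentence embeddings; this minimal example assumes the sentence-transformers package is installed, and the model name and 0.5 similarity threshold are illustrative choices rather than recommendations:

import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embedding_based_chunking(input_text, similarity_threshold=0.5):
    sentences = sent_tokenize(input_text)
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < similarity_threshold:
            # A drop in similarity suggests a topic shift: start a new chunk
            chunks.append([sentences[i]])
        else:
            chunks[-1].append(sentences[i])
    return [' '.join(chunk) for chunk in chunks]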

 

4. Sliding Window Chunking

Sliding window chunking is a text processing technique in which text is divided into overlapping chunks using a predefined window size and step size. This approach ensures that the end of one chunk overlaps with the start of the next, preserving the continuity of information across chunk boundaries. It maintains contextual integrity, ensuring that each chunk can leverage context from the previous one.

Use Case:

  • Patient Outcomes Analysis (Healthcare): This technique is useful in analyzing medical transcriptions or clinical notes where patient symptoms, treatments, and outcomes are captured in a sequence. Sliding window chunking helps maintain the narrative flow, ensuring that no critical information is lost between the chunks.
  • Transaction Data Analysis (Banking): Effective for financial analysis reports and continuous transaction records where it is crucial to maintain the flow of information from one transaction or event to the next to accurately track financial trends and patterns.

Python Code Sample:


def sliding_window_chunking(input_text, chunk_size, step_size):
    # Each chunk starts step_size characters after the previous one, so
    # consecutive chunks overlap by (chunk_size - step_size) characters.
    return [input_text[i:i + chunk_size]
            for i in range(0, max(len(input_text) - chunk_size, 0) + step_size, step_size)]
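
For instance, with a 10-character window and a 5-character step (values chosen purely for illustration), consecutive chunks share their boundary characters:

sample = "abcdefghijklmnop"
print(sliding_window_chunking(sample, 10, 5))
# ['abcdefghij', 'fghijklmno', 'klmnop']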

 

5. Adaptive Chunking

Adaptive chunking is an advanced text-splitting strategy that dynamically adjusts chunk sizes based on the complexity and content of the text. It employs natural language processing algorithms to analyze the structure and semantics of the text, identifying logical endpoints and thematic breaks where chunks should begin and end. This maximizes the relevance and completeness of the information within each chunk, supporting better retrieval and generation in RAG models.

Use Case:

  • Patient Data Analysis (Healthcare): In healthcare documentation, where patient records mix structured data (like test results) with unstructured data (like physician's notes), adaptive chunking can help isolate specific information for detailed analysis without losing context.
  • Customer Reviews (Retail): In retail, adaptive chunking can be used to distinguish between different aspects of feedback or product features in online customer reviews and product descriptions. This helps extract actionable insights from customer sentiment that is often buried in general comments, and supports summarizing product features whose importance varies for potential buyers.

Python Code Sample:


import spacy

nlp = spacy.load("en_core_web_sm")

def adaptive_chunking(text, min_length, max_length):
    doc = nlp(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in doc.sents:
        sentence_text = sentence.text.strip()
        sentence_length = len(sentence_text)
        if sentence_length > max_length:
            # A sentence longer than max_length becomes its own chunk
            if current_chunk:
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_length = 0
            chunks.append(sentence_text)
        elif current_length + sentence_length > max_length and current_length >= min_length:
            # Close the current chunk once it has reached min_length and
            # adding the next sentence would push it past max_length
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence_text]
            current_length = sentence_length
        else:
            current_chunk.append(sentence_text)
            current_length += sentence_length
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
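
An illustrative call with arbitrary length bounds; a chunk closes only after it reaches the minimum length, and any sentence longer than the maximum is kept whole as its own chunk:

sample = ("Chunking improves retrieval. Adaptive strategies adjust chunk sizes "
          "dynamically. This keeps related sentences together while respecting "
          "length bounds.")
print(adaptive_chunking(sample, 40, 120))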

Conclusion

Effective chunking and text-splitting strategies can significantly enhance the performance of RAG models. By selecting an appropriate method based on the specific needs of your text and application, you can optimize the retrieval process and improve the overall quality of content generation.
