
Introduction
The ‘Text chunking’ process in Natural Language Processing (NLP) involves the conversion of unstructured text data into meaningful units. This seemingly simple task belies the complexity of the various methods employed to achieve it, each with its strengths and weaknesses.
At a high level, these methods typically fall into one of two categories. The first, rule-based methods, hinge on the use of explicit separators such as punctuation or space characters, or the application of sophisticated systems like regular expressions, to partition text into chunks. The second category, semantic clustering methods, leverages the inherent meaning embedded in the text to guide the chunking process. These might utilize machine learning algorithms to discern context and infer natural divisions within the text.
In this article, we’ll explore and compare these two distinct approaches to text chunking. We’ll represent rule-based methods with NLTK, Spacy, and Langchain, and contrast this with two different semantic clustering techniques: KMeans and a custom technique for Adjacent Sentence Clustering.
The goal is to equip practitioners with a clear understanding of each method’s pros, cons, and ideal use cases to enable better decision-making in their NLP projects.
In Brazilian slang, "abacaxi," which translates to "pineapple," signifies "something that doesn’t yield a good outcome, a tangled mess, or something that is no good."
Use Cases for Text Chunking
Text chunking can be used by several different applications:
- Text Summarization: By breaking down large bodies of text into manageable chunks, we can summarize each section individually, leading to a more accurate overall summary.
- Sentiment Analysis: Analyzing the sentiment of shorter, coherent chunks can often yield more precise results than analyzing an entire document.
- Information Extraction: Chunking helps in locating specific entities or phrases within text, enhancing the process of information retrieval.
- Text Classification: Breaking down text into chunks allows classifiers to focus on smaller, contextually meaningful units rather than entire documents, which can improve performance.
- Machine Translation: Translation systems often operate on chunks of text rather than on individual words or whole documents. Chunking can aid in maintaining the coherence of the translated text.
Understanding these use cases can help in choosing the most suitable chunking technique for your specific project.
Comparing Different Methods for Semantic Chunking
In this part of the article, we will compare popular methods for chunking unstructured text: the NLTK Sentence Tokenizer, the Spacy Sentence Splitter, the Langchain Character Text Splitter, KMeans clustering, and clustering adjacent sentences based on similarity.
In the following example, we’re going to evaluate these techniques using text extracted from a PDF, processing it into sentences and then into clusters.
The data we used was a PDF exported from Brazil’s Wikipedia page.
To extract the text from the PDF and split it into sentences with NLTK, we use the following functions:
from PyPDF2 import PdfReader
import nltk
nltk.download('punkt')

# Extracting Text from PDF
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        text = " ".join(page.extract_text() for page in pdf.pages)
    return text

# Extract text from the PDF and split it into sentences
text = extract_text_from_pdf(file_path)
This leaves us with a single string, text, that is 210964 characters long.
Here’s a sample of the Wiki text:
sample = text[1015:3037]
print(sample)
"""
=======
Output:
=======
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of
mass immigration from around t he world,[13] and the most popul ous Roman Catholic-majority country.
Bounde d by the Atlantic Ocean on the east, Brazil has a coastline of 7,491 kilometers (4,655 mi).[14] It
borders all other countries and territories in South America except Ecuador and Chile and covers roughl y
half of the continent's land area.[15] Its Amazon basin includes a vast tropical forest, home to diverse
wildlife, a variety of ecological systems, and extensive natural resources spanning numerous protected
habitats.[14] This unique environmental heritage positions Brazil at number one of 17 megadiverse
countries, and is the subject of significant global interest, as environmental degradation through processes
like deforestation has direct impacts on gl obal issues like climate change and biodiversity loss.
The territory which would become know n as Brazil was inhabited by numerous tribal nations prior to the
landing in 1500 of explorer Pedro Álvares Cabral, who claimed the discovered land for the Portugue se
Empire. Brazil remained a Portugue se colony until 1808 when the capital of the empire was transferred
from Lisbon to Rio de Janeiro. In 1815, the colony was elevated to the rank of kingdom upon the
formation of the United Kingdom of Portugal, Brazil and the Algarves. Independence was achieved in
1822 with the creation of the Empire of Brazil, a unitary state gove rned unde r a constitutional monarchy
and a parliamentary system. The ratification of the first constitution in 1824 led to the formation of a
bicameral legislature, now called the National Congress.
"""
NLTK Sentence Tokenizer
The Natural Language Toolkit (NLTK) provides a useful function for splitting text into sentences. This sentence tokenizer divides a given block of text into its component sentences, which can then be used for further processing.
Implementation
Here’s an example of using the NLTK sentence tokenizer:
import nltk
nltk.download('punkt')
# Splitting Text into Sentences
def split_text_into_sentences(text):
sentences = nltk.sent_tokenize(text)
return sentences
sentences = split_text_into_sentences(text)
This returns a list of 2670 sentences extracted from the input text, with a mean of 78 characters per sentence.
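For reference, the count and mean length above can be reproduced with a quick check (not part of the original snippet):
# Number of sentences and their mean length in characters
print(len(sentences))
print(sum(len(s) for s in sentences) / len(sentences))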
Evaluating NLTK Sentence Tokenizer
While the NLTK Sentence Tokenizer is a straightforward and efficient way to divide a large body of text into individual sentences, it does come with certain limitations:
- Language Dependency: The NLTK Sentence Tokenizer relies heavily on the language of the text. It performs well with English but may not provide accurate results with other languages without additional configuration.
- Abbreviations and Punctuation: The tokenizer can occasionally misinterpret abbreviations or other punctuation at the end of a sentence, which can lead to sentence fragments being treated as independent sentences (a quick way to probe this is sketched after this list).
- Lack of Semantic Understanding: Like most tokenizers, the NLTK Sentence Tokenizer does not consider the semantic relationship between sentences. Therefore, a context that spans multiple sentences might be lost in the tokenization process.
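To see how the tokenizer behaves on punctuation-heavy input, you can run it on a small probe string and inspect the splits. This is a quick, illustrative check rather than part of the original pipeline, and the exact behavior depends on the abbreviations known to the pretrained Punkt model:
import nltk
nltk.download('punkt')

# A probe string with abbreviations and decimal points
probe = "Dr. Smith arrived at 5 p.m. in São Paulo. The No. 7 bus cost R$ 4.40."
for s in nltk.sent_tokenize(probe):
    print(repr(s))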
Spacy Sentence Splitter
Spacy, another powerful NLP library, provides a sentence tokenization function that relies heavily on linguistic rules. It is a similar approach to NLTK.
Implementation
Implementing Spacy’s sentence splitter is quite straightforward. Here’s how to do it in Python:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
sentences = list(doc.sents)
This returns a list of 2336 sentences extracted from the input text, with a mean of 89 characters per sentence.
Evaluating Spacy Sentence Splitter
Spacy’s sentence splitter tends to create smaller chunks than the Langchain Character Text Splitter (covered in the next section), as it strictly adheres to sentence boundaries. This can be advantageous when smaller text units are necessary for analysis.
Like NLTK, however, Spacy’s performance depends on the quality of the input text. For poorly punctuated or structured text, the identified sentence boundaries might not always be accurate.
Now, we’ll see how Langchain provides a framework for chunking text data and further compare it with NLTK and Spacy.
Langchain Character Text Splitter
The Langchain Character Text Splitter works by recursively dividing the text at specific characters. It is especially useful for generic text.
The splitter is defined by a list of characters. It attempts to split the text on these characters, in order, until the generated chunks meet the desired size criterion. The default list is ["\n\n", "\n", " ", ""], aiming to keep paragraphs, sentences, and words together as much as possible to maintain semantic coherence.
Implementation
Consider the following example, where we split the sample text extracted from our PDF using this method.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with custom parameters
custom_text_splitter = RecursiveCharacterTextSplitter(
    # Set custom chunk size
    chunk_size = 100,
    chunk_overlap = 20,
    # Use length of the text as the size measure
    length_function = len,
)

# Create the chunks
texts = custom_text_splitter.create_documents([sample])

# Print the first two chunks
print(f'### Chunk 1: \n\n{texts[0].page_content}\n\n=====\n')
print(f'### Chunk 2: \n\n{texts[1].page_content}\n\n=====')
"""
=======
Output:
=======
### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
=====
### Chunk 2:
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of
=====
"""
Finally, we end up with 3205 chunks of text, represented by the texts list. The mean chunk length here is 65.8 characters – a bit less than NLTK’s mean (79 characters).
Changing Parameters and Using the ‘\n’ Separator:
For a more customized approach with the Langchain Splitter, we can alter the chunk_size and chunk_overlap parameters according to our needs. Additionally, we can specify a single character (or set of characters) for the splitting operation, such as '\n'. This guides the splitter to separate the text into chunks only at newline characters.
Let’s consider an example where we set chunk_size to 300, chunk_overlap to 30, and use only '\n' as the separator.
# Initialize the text splitter with custom parameters
custom_text_splitter = RecursiveCharacterTextSplitter(
    # Set custom chunk size
    chunk_size = 300,
    chunk_overlap = 30,
    # Use length of the text as the size measure
    length_function = len,
    # Use only "\n" as the separator
    separators = ['\n']
)

# Create the chunks
custom_texts = custom_text_splitter.create_documents([sample])

# Print the first two chunks
print(f'### Chunk 1: \n\n{custom_texts[0].page_content}\n\n=====\n')
print(f'### Chunk 2: \n\n{custom_texts[1].page_content}\n\n=====')
Now, let’s compare some outputs from the standard set of parameters with the custom parameters:
# Print the sampled chunks
print("==== Sample chunks from 'Standard Parameters': ====\n\n")
for i, chunk in enumerate(texts):
    if i < 4:
        print(f"### Chunk {i+1}: \n{chunk.page_content}\n")

print("==== Sample chunks from 'Custom Parameters': ====\n\n")
for i, chunk in enumerate(custom_texts):
    if i < 4:
        print(f"### Chunk {i+1}: \n{chunk.page_content}\n")
"""
=======
Output:
=======
==== Sample chunks from 'Standard Parameters': ====
### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
### Chunk 2:
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of
### Chunk 3:
of the union of the 26
### Chunk 4:
states and the Federal District. It is the only country in the Americas to have Portugue se as an
==== Sample chunks from 'Custom Parameters': ====
### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26
### Chunk 2:
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of
### Chunk 3:
mass immigration from around t he world,[13] and the most popul ous Roman Catholic-majority country.
Bounde d by the Atlantic Ocean on the east, Brazil has a coastline of 7,491 kilometers (4,655 mi).[14] It
### Chunk 4:
borders all other countries and territories in South America except Ecuador and Chile and covers roughl y
half of the continent's land area.[15] Its Amazon basin includes a vast tropical forest, home to diverse
"""
We can already see that these custom parameters yield much larger chunks, each keeping more content than those produced by the default set of parameters.
Evaluating the Langchain Character Text Splitter
After splitting the text into chunks using different parameters, we obtain two lists of chunks: texts and custom_texts, containing 3205 and 1404 text chunks, respectively. Now, let’s plot the distribution of chunk lengths for these two scenarios to better understand the impact of changing the parameters.
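The plotting code is not part of the snippets above; a minimal sketch with matplotlib, assuming the texts and custom_texts lists from the previous steps, could look like this:
import matplotlib.pyplot as plt

# Chunk lengths for both parameter sets
lengths = [len(doc.page_content) for doc in texts]
custom_lengths = [len(doc.page_content) for doc in custom_texts]

# Overlay the two distributions in one histogram
plt.hist(lengths, bins=50, alpha=0.5, label='Standard parameters')
plt.hist(custom_lengths, bins=50, alpha=0.5, label='Custom parameters')
plt.xlabel('Chunk length (characters)')
plt.ylabel('Frequency')
plt.legend()
plt.show()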
[Figure: Histogram of chunk lengths for the standard (blue) and custom (orange) Langchain parameters]
In this histogram, the x-axis represents the chunk lengths, while the y-axis represents the frequency of each length. The blue bars represent the distribution of chunk lengths for the original parameters, and the orange bars represent the distribution of the custom parameters. By comparing these two distributions, we can see how the changes in parameters affected the resulting chunk lengths.
Remember, the ideal distribution depends on the specific requirements of your text-processing task. You might want smaller, more numerous chunks if you’re dealing with fine-grained analysis or larger, fewer chunks for broader semantic analysis.
Langchain Character Text Splitter vs. NLTK and Spacy
Earlier, we generated 3205 chunks using the Langchain splitter with the first, “standard” set of parameters (a chunk size of 100 characters). The NLTK Sentence Tokenizer, on the other hand, split the same text into a total of 2670 sentences.
To get a more intuitive understanding of the difference between these methods, we can visualize the distribution of chunk lengths. The following plot shows the densities of chunk lengths for each method, allowing us to see how the lengths are distributed and where most of the lengths lie.
[Figure 1: Density of chunk lengths for the Langchain splitter, NLTK Sentence Tokenizer, and Spacy Sentence Splitter]
From Figure 1, we can see that the Langchain splitter produces a much more concentrated distribution of chunk lengths and tends toward longer chunks. NLTK and Spacy, on the other hand, produce very similar outputs: they prefer smaller sentences, but with many outliers whose lengths can reach up to 1400 characters, and a density that decreases as length grows.
KMeans Clustering
Sentence clustering is a technique for grouping sentences based on their semantic similarity. We can implement it by generating sentence embeddings and applying a clustering algorithm such as KMeans.
Implementation
Here is a simple example code snippet using the Python library sentence-transformers for generating sentence embeddings and scikit-learn for KMeans clustering:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
# Load the Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Define a list of sentences (your text data)
sentences = ["This is an example sentence.", "Another sentence goes here.", "..."]
# Generate embeddings for the sentences
embeddings = model.encode(sentences)
# Choose an appropriate number of clusters (here we choose 3 as an example)
num_clusters = 3
# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters)
clusters = kmeans.fit_predict(embeddings)
You can see here that the steps for clustering a list of sentences are:
- Load a Sentence Transformer model. In this case, we’re using all-MiniLM-L6-v2 from sentence-transformers/all-MiniLM-L6-v2 on HuggingFace.
- Define your sentences and generate their embeddings with the model’s encode() method.
- Define your clustering technique and number of clusters (we’re using KMeans with 3 clusters here) and fit it to the embeddings. A quick way to inspect the resulting assignments is sketched right after this list.
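As a quick, illustrative follow-up (not part of the original walkthrough), we can check how many sentences fell into each cluster and peek at the assignments:
from collections import Counter

# Count how many sentences fall into each cluster
print(Counter(clusters))

# Show the cluster label assigned to each sentence
for sentence, label in zip(sentences, clusters):
    print(label, sentence)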
Evaluating KMeans Clustering
Finally, we plot a WordCloud for each cluster.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('stopwords')

# Define a list of stop words
stop_words = set(stopwords.words('english'))

# Define a function to clean sentences
def clean_sentence(sentence):
    # Tokenize the sentence
    tokens = word_tokenize(sentence)
    # Convert to lower case
    tokens = [w.lower() for w in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove non-alphabetic tokens
    words = [word for word in stripped if word.isalpha()]
    # Filter out stop words
    words = [w for w in words if w not in stop_words]
    return words

# Compute and print Word Clouds for each cluster
for i in range(num_clusters):
    cluster_sentences = [sentences[j] for j in range(len(sentences)) if clusters[j] == i]
    cleaned_sentences = [' '.join(clean_sentence(s)) for s in cluster_sentences]
    # Join the cleaned sentences (use a separate name so the PDF text stored in `text` is preserved)
    wordcloud_text = ' '.join(cleaned_sentences)
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(wordcloud_text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {i}")
    plt.show()
Below we have the WordCloud plots for the generated clusters:
[Figures: Word clouds of the most frequent words in each of the three clusters]
Looking at the word clouds for the KMeans clusters, it’s evident that each cluster is clearly differentiated by the semantics of its most frequent words, demonstrating strong semantic separation between clusters. Moreover, there is a noticeable variation in cluster sizes, indicating a significant disparity in the number of sentences each cluster contains.
Limitations of KMeans Clustering
Sentence clustering, although beneficial, does have a few notable drawbacks. The primary limitations include:
- Loss of Sentence Order: Sentence clustering doesn’t retain the original sequence of sentences, which can distort the natural flow of the narrative. This is a major limitation, since the original order often carries meaning.
- Computational Efficiency: KMeans can be computationally intensive and slow, especially with large text corpora or when working with a larger number of clusters. This can be a significant drawback for real-time applications or when handling big data.
Clustering Adjacent Sentences
To overcome some of the limitations of KMeans clustering, especially the loss of sentence order, an alternative approach could be clustering adjacent sentences based on their semantic similarity. The fundamental premise of this approach is that two sentences that appear consecutively in a text are more likely to be semantically related than two sentences that are farther apart.
Implementation
Here’s an expanded implementation of this heuristic, using Spacy sentences as inputs:
import numpy as np
import spacy

# Load the Spacy model
nlp = spacy.load('en_core_web_sm')

def process(text):
    doc = nlp(text)
    sents = list(doc.sents)
    vecs = np.stack([sent.vector / sent.vector_norm for sent in sents])
    return sents, vecs

def cluster_text(sents, vecs, threshold):
    clusters = [[0]]
    for i in range(1, len(sents)):
        if np.dot(vecs[i], vecs[i-1]) < threshold:
            clusters.append([])
        clusters[-1].append(i)
    return clusters

def clean_text(text):
    # Add your text cleaning process here
    return text

# Initialize the clusters lengths list and final texts list
clusters_lens = []
final_texts = []

# Process the chunk
threshold = 0.3
sents, vecs = process(text)

# Cluster the sentences
clusters = cluster_text(sents, vecs, threshold)

for cluster in clusters:
    cluster_txt = clean_text(' '.join([sents[i].text for i in cluster]))
    cluster_len = len(cluster_txt)

    # Check if the cluster is too short
    if cluster_len < 60:
        continue

    # Check if the cluster is too long
    elif cluster_len > 3000:
        threshold = 0.6
        sents_div, vecs_div = process(cluster_txt)
        reclusters = cluster_text(sents_div, vecs_div, threshold)

        for subcluster in reclusters:
            div_txt = clean_text(' '.join([sents_div[i].text for i in subcluster]))
            div_len = len(div_txt)

            if div_len < 60 or div_len > 3000:
                continue

            clusters_lens.append(div_len)
            final_texts.append(div_txt)

    else:
        clusters_lens.append(cluster_len)
        final_texts.append(cluster_txt)
Key takeaways from this code:
- Text Processing: Each text chunk is passed to the process function. This function uses the SpaCy library to create sentence embeddings, which represent the semantic meaning of each sentence in the chunk.
- Cluster Creation: The cluster_text function forms clusters of sentences based on the cosine similarity of their embeddings. Whenever the similarity between consecutive sentences drops below the threshold, a new cluster begins.
- Length Check: The code then checks the length of each cluster. Clusters shorter than 60 characters are discarded, and clusters longer than 3000 characters are re-processed with a higher threshold (0.6) and split into subclusters of acceptable length. A quick way to check the number and average size of the resulting chunks is sketched after this list.
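As an illustrative check (not part of the original code), we can look at how many chunks the procedure produced and their average length:
# Number of chunks and their mean length in characters
print(f"Chunks: {len(final_texts)}")
print(f"Mean chunk length: {sum(clusters_lens) / len(clusters_lens):.1f}")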
Let’s take a look at some of the output chunks from this approach and compare them to Langchain Splitter:
==== Sample chunks from 'Langchain Splitter with Custom Parameters': ====
### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26
### Chunk 2:
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of
==== Sample chunks from 'Adjacent Sentences Clustering': ====
### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo.
### Chunk 2:
The federation is composed of the union of the 26
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12]
Great, now let’s compare the distribution of chunk lengths of final_texts (from the adjacent sentence clustering approach) with the distributions from the Langchain Character Text Splitter and NLTK Sentence Tokenizer. To do this, we’ll first need to calculate the lengths of the chunks in final_texts:
final_texts_lengths = [len(chunk) for chunk in final_texts]
We can now plot the distributions of all three methods:
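The plotting code for this comparison is not shown in the original snippets; a minimal sketch using seaborn could look like the one below. The names nltk_lengths and spacy_lengths are hypothetical placeholders for the sentence-length lists from the NLTK and Spacy runs earlier; they are not defined in the code above:
import seaborn as sns
import matplotlib.pyplot as plt

# Chunk lengths for the Langchain splitter
langchain_lengths = [len(doc.page_content) for doc in texts]
# nltk_lengths / spacy_lengths: hypothetical lists of sentence lengths from the earlier sections

plt.figure(figsize=(10, 6))
sns.kdeplot(langchain_lengths, label='Langchain Splitter')
sns.kdeplot(nltk_lengths, label='NLTK Sentence Tokenizer')
sns.kdeplot(spacy_lengths, label='Spacy Sentence Splitter')
sns.kdeplot(final_texts_lengths, label='Adjacent Sentences Clustering')
plt.xlabel('Chunk length (characters)')
plt.ylabel('Density')
plt.legend()
plt.show()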
[Figure 6: Density of chunk lengths for the Langchain splitter, NLTK Sentence Tokenizer, Spacy Sentence Splitter, and Adjacent Sentences Clustering]
From Figure 6, we can see that the Langchain splitter, using its predefined chunk size, creates a uniform distribution, implying consistent chunk lengths.
The Spacy Sentence Splitter and the NLTK Sentence Tokenizer, on the other hand, prefer smaller sentences but include many larger outliers, reflecting their reliance on linguistic cues to determine splits and the potential for irregularly sized chunks.
Lastly, the custom Adjacent Sequence Clustering approach, which clusters based on semantic similarity, exhibits a more varied distribution. This could be indicative of a more context-sensitive approach, maintaining the coherence of content within chunks while allowing for more flexibility in size.
Evaluating Adjacent Sequence Clustering Approach
The Adjacent Sequence Clustering Approach brings unique benefits:
- Contextual Coherence: Generates thematically consistent chunks by considering semantic and contextual coherence.
- Flexibility: Balances context preservation and computational efficiency, providing adjustable chunk sizes.
- Threshold Tuning: Allows users to fine-tune the chunking process according to their needs by adjusting the similarity threshold (a short sketch of this is shown after this list).
- Sequence Preservation: Retains the original order of sentences in the text, essential for sequential language models and tasks where text order matters.
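To illustrate the threshold tuning point above, here is a small sketch (not part of the original code) that reuses the process and cluster_text functions defined earlier and shows how the number of clusters grows as the similarity threshold is raised:
# Reuse the sentence vectors computed earlier and try a few thresholds
for t in (0.2, 0.3, 0.4, 0.5):
    n_clusters = len(cluster_text(sents, vecs, t))
    print(f"threshold={t}: {n_clusters} clusters")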
Comparing Text Chunking Methods: Summary of Insights
Langchain Character Text Splitter
This method provides consistent chunk lengths, yielding a uniform distribution. This could be beneficial when a standard size is necessary for downstream processing or analysis. The approach is less sensitive to the specific linguistic structure of the text, focusing more on producing chunks of a predefined character length.
NLTK Sentence Tokenizer and Spacy Sentence Splitter
These approaches exhibit a preference for smaller sentences but include many larger outliers. While this can result in more linguistically coherent chunks, it can also lead to high variability in chunk size.
These methods can also produce good results that serve well as inputs to downstream tasks.
Adjacent Sequence Clustering
This method generates a more varied distribution, indicative of its context-sensitive approach. By clustering based on semantic similarity, it ensures that the content within each chunk is coherent while allowing for flexibility in chunk size. This method may be advantageous when it is important to preserve the semantic continuity of text data.
For a more visual and abstract (or silly) representation, let’s look at Figure 7 below and try to figure out which kind of pineapple "cut" would best represent the approaches discussed:
[Figure 7: Four different ways of cutting a pineapple, used as an analogy for the chunking approaches]
Listing them in order:
- Cut number 1 would represent a rule-based approach, in which you can just "peel off" the "junk" text you don’t want, based on filters or regular expressions. It’s a lot of work to do the whole pineapple, though, since the result also retains a lot of outliers with a much bigger context size.
- Langchain would be like cut number 2. Very similar pieces in size but not holding the entire desired context (it’s a triangle, so it could be a watermelon as well).
- Cut number 3 is definitely KMeans. You may even group only what makes sense for you – the juiciest part – but you won’t get its core. Without it, the chunks lose all the structure and meaning. I think it takes a lot of work to do that as well… especially for bigger pineapples.
- Lastly, cut number 4 illustrates the Adjacent Sentence Clustering method. The size of the chunks can vary but they often maintain contextual information, similar to uneven pineapple pieces that still indicate the fruit’s overall structure.
TL;DR: In this article, we’ve compared three Text Chunking methods and their unique benefits. Langchain offers consistent chunk sizes, but the linguistic structure takes a back seat. NLTK and Spacy give linguistically coherent chunks, yet the size varies considerably. Adjacent Sequence Clustering clusters based on semantic similarity, providing content coherence with flexible chunk sizes. Ultimately, the optimal choice hinges on your specific needs, including linguistic coherence, uniformity in chunk size, and available computational power.
Thank you for reading!