---
license: mit
library_name: sentence-transformers
pipeline_tag: text-classification
---
This model borrows from Greg Kamradt’s work here: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb. The idea is to segment text into semantically coherent chunks, using sentence-transformers embeddings to represent the meaning of sentences and detect shifts in meaning that mark potential breakpoints between chunks.
### Model Description
This model aims to segment a text into semantically coherent chunks. It uses sentence-transformers embeddings to represent the meaning of sentences and detect shifts in meaning that identify potential breakpoints between chunks. There are two primary changes from Greg Kamradt's excellent original work: 1) sentence-transformer embeddings are used rather than OpenAI embeddings, providing an entirely open-source implementation of semantic chunking, and 2) functionality is added to merge smaller chunks with their most semantically similar neighbors to better normalize chunk size.
The goal is to use semantic understanding so that the model considers the meaning of text segments rather than relying purely on punctuation or syntax, and to provide flexibility: the percentile_threshold and min_chunk_size parameters can be adjusted to influence the granularity of the chunks.
General Outline:
Preprocessing:
- Loading Text: Reads the text from the specified path.
- Sentence Tokenization: Splits the text into a list of individual sentences using nltk's sentence tokenizer.
Semantic Embeddings:
- Model Loading: Loads a pre-trained Sentence Transformer model (in this case, 'sentence-transformers/all-mpnet-base-v1').
- Embedding Generation: Converts each sentence into an embedding to represent its meaning.
Sentence Combination:
- Combines each sentence with its neighbors to form slightly larger units, helping the model understand the context in which changes of topic are likely to occur.
Breakpoint Identification:
- Cosine Distance: Calculates cosine distances between embeddings of the combined sentences. These distances represent the degree of semantic dissimilarity.
- Percentile-Based Threshold: Determines a threshold based on a percentile of the distances (e.g., 95th percentile), where higher values indicate more significant semantic shifts.
- Locating Breaks: Identifies the indices of distances above the threshold, which mark potential breakpoints between chunks (a minimal sketch of this step follows the outline).
Chunk Creation:
- Splitting at Breakpoints: Divides the original sentences into chunks based on the identified breakpoints.
Chunk Merging:
- Minimum Chunk Size: Defines a minimum number of sentences to consider a chunk valid.
- Similarity-Based Merging: Merges smaller chunks with their most semantically similar neighbor based on cosine similarity between chunk embeddings.
Output:
- The model ultimately produces a list of text chunks (chunks), each representing a somewhat self-contained, semantically cohesive segment of the original text.
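As a minimal illustration of the breakpoint identification step, using made-up distance values rather than distances computed from real embeddings:

```python
import numpy as np

# Toy cosine distances between consecutive contextualized sentences.
# Larger values suggest a bigger shift in meaning.
distances = [0.12, 0.15, 0.64, 0.10, 0.18, 0.71, 0.14]

# Threshold at the 95th percentile of the observed distances.
threshold = np.percentile(distances, 95)

# Indices whose distance exceeds the threshold become candidate breakpoints.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 3), breakpoints)  # -> approximately 0.689 and [5]
```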
## Usage
Using this chunker is easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
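The example below also relies on nltk, scikit-learn, and numpy, which can be installed the same way:
```
pip install -U nltk scikit-learn numpy
```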
Then you can use it like this:
```python
"""
Text Chunking Utility
This module provides functionality to intelligently chunk text documents into semantically coherent sections
using sentence embeddings and cosine similarity. It's particularly useful for processing large documents
while maintaining contextual relationships between sentences.
Requirements:
- nltk
- sentence-transformers
- scikit-learn
- numpy
"""
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Ensure the sentence tokenizer data is available
# (newer nltk releases may also require the 'punkt_tab' resource).
nltk.download('punkt', quiet=True)
class TextChunker:
def __init__(self, model_name='sentence-transformers/all-mpnet-base-v1'):
"""Initialize the TextChunker with a specified sentence transformer model."""
self.model = SentenceTransformer(model_name)
def process_file(self, file_path, context_window=1, percentile_threshold=95, min_chunk_size=3):
"""
Process a text file and split it into semantically meaningful chunks.
Args:
file_path: Path to the text file
context_window: Number of sentences to consider on either side for context
percentile_threshold: Percentile threshold for identifying breakpoints
min_chunk_size: Minimum number of sentences in a chunk
Returns:
list: Semantically coherent text chunks
"""
# Process the text file
sentences = self._load_text(file_path)
contextualized = self._add_context(sentences, context_window)
embeddings = self.model.encode(contextualized)
# Create and refine chunks
distances = self._calculate_distances(embeddings)
breakpoints = self._identify_breakpoints(distances, percentile_threshold)
initial_chunks = self._create_chunks(sentences, breakpoints)
# Merge small chunks for better coherence
chunk_embeddings = self.model.encode(initial_chunks)
final_chunks = self._merge_small_chunks(initial_chunks, chunk_embeddings, min_chunk_size)
return final_chunks
def _load_text(self, file_path):
"""Load and tokenize text from a file."""
with open(file_path, 'r', encoding='utf-8') as file:
text = file.read()
return sent_tokenize(text)
def _add_context(self, sentences, window_size):
"""Combine sentences with their neighbors for better context."""
contextualized = []
for i in range(len(sentences)):
start = max(0, i - window_size)
end = min(len(sentences), i + window_size + 1)
context = ' '.join(sentences[start:end])
contextualized.append(context)
return contextualized
def _calculate_distances(self, embeddings):
"""Calculate cosine distances between consecutive embeddings."""
distances = []
for i in range(len(embeddings) - 1):
similarity = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
distance = 1 - similarity
distances.append(distance)
return distances
def _identify_breakpoints(self, distances, threshold_percentile):
"""Find natural breaking points in the text based on semantic distances."""
threshold = np.percentile(distances, threshold_percentile)
return [i for i, dist in enumerate(distances) if dist > threshold]
def _create_chunks(self, sentences, breakpoints):
"""Create initial text chunks based on identified breakpoints."""
chunks = []
start_idx = 0
for breakpoint in breakpoints:
chunk = ' '.join(sentences[start_idx:breakpoint + 1])
chunks.append(chunk)
start_idx = breakpoint + 1
# Add the final chunk
final_chunk = ' '.join(sentences[start_idx:])
chunks.append(final_chunk)
return chunks
def _merge_small_chunks(self, chunks, embeddings, min_size):
"""Merge small chunks with their most similar neighbor."""
        # If there is only a single chunk there is nothing to merge.
        if len(chunks) == 1:
            return chunks
        final_chunks = [chunks[0]]
        merged_embeddings = [embeddings[0]]
for i in range(1, len(chunks) - 1):
current_chunk_size = len(chunks[i].split('. '))
if current_chunk_size < min_size:
# Calculate similarities
prev_similarity = cosine_similarity([embeddings[i]], [merged_embeddings[-1]])[0][0]
next_similarity = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
if prev_similarity > next_similarity:
# Merge with previous chunk
final_chunks[-1] = f"{final_chunks[-1]} {chunks[i]}"
merged_embeddings[-1] = (merged_embeddings[-1] + embeddings[i]) / 2
else:
# Merge with next chunk
chunks[i + 1] = f"{chunks[i]} {chunks[i + 1]}"
embeddings[i + 1] = (embeddings[i] + embeddings[i + 1]) / 2
else:
final_chunks.append(chunks[i])
merged_embeddings.append(embeddings[i])
final_chunks.append(chunks[-1])
return final_chunks
def main():
"""Example usage of the TextChunker class."""
# Initialize the chunker
chunker = TextChunker()
# Process a text file
file_path = "path/to/your/document.txt"
chunks = chunker.process_file(
file_path,
context_window=1,
percentile_threshold=95,
min_chunk_size=3
)
# Print results
print(f"Successfully split text into {len(chunks)} chunks")
print("\nFirst chunk preview:")
print(f"{chunks[0][:200]}...")
if __name__ == "__main__":
main()
```
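The class above reads from a file, but the same pipeline can be applied to a string already in memory by reusing its internal steps. `process_text` below is a hypothetical helper, not part of the class:

```python
from nltk.tokenize import sent_tokenize


def process_text(chunker, text, context_window=1, percentile_threshold=95, min_chunk_size=3):
    """Hypothetical helper: run the same pipeline as process_file on a raw string."""
    sentences = sent_tokenize(text)
    contextualized = chunker._add_context(sentences, context_window)
    embeddings = chunker.model.encode(contextualized)
    distances = chunker._calculate_distances(embeddings)
    breakpoints = chunker._identify_breakpoints(distances, percentile_threshold)
    initial_chunks = chunker._create_chunks(sentences, breakpoints)
    chunk_embeddings = chunker.model.encode(initial_chunks)
    return chunker._merge_small_chunks(initial_chunks, chunk_embeddings, min_chunk_size)


chunker = TextChunker()
with open("path/to/your/document.txt", encoding="utf-8") as f:
    chunks = process_text(chunker, f.read())
print(f"Split text into {len(chunks)} chunks")
```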
## Evaluation Results
Testing was performed with various buffer sizes (context windows) and breakpoint percentile thresholds, using the King James Version of the book of Romans (available here: https://quod.lib.umich.edu/cgi/k/kjv/kjv-idx?type=DIV1&byte=5015363).
Intra-chunk similarity (how similar the sentences within a given chunk are to each other; higher = more semantically similar):
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/JtjTFuh2DhEQCDwkOrrkb.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/15RQ0-Lu8PvQ1IxxKJsGM.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/K3XfmdKXyg75n1-77DK0X.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/ojbecOyGaxDY1PTW1WOY4.png)
Inter-chunk similarity (how similar the respective chunks are to each other; lower = less semantically similar):
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/bI02qstwYwmom5Kfae34N.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/MhLrbp_AuXJMtbrrPoIiO.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/Zp-iZF_clPuxHA0CfRiJF.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/JumSdk0Vxi4zJtiCoA64B.png)
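One way such metrics could be computed is sketched below, assuming intra-chunk similarity is the mean pairwise cosine similarity of sentence embeddings within a chunk and inter-chunk similarity is the mean pairwise cosine similarity between whole-chunk embeddings; the plots above may have been produced differently:

```python
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt', quiet=True)
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')


def intra_chunk_similarity(chunk):
    """Mean pairwise cosine similarity between the sentences of one chunk."""
    sentences = sent_tokenize(chunk)
    if len(sentences) < 2:
        return 1.0
    embeddings = model.encode(sentences)
    sims = cosine_similarity(embeddings)
    # Average the upper triangle, excluding self-similarities on the diagonal.
    return sims[np.triu_indices(len(sentences), k=1)].mean()


def inter_chunk_similarity(chunks):
    """Mean pairwise cosine similarity between whole-chunk embeddings."""
    embeddings = model.encode(chunks)
    sims = cosine_similarity(embeddings)
    return sims[np.triu_indices(len(chunks), k=1)].mean()
```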
## Citing and Authors
If you find this model helpful, please enjoy it and give all credit to Greg Kamradt for the idea.