findTextChunkContext
Syntax
idxContext = findTextChunkContext(chunkTable,idx)
idxContext = findTextChunkContext(chunkTable,idx,Name=Value)
Description
Use findTextChunkContext to identify context text chunks surrounding target text chunks.
The surrounding text chunks can provide useful context for the target chunks. For example, in retrieval-augmented generation (RAG), after you have retrieved the relevant text chunks, you can provide additional context to a large language model by adding text chunks surrounding the retrieved chunks.
idxContext = findTextChunkContext(chunkTable,idx) finds text chunks surrounding the target text chunks at chunkTable(idx,:).
idxContext = findTextChunkContext(chunkTable,idx,Name=Value) specifies additional options using one or more name-value arguments.
Examples
Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.
str = extractFileText("sonnets.txt");
Split str into text chunks using the splitTextChunks function. Specify the target length as 100.
chunkTable = splitTextChunks(str,TargetLength=100);
summary(chunkTable)
chunkTable: 1277×1 table
Variables:
Text: string
Statistics for applicable variables:
NumMissing
Text 0
Specify the text chunk index.
idx = 10;
Specify a target length of 300 characters. The findTextChunkContext function returns the indices of the target text chunk and surrounding context chunks, such that the total length of the target chunk plus the context chunks is 300 characters or less.
findTextChunkContext(chunkTable,idx,TargetLength=300)
ans = 1×4
8 9 10 11
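To inspect the result, you can index back into chunkTable with the returned indices. The following sketch repeats the steps of the example and assumes that sonnets.txt is on the path and that Text Analytics Toolbox is installed. The variable names idxContext, contextText, and totalLength are illustrative and do not come from the function itself.

```matlab
% Repeat the example, but keep the returned indices in a variable.
str = extractFileText("sonnets.txt");
chunkTable = splitTextChunks(str,TargetLength=100);
idx = 10;
idxContext = findTextChunkContext(chunkTable,idx,TargetLength=300);

% Look up the text of the target chunk and its context chunks.
contextText = chunkTable.Text(idxContext);

% Total number of characters across the returned chunks.
totalLength = sum(strlength(contextText));

% Join the chunks into one string, one chunk per line, for display.
disp(join(contextText,newline))
```

Comparing totalLength with the TargetLength value is a quick way to check how much of the character budget the selected chunks use.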
Input Arguments
Input table of text chunks. chunkTable must contain a variable
named Text, specified as a string scalar that contains the text
chunks.
Create a table of text chunks from a document or table of documents by using the
splitTextChunks, splitHTMLSections, splitMarkdownSections, or
splitCustomSections function.
Index of the text chunk, or indices of the text chunks, for which to find context, specified as a positive integer or array of positive integers.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Example: findTextChunkContext(chunkTable,idx,TargetLength=100) sets
the total target length of the output text chunks to 100.
Total target length of output text chunks, specified as a positive integer that represents the number of characters.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Complex Number Support: Yes
Section levels within which to find text chunk context, specified as a string array, character vector, or cell array of character vectors.
If you specify this argument, then the function only identifies text chunks as
contextual if they are part of the same section, that is, if they have all of the same
entries as the target text chunk in the columns specified by
Levels.
If you do not specify this argument, then the function uses every table variable
except for Text as section levels. For example, if a table has variables
Text, H1, and H2, then by
default, the function uses Levels = ["H1","H2"].
If chunkTable does not have a variable specified by
Levels, then findTextChunkContext ignores
that variable. For example, if you specify Levels = ["H1","H7"] and
there is no table variable named "H7", then the function uses
Levels = "H1".
Example: ["H1","H2"]
Data Types: char | string | cell
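The "same section" rule described for Levels can be illustrated with base MATLAB alone. The following is a minimal sketch of that rule, not the toolbox implementation: the table contents and the variable names levels, sameSection, and candidateIdx are made up for illustration, and the sketch only finds which rows share a section with the target chunk, without applying any target length.

```matlab
% Hypothetical chunk table with one section-level variable, H1.
Text = ["Intro text."; "More intro."; "Method A."; "Method B."; "Results."];
H1 = ["Intro"; "Intro"; "Methods"; "Methods"; "Results"];
chunkTable = table(Text,H1);

idx = 3;                 % target chunk
levels = ["H1","H7"];    % "H7" is not a variable of chunkTable

% Ignore requested levels that are not variables of the table.
levels = levels(ismember(levels,chunkTable.Properties.VariableNames));

% A row is a candidate only if it matches the target in every level.
sameSection = true(height(chunkTable),1);
for k = 1:numel(levels)
    col = chunkTable.(levels(k));
    sameSection = sameSection & (col == col(idx));
end

candidateIdx = find(sameSection).';   % rows 3 and 4 share the "Methods" section
```

In this sketch, rows 1, 2, and 5 are excluded because their H1 entry differs from that of the target chunk, even though rows 2 and 5 are close to the target.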
Output Arguments
Indices of context text chunks and the target chunk or chunks, returned as a positive integer or as an array of positive integers.
More About
Many analysis tools, including large language models (LLMs), perform better on small chunks of text than on large documents. Text Analytics Toolbox™ includes a range of functions that allow you to split large documents into semantically meaningful chunks.
The splitTextChunks function splits a document recursively into text chunks
of a given target length. The function first splits a document into paragraphs. If any
of the paragraphs are longer than the target length, then the function splits those
paragraphs into sentences, and so on.
chunks = splitTextChunks(str);
Split your document into sections and preserve the section metadata using one of these functions:
- splitHTMLSections: Split an HTML-formatted document into HTML sections according to the section tags <h1>...</h1>, <h2>...</h2>, …, <h6>...</h6>.
- splitMarkdownSections: Split a Markdown-formatted document into Markdown sections, for example according to ATX section tags #, ##, …, ######.
- splitCustomSections: Split a document into custom sections according to custom section delimiters.
Split your documents or your chunks recursively into paragraphs, sentences, and tokens using the splitTextChunks function.
To avoid redundancy, join similar adjacent chunks using the joinSimilarTextChunks function.
Add overlap between adjacent text chunks using the addTextChunkOverlap function. Adding text chunk overlap avoids changing the meaning of sentences by splitting at inopportune points, for example, splitting the sentence "I would never say I love cats" into "I would never say" and "I love cats." Adding overlap in this example results in the two chunks "I would never say I love" and "never say I love cats."
You can also add surrounding text to individual chunks as context by using the findTextChunkContext function.
For an example showing the advanced workflow, see Split Document Into Semantically Meaningful Text Chunks.
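The overlap example with "I would never say I love cats" can be reproduced by hand with base MATLAB string functions. This sketch only illustrates the idea of overlap and does not call addTextChunkOverlap; measuring overlap in whole words and the variable names numOverlapWords, w1, w2, and overlapped are assumptions made for this illustration.

```matlab
% The two chunks produced by splitting the sentence at an inopportune point.
chunks = ["I would never say"; "I love cats."];

% Assumption for this sketch: overlap is measured in whole words.
numOverlapWords = 2;

% Split each chunk into a row vector of words.
w1 = split(chunks(1)).';
w2 = split(chunks(2)).';

% Extend the first chunk with the first words of the second chunk, and
% prepend the last words of the first chunk to the second chunk.
overlapped = [join([w1, w2(1:numOverlapWords)]);
              join([w1(end-numOverlapWords+1:end), w2])];
```

With an overlap of two words, overlapped contains "I would never say I love" and "never say I love cats.", so each chunk keeps the words that determine the meaning of the sentence.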
RAG combines the text generation capabilities of large language models (LLMs) with reliable information contained in a set of source documents. First, retrieve documents relevant to the user prompt from the set of source documents. Then, append the relevant documents to the prompt and use the LLM to generate a response.
To improve the quality of the generated output, split large documents into smaller, semantically meaningful chunks.
Use information retrieval to identify the text chunks that are relevant to the query. For more information, see Information Retrieval with Document Embeddings.
Create a prompt based on the most relevant chunks. To provide the LLM with additional context, you can add text from adjacent chunks within the same section by using the findTextChunkContext function, or you can add overlap between text chunks before information retrieval by using the addTextChunkOverlap function. Create a Markdown-formatted string from the text chunks using the formatTextChunks function. For an example, see Create Large Language Model (LLM) Prompt from Text Chunk.
Generate an answer using an LLM. To connect to large language model APIs using MATLAB, use the Large Language Models (LLMs) with MATLAB add-on.
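The prompt-building step can be sketched as follows. This is an illustration under stated assumptions rather than a complete RAG pipeline: retrievedIdx stands in for the output of an information retrieval step, query is a made-up user question, and the prompt is assembled with plain string concatenation instead of formatTextChunks. The sketch assumes that sonnets.txt is on the path and that Text Analytics Toolbox is installed.

```matlab
% Split the source document into chunks.
str = extractFileText("sonnets.txt");
chunkTable = splitTextChunks(str,TargetLength=100);

% Assumption: an earlier retrieval step identified chunk 10 as relevant.
retrievedIdx = 10;
query = "What does the poet say about time?";   % hypothetical user question

% Expand the retrieved chunk with surrounding context chunks.
idxContext = findTextChunkContext(chunkTable,retrievedIdx,TargetLength=300);

% Join the target and context chunks into one block of context text.
contextText = join(chunkTable.Text(idxContext),newline);

% Assemble a simple prompt from the context and the question.
prompt = "Answer the question using only the context below." + newline + ...
    newline + "Context:" + newline + contextText + newline + ...
    newline + "Question: " + query;
```

The resulting prompt string can then be passed to the LLM interface of your choice.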
Version History
Introduced in R2026a