findTextChunkContext

Find text chunk context

Since R2026a

    Description

    Use findTextChunkContext to identify context text chunks surrounding target text chunks.

    The surrounding text chunks can provide useful context for the target chunks. For example, in retrieval-augmented generation (RAG), after you have retrieved the relevant text chunks, you can provide additional context to a large language model by adding text chunks surrounding the retrieved chunks.

    idxContext = findTextChunkContext(chunkTable,idx) finds text chunks surrounding the target text chunks at chunkTable(idx,:).

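    idxContext = findTextChunkContext(chunkTable,idx,Name=Value) specifies additional options using one or more name-value arguments. For example, TargetLength=300 limits the total length of the returned chunks to 300 characters.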

    Examples

    Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.

    str = extractFileText("sonnets.txt");

    Split str into text chunks using the splitTextChunks function. Specify the target length as 100.

    chunkTable = splitTextChunks(str,TargetLength=100);
    summary(chunkTable)
    chunkTable: 1277×1 table
    
    Variables:
    
        Text: string
    
    Statistics for applicable variables:
    
                NumMissing
    
        Text        0     
    

    Specify the text chunk index.

    idx = 10;

    Specify a target length of 300 characters. The findTextChunkContext function returns the indices of the target text chunk and surrounding context chunks, such that the total length of the target chunk plus the context chunks is 300 characters or less.

    findTextChunkContext(chunkTable,idx,TargetLength=300)
    ans = 1×4
    
         8     9    10    11
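
    In a RAG workflow, the returned indices are typically used to look up the chunk text. The following minimal sketch repeats the example above and then joins the selected chunks into a single string. The variable names contextChunks, totalLength, and contextText are illustrative, and the code assumes a release that includes findTextChunkContext.

```matlab
% Recreate the chunk table from the example above.
str = extractFileText("sonnets.txt");
chunkTable = splitTextChunks(str,TargetLength=100);
idx = 10;

% Indices of the target chunk and its surrounding context chunks.
idxContext = findTextChunkContext(chunkTable,idx,TargetLength=300);

% Look up the text of the selected chunks.
contextChunks = chunkTable.Text(idxContext);

% Total number of characters across the selected chunks.
totalLength = sum(strlength(contextChunks))

% Join the chunks into one string, for example to add to a prompt.
contextText = join(contextChunks,newline);
```

    Based on the description of TargetLength above, totalLength should not exceed 300 here.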
    
    

    Input Arguments

    Input table of text chunks (chunkTable), specified as a table. chunkTable must contain a variable named Text, specified as a string array in which each element is one text chunk.

    Create a table of text chunks from a document or table of documents by using the splitTextChunks, splitHTMLSections, or splitMarkdownSections function.

    Index or indices of the target text chunks for which to find context (idx), specified as a positive integer or array of positive integers.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
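
    Because idx can be an array, several target chunks can be passed in one call, for example the top matches from a retrieval step. The following is a minimal sketch under that assumption. The index values are made up for illustration, and this page does not describe how the context is divided among multiple targets, so no output is shown.

```matlab
% Build a chunk table as in the example on this page.
str = extractFileText("sonnets.txt");
chunkTable = splitTextChunks(str,TargetLength=100);

% Hypothetical target indices (illustrative values only).
idx = [10 250];

% Indices of the target chunks together with their context chunks.
idxContext = findTextChunkContext(chunkTable,idx);

% Display the text of the selected chunks.
chunkTable.Text(idxContext)
```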

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: findTextChunkContext(chunkTable,idx,TargetLength=100) sets the total target length of the output text chunks to 100.

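    Total target length of output text chunks (TargetLength), specified as a positive integer that represents the number of characters. The combined length of the target chunk and the context chunks does not exceed this value.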

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
    Complex Number Support: Yes

    Section levels within which to find text chunk context (Levels), specified as a string array, character vector, or cell array of character vectors.

    If you specify this argument, then the function only identifies text chunks as contextual if they are part of the same section, that is, if they have all of the same entries as the target text chunk in the columns specified by Levels.
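
    The same-section rule can be illustrated with ordinary table indexing. The following is a minimal sketch, not the function's implementation. The table contents and the names levels, sameSection, and candidates are made up for illustration.

```matlab
% Hypothetical chunk table with one heading variable (illustrative data only).
chunkTable = table( ...
    ["Intro a"; "Intro b"; "Methods a"; "Methods b"; "Methods c"], ...
    ["Intro"; "Intro"; "Methods"; "Methods"; "Methods"], ...
    VariableNames=["Text","H1"]);

idx = 4;          % target chunk
levels = "H1";    % section levels to match

% Rows whose entries in the level variables all equal those of the target row.
sameSection = all(chunkTable{:,levels} == chunkTable{idx,levels},2);

% Row indices that are eligible as context for the target chunk.
candidates = find(sameSection)'
```

    Only rows 3 to 5 share the H1 entry of the target row, so only those rows are eligible as context under this rule.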

    If you do not specify this argument, then the function uses every table variable except for Text as section levels. For example, if a table has variables Text, H1, and H2, then by default, the function uses Levels = ["H1","H2"].

    If chunkTable does not have a variable specified by Levels, then findTextChunkContext ignores that variable. For example, if you specify Levels = ["H1","H7"] and chunkTable has no variable named "H7", then the function uses Levels = "H1".
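
    The two rules for resolving Levels can be reproduced with standard string and set operations. This is a sketch of the documented behavior rather than the function's own code. The table contents and the names vars, defaultLevels, requested, and effectiveLevels are illustrative.

```matlab
% Hypothetical chunk table with two heading variables (illustrative data only).
chunkTable = table(["a";"b"],["Ch1";"Ch1"],["Sec1";"Sec2"], ...
    VariableNames=["Text","H1","H2"]);
vars = string(chunkTable.Properties.VariableNames);

% Default: every table variable except Text.
defaultLevels = vars(vars ~= "Text")

% Requested levels that are not table variables are dropped.
requested = ["H1","H7"];
effectiveLevels = intersect(requested,vars,"stable")
```

    Here defaultLevels is ["H1","H2"] and effectiveLevels is "H1", matching the two examples in the text.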

    Example: ["H1","H2"]

    Data Types: char | string | cell

    Output Arguments

    Indices of the target chunk or chunks and their context chunks (idxContext), returned as a positive integer or as an array of positive integers.

    Version History

    Introduced in R2026a