regexprep
Replace text in words of documents using regular expression
Description
Text Analytics Toolbox™ provides functions for common text preprocessing steps. For example, to remove
punctuation and symbol characters, use erasePunctuation
or to remove stem words using the Porter stemmer, use normalizeWords
. For more information, see Text Data Preparation.
replaces all occurrences of the regular expression newDocuments
= regexprep(documents
,expression
,replace
)expression
in the words of documents
with the text in
replace
.
The function matches each word independently. The match does not have to span the whole word.
Examples
Update Text in Words
Replace words that begin with "s"
, end "e"
, and have at least one character between them. To match whole words, use "^"
to match the start of a word and "$"
to match the end of the word.
documents = tokenizedDocument([ ... "an example of a short sentence" "a second short sentence"])
documents = 2x1 tokenizedDocument: 6 tokens: an example of a short sentence 4 tokens: a second short sentence
expression = "^s(\w+)e$"; replace = "thing"; newDocuments = regexprep(documents,expression,replace)
newDocuments = 2x1 tokenizedDocument: 6 tokens: an example of a short thing 4 tokens: a second short thing
If you do not use "^"
and "$"
, then you can match substrings of the words. Replace all vowels with "_".
expression = "[aeiou]"; replace = "\_"; newDocuments = regexprep(documents,expression,replace)
newDocuments = 2x1 tokenizedDocument: 6 tokens: _n _x_mpl_ _f _ sh_rt s_nt_nc_ 4 tokens: _ s_c_nd sh_rt s_nt_nc_
Include Captured Tokens in Word Replacement
Replace variations of the word "walk"
by capturing the letters that follow "walk"
.
documents = tokenizedDocument([ "I walk" "they walked" "we are walking"])
documents = 3x1 tokenizedDocument: 2 tokens: I walk 2 tokens: they walked 3 tokens: we are walking
expression = "walk(\w*)"; replace = "ascend$1"; newDocuments = regexprep(documents,expression,replace)
newDocuments = 3x1 tokenizedDocument: 2 tokens: I ascend 2 tokens: they ascended 3 tokens: we are ascending
Input Arguments
documents
— Input documents
tokenizedDocument
array
Input documents, specified as a tokenizedDocument
array.
expression
— Regular expression
character vector | cell array of character vectors | string array
Regular expression, specified as a character vector, a cell
array of character vectors, or a string array. Each expression can
contain characters, metacharacters, operators, tokens, and flags that
specify patterns to match in str
.
The following tables describe the elements of regular expressions.
Metacharacters
Metacharacters represent letters, letter ranges, digits, and space characters. Use them to construct a generalized pattern of characters.
Metacharacter | Description | Example |
---|---|---|
| Any single character, including white space |
|
| Any character contained within the square brackets. The following characters are treated
literally: |
|
| Any character not contained within the square brackets. The following characters are treated
literally: |
|
| Any character in the range of |
|
| Any alphabetic, numeric, or underscore character. For
English character sets, |
|
| Any character that is not alphabetic, numeric, or underscore.
For English character sets, |
|
| Any white-space character; equivalent to |
|
| Any non-white-space character; equivalent to |
|
| Any numeric digit; equivalent to |
|
| Any nondigit character; equivalent to |
|
| Character of octal value |
|
| Character of hexadecimal value |
|
Character Representation
Operator | Description |
---|---|
| Alarm (beep) |
| Backspace |
| Form feed |
| New line |
| Carriage return |
| Horizontal tab |
| Vertical tab |
| Any character with special meaning in regular expressions
that you want to match literally (for example, use |
Quantifiers
Quantifiers specify the number of times a pattern must occur in the matching text.
Quantifier | Number of Times Expression Occurs | Example |
---|---|---|
| 0 or more times consecutively. |
|
| 0 times or 1 time. |
|
| 1 or more times consecutively. |
|
| At least
|
|
| At least
|
|
| Exactly Equivalent
to |
|
Quantifiers can appear in three modes, described in the following table. q represents any of the quantifiers in the previous table.
Mode | Description | Example |
---|---|---|
| Greedy expression: match as many characters as possible. | Given the text
|
| Lazy expression: match as few characters as necessary. | Given the text
|
| Possessive expression: match as much as possible, but do not rescan any portions of the text. | Given the text |
Grouping Operators
Grouping operators allow you to capture tokens, apply one operator to multiple elements, or disable backtracking in a specific group.
Grouping Operator | Description | Example |
---|---|---|
| Group elements of the expression and capture tokens. |
|
| Group, but do not capture tokens. |
Without
grouping, |
| Group atomically. Do not backtrack within the group to complete the match, and do not capture tokens. |
|
| Match expression If
there is a match with You can include |
|
Anchors
Anchors in the expression match the beginning or end of the input text or word.
Anchor | Matches the... | Example |
---|---|---|
| Beginning of the input text. |
|
| End of the input text. |
|
| Beginning of a word. |
|
| End of a word. |
|
Lookaround Assertions
Lookaround assertions look for patterns that immediately precede or follow the intended match, but are not part of the match.
The pointer remains at the current location, and characters
that correspond to the test
expression are not
captured or discarded. Therefore, lookahead assertions can match overlapping
character groups.
Lookaround Assertion | Description | Example |
---|---|---|
| Look ahead for characters that match |
|
| Look ahead for characters that do not match |
|
| Look behind for characters that match |
|
| Look behind for characters that do not match |
|
If you specify a lookahead assertion before an
expression, the operation is equivalent to a logical AND
.
Operation | Description | Example |
---|---|---|
| Match both |
|
| Match |
|
Logical and Conditional Operators
Logical and conditional operators allow you to test the state
of a given condition, and then use the outcome to determine which
pattern, if any, to match next. These operators support logical OR
,
and if
or if/else
conditions.
Conditions can be tokens, lookaround operators, or dynamic expressions
of the form (?@cmd)
. Dynamic expressions must return
a logical or numeric value.
Conditional Operator | Description | Example |
---|---|---|
| Match expression If
there is a match with |
|
| If condition |
|
| If condition |
|
Token Operators
Tokens are portions of the matched text that you define by enclosing part of the regular expression in parentheses. You can refer to a token by its sequence in the text (an ordinal token), or assign names to tokens for easier code maintenance and readable output.
Ordinal Token Operator | Description | Example |
---|---|---|
| Capture in a token the characters that match the enclosed expression. |
|
| Match the |
|
| If the |
|
Named Token Operator | Description | Example |
---|---|---|
| Capture in a named token the characters that match the enclosed expression. |
|
| Match the token referred to by |
|
| If the named token is found, then match |
|
Note
If an expression has nested parentheses, MATLAB® captures
tokens that correspond to the outermost set of parentheses. For example,
given the search pattern '(and(y|rew))'
, MATLAB creates
a token for 'andrew'
but not for 'y'
or 'rew'
.
Dynamic Regular Expressions
Dynamic expressions allow you to execute a MATLAB command or a regular expression to determine the text to match.
The parentheses that enclose dynamic expressions do not create a capturing group.
Operator | Description | Example |
---|---|---|
| Parse When parsed, |
|
| Execute the MATLAB command represented by |
|
| Execute the MATLAB command represented by |
|
Within dynamic expressions, use the following operators to define replacement text.
Replacement Operator | Description |
---|---|
| Portion of the input text that is currently a match |
| Portion of the input text that precedes the current match |
| Portion of the input text that follows the current match
(use |
|
|
| Named token |
| Output returned when MATLAB executes the command, |
Comments
Characters | Description | Example |
---|---|---|
(?#comment) | Insert a comment in the regular expression. The comment text is ignored when matching the input. |
|
Search Flags
Search flags modify the behavior for matching expressions. An
alternative to using a search flag within an expression is to pass
an option
input argument.
Flag | Description |
---|---|
(?-i) | Match letter case (default for |
(?i) | Do not match letter case (default for |
(?s) | Match dot ( |
(?-s) | Match dot in the pattern with any character that is not a newline character. |
(?-m) | Match the |
(?m) | Match the |
(?-x) | Include space characters and comments when matching (default). |
(?x) | Ignore space characters and comments when matching. Use |
The expression that the flag modifies can appear either after the parentheses, such as
(?i)\w*
or inside the parentheses and separated from the flag with a
colon (:
), such as
(?i:\w*)
The latter syntax allows you to change the behavior for part of a larger expression.
Data Types: char
| cell
| string
replace
— Replacement text
character vector | cell array of character vectors | string array
Replacement text, specified as a character vector, a cell array of character vectors, or a string array, as follows:
If
replace
is a single character vector andexpression
is a cell array of character vectors, thenregexprep
uses the same replacement text for each expression.If
replace
is a cell array ofN
character vectors andexpression
is a single character vector, thenregexprep
attemptsN
matches and replacements.If both
replace
andexpression
are cell arrays of character vectors, then they must contain the same number of elements.regexprep
pairs eachreplace
element with its corresponding element inexpression
.
The replacement text can include regular characters, special characters (such as tabs or new lines), or replacement operators, as shown in the following tables.
Replacement Operator | Description |
---|---|
| Portion of the input text that is currently a match |
| Portion of the input text that precedes the current match |
| Portion of the input text that follows the current match
(use |
|
|
| Named token |
| Output returned when MATLAB executes the command, |
Operator | Description |
---|---|
| Alarm (beep) |
| Backspace |
| Form feed |
| New line |
| Carriage return |
| Horizontal tab |
| Vertical tab |
| Any character with special meaning in regular expressions
that you want to match literally (for example, use |
Data Types: char
| cell
| string
Output Arguments
newDocuments
— Output documents
tokenizedDocument
array
Output documents, returned as a tokenizedDocument
array.
Tips
Text Analytics Toolbox provides functions for common text preprocessing steps. For example, to remove punctuation and symbol characters, use
erasePunctuation
or to remove stem words using the Porter stemmer, usenormalizeWords
. For more information, see Text Data Preparation.
Version History
Introduced in R2017b
MATLAB コマンド
次の MATLAB コマンドに対応するリンクがクリックされました。
コマンドを MATLAB コマンド ウィンドウに入力して実行してください。Web ブラウザーは MATLAB コマンドをサポートしていません。
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list:
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)