textanalytics.unicode.nfc

Unicode composed normalized form (NFC)

Since R2022b

Syntax

newStr = textanalytics.unicode.nfc(str)

Description

newStr = textanalytics.unicode.nfc(str) normalizes the string str to the Unicode canonical composition form (NFC).

Examples

Normalize String to Unicode Canonical Composition Form

Strings that look identical can have different underlying representations. Normalizing to the Unicode canonical composition form (NFC) ensures that canonically equivalent strings share a single binary representation.

Consider the string "jalapeño", where the character "ñ" is represented as the character "n" followed by the code unit "\x0303", which corresponds to the combining tilde diacritic "~" (U+0303). On some systems, this "ñ" appears as two characters. The string has length 9.

str = compose("jalapen\x0303o")

str = 
"jalapeño"

strlength(str)

ans = 9

Normalize the string using the textanalytics.unicode.nfc function. On some systems, the output string appears to be identical to the input string.

newStr = textanalytics.unicode.nfc(str)

newStr = 
"jalapeño"

View the length of the normalized string. The normalized representation has one fewer code unit than the input. In this case, the function merges the letter "n" and the diacritic "~" into a single code unit that represents "ñ".

strlength(newStr)

ans = 8
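To see where the extra code unit goes, it can help to look at the numeric code unit values directly. The following is a minimal sketch, assuming the same input as above; the variable names unitsBefore and unitsAfter are illustrative and not part of the function's interface.

```matlab
% Decomposed input: "n" (U+006E) followed by the combining tilde (U+0303).
str = compose("jalapen\x0303o");
unitsBefore = double(char(str));      % numeric code unit values of the input

% NFC output: "n" and the combining tilde merge into "ñ" (U+00F1).
newStr = textanalytics.unicode.nfc(str);
unitsAfter = double(char(newStr));    % numeric code unit values of the output

% Show the two code units at positions 7 and 8 of the input, and the
% single code unit at position 7 of the output, in hexadecimal.
disp(dec2hex(unitsBefore(7:8)))
disp(dec2hex(unitsAfter(7)))
```

All other code units are unchanged by the normalization in this example.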

Extract the seventh code unit of the normalized string.

extractBetween(newStr,7,7)

ans = 
"ñ"

Check whether str and newStr are equal using the == operator. The operator returns 0 (false) because the strings have different underlying representations, even though they can look identical when displayed.

tf = str == newStr

tf = logical
   0
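A common use of this behavior is to normalize both operands before comparing them. The following is a sketch under the assumption that one string arrives in decomposed form and the other already uses the precomposed "ñ"; the variable names precomposed, tfRaw, and tfNormalized are illustrative.

```matlab
% Decomposed form: "n" followed by the combining tilde (U+0303).
str = compose("jalapen\x0303o");

% Precomposed form: "ñ" stored as the single code unit U+00F1 (decimal 241).
precomposed = "jalape" + char(241) + "o";

% Direct comparison sees two different code unit sequences.
tfRaw = str == precomposed;

% Comparing the NFC forms treats the two spellings as the same text.
tfNormalized = textanalytics.unicode.nfc(str) == textanalytics.unicode.nfc(precomposed);
```

Here tfRaw is false and tfNormalized is true, because both inputs normalize to the same eight code units.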

Input Arguments

`str` — Input text
string array | character vector | cell array of character vectors

Input text, specified as a string array, character vector, or cell array of character vectors.

Example: ["An example of a short sentence."; "A second short sentence."]

Data Types: string | char | cell

Output Arguments

`newStr` — Output text
string array | character vector | cell array of character vectors

Output text, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.
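The correspondence between input and output types can be checked with a short sketch like the one below. The sample words and variable names are illustrative, and the sketch assumes that array inputs are normalized element by element.

```matlab
% Build the same decomposed text in each supported input type.
strIn  = compose(["jalapen\x0303o"; "cafe\x0301"]);  % 2-by-1 string array
charIn = char(strIn(1));                              % character vector
cellIn = cellstr(strIn);                              % cell array of character vectors

% Normalize each input.
strOut  = textanalytics.unicode.nfc(strIn);
charOut = textanalytics.unicode.nfc(charIn);
cellOut = textanalytics.unicode.nfc(cellIn);

% Report the class of each output; each should match the class of its input.
disp(class(strOut))
disp(class(charOut))
disp(class(cellOut))
```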

Algorithms

Unicode Normalization Forms

For more information about Unicode normalization forms, see Unicode Standard Annex #15: Unicode Normalization Forms [1].

References

[1] Whistler, Ken, ed. "Unicode Standard Annex #15: Unicode Normalization Forms." Unicode Technical Reports, August 27, 2021. https://unicode.org/reports/tr15/.

Version History

Introduced in R2022b
