This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English version of the page.

Note: This page has been translated by MathWorks. Click here to see
To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

Correct Spelling Using Edit Distance Searchers

This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.

If you have misspelled words in a collection of text, then you can use edit distance searchers to find the nearest correctly spelled words to a given vocabulary. To correct the spelling of misspelled words in documents, replace them with the nearest neighbors in the vocabulary.

Lemmatization with normalizeWords and word2vec requires correctly spelled words to work. Use edit distance searchers to find the nearest correctly spelled word to misspelled words according to an edit distance. For example, the number of adjacent grapheme swaps and grapheme insertions, deletions, and substitutions.

Load Data

Create a vocabulary of known words. Download the Spell Checking Oriented Word Lists (SCOWL) from https://sourceforge.net/projects/wordlist/. Import the words from the downloaded data using the supporting function scowlWordList.

folderName = "scowl-2018.04.16";
maxSize = 60;
vocabulary = scowlWordList(folderName,'english',maxSize);

View the number of words in the vocabulary.

numWords = numel(vocabulary)
numWords = 98129

Create Simple Spelling Corrector

Using the imported vocabulary, create an edit distance searcher with a maximum distance of 2. For better results, allow for adjacent grapheme swaps by setting the 'SwapCost' option to 1. For large vocabularies, this can take a few minutes.

maxDist = 2;
eds = editDistanceSearcher(vocabulary,maxDist,'SwapCost',1);

This edit distance searcher is case sensitive which means that changing the case of characters contributes to the edit distance. For example, the searcher can find the neighbor "testing" for the word "tseting" because it has edit distance 1 (one swap), but not of the word "TSeTiNG" because it has edit distance 6.

Correct Spelling

Correct the spelling of misspelled words in an array of tokenized documents by selecting the misspelled words and finding the nearest neighbors in the edit distance searcher.

Create a tokenized document object containing typos and spelling mistakes.

str = "An exmaple dccoument with typos and averyunusualword.";
document = tokenizedDocument(str)
document = 
  tokenizedDocument:

   8 tokens: An exmaple dccoument with typos and averyunusualword .

Convert the documents to a string array of words using the string function.

words = string(document)
words = 1×8 string array
    "An"    "exmaple"    "dccoument"    "with"    "typos"    "and"    "averyunusualword"    "."

Find the words that need correction. To ignore words that are correctly spelled, find the indices of the words already in the vocabulary. To ignore punctuation and complex tokens such as email addresses, find the indices of the words which do not have the token types "letters" or "other". Get the token details from the document using the tokenDetails function.

tdetails = tokenDetails(document);
idxVocabularyWords = ismember(tdetails.Token,eds.Vocabulary);

idxComplexTokens = ...
    tdetails.Type ~= "letters" & ...
    tdetails.Type ~= "other";

idxWordsToCheck = ...
    ~idxVocabularyWords & ...
    ~idxComplexTokens
idxWordsToCheck = 8×1 logical array

   1
   1
   1
   0
   0
   0
   1
   0

Find the numeric indices of the words and view the corresponding words.

idxWordsToCheck = find(idxWordsToCheck)
idxWordsToCheck = 4×1

     1
     2
     3
     7

wordsToCheck = words(idxWordsToCheck)
wordsToCheck = 1×4 string array
    "An"    "exmaple"    "dccoument"    "averyunusualword"

Notice that the word "An" is flagged as a word to check. This word is flagged because the vocabulary does not contain the word "An" with an uppercase "A". A later section in the example shows how to create a case insensitive spelling corrector.

Find the nearest words and their distances using the knnsearch function with the edit distance searcher.

[idxNearestWords,d] = knnsearch(eds,wordsToCheck)
idxNearestWords = 4×1

         165
        1353
        1152
         NaN

d = 4×1

     1
     1
     2
   Inf

If any of the words are not found in the searcher, then the function returns index NaN with distance Inf. The word "averyunusualword" does not have a match within edit distance 2, so the function returns the index NaN for that word.

Find the indices of the words with positive finite edit distances.

idxMatches = ~isnan(idxNearestWords)
idxMatches = 4×1 logical array

   1
   1
   1
   0

Get the indices of the words with matches in the searcher and view the corresponding corrected words in the vocabulary.

idxCorrectedWords = idxNearestWords(idxMatches)
idxCorrectedWords = 3×1

         165
        1353
        1152

correctedWords = eds.Vocabulary(idxCorrectedWords)
correctedWords = 1×3 string array
    "an"    "example"    "document"

Replace the misspelled words that have matches with the corrected words.

idxToCorrect = idxWordsToCheck(idxMatches);
words(idxToCorrect) = correctedWords
words = 1×8 string array
    "an"    "example"    "document"    "with"    "typos"    "and"    "averyunusualword"    "."

To create a tokenized document of these words, use the tokenizedDocument function and set 'TokenizedMethod' to 'none'.

document = tokenizedDocument(words,'TokenizeMethod','none')
document = 
  tokenizedDocument:

   8 tokens: an example document with typos and averyunusualword .

The next section shows how to correct the spelling of multiple documents at once by creating a custom spelling correction function and using docfun.

Create Spelling Correction Function

To correct the spelling in multiple documents at once, create a custom function using the code from the previous section and use this function with the docfun function.

Create a function that takes an edit distance searcher, a string array of words, and the corresponding table of token details as inputs and outputs the corrected words. The correctSpelling function, listed at the end of the example, corrects the spelling in a string array of words using the corresponding token details and an edit distance searcher.

To use this function with the docfun function, create a function handle that takes a string array of words and the corresponding table of token details as the inputs.

func = @(words,tdetails) correctSpelling(eds,words,tdetails);

Correct the spelling of an array of tokenized documents using docfun with the function handle func.

str = [
    "Here is some reallyu badly wrirten texct."
    "Some moree mitsakes here too."];
documents = tokenizedDocument(str);
documentsCorrected = docfun(func,documents)
documentsCorrected = 
  2×1 tokenizedDocument:

    8 tokens: here is some really badly written text .
    6 tokens: come more mistakes here too .

Note that uppercase characters can get corrected to different lowercase characters. For example, the word "Some" can get corrected to "come". If multiple words in the edit distance searcher vocabulary have the same edit distance to the input, then the function outputs the first result it found. For example, the words "come" and "some" both have edit distance 1 from the word "Some".

The next section shows how to create an spelling corrector that is case insensitive.

Create Case Insensitive Spelling Corrector

To prevent differences in case clashing with other substitutions, create an edit distance searcher with the vocabulary in lower case and convert the documents to lowercase before using the edit distance searcher.

Convert the vocabulary to lowercase. This operation can introduce duplicate words, remove them by taking the unique values only.

vocabularyLower = lower(vocabulary);
vocabularyLower = unique(vocabularyLower);

Create an edit distance searcher using the lowercase vocabulary using the same options as before. This can take a few minutes to run.

maxDist = 2;
eds = editDistanceSearcher(vocabularyLower,maxDist,'SwapCost',1);

Use the edit distance searcher to correct the spelling of the words in tokenized document. To use the case insensitive spelling corrector, convert the documents to lowercase.

documentsLower = lower(documents);

Correct the spelling using the new edit distance searcher using same steps as before.

func = @(words,tdetails) correctSpelling(eds,words,tdetails);
documentsCorrected = docfun(func,documentsLower)
documentsCorrected = 
  2×1 tokenizedDocument:

    8 tokens: here is some really badly written text .
    6 tokens: some more mistakes here too .

Here, the word "Some" in the original text is converted to "some" before being input to the spelling corrector. The corresponding word "some" is unaffected by the searcher as the word some occurs in the vocabulary.

Spelling Correction Function

The correctSpelling function corrects the spelling in a string array of words using the corresponding token details and an edit distance searcher. You can use this function with docfun to correct the spelling of multiple documents at once.

function words = correctSpelling(eds,words,tdetails)

% Get indices of misspelled words ignoring complex tokens.
idxVocabularyWords = ismember(tdetails.Token,eds.Vocabulary);

idxComplexTokens = ...
    tdetails.Type ~= "letters" & ...
    tdetails.Type ~= "other";

idxWordsToCheck = ...
    ~idxVocabularyWords & ...
    ~idxComplexTokens;

% Convert to numeric indices.
idxWordsToCheck = find(idxWordsToCheck);

% Find nearest words.
wordsToCheck = words(idxWordsToCheck);
idxNearestWords = knnsearch(eds,wordsToCheck);

% Find words with matches.
idxMatches = ~isnan(idxNearestWords);

% Get corrected words.
idxCorrectedWords = idxNearestWords(idxMatches);
correctedWords = eds.Vocabulary(idxCorrectedWords);

% Correct words.
idxToCorrect = idxWordsToCheck(idxMatches);
words(idxToCorrect) = correctedWords;

end

See Also

| | | | |

Related Topics