stopWords

List of stop words

Words such as "a", "and", "to", and "the" (known as stop words) can add noise to text data. Use the stop word list returned by this function as a starting point for custom lists of words to remove before analysis.

To remove the default list of stop words from tokenized documents using the language details of the documents, use removeStopWords. To remove a custom list of words from tokenized documents, use removeWords.
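As a minimal sketch of the difference between the two functions, the following applies each to the same tokenized sentence. The sentence and the two-word custom list are illustrative assumptions, and the sketch assumes that the short text is detected as English.

```matlab
% Illustrative sentence (an assumption, not part of the example data).
documents = tokenizedDocument("the quick brown fox jumps over a lazy dog");

% Remove the default stop words for the language of the documents.
defaultCleaned = removeStopWords(documents);

% Remove only the words in a custom list.
customWords = ["quick" "lazy"];
customCleaned = removeWords(documents,customWords);

% Compare the remaining tokens of each result.
string(defaultCleaned)
string(customCleaned)
```

removeStopWords needs no word list because it takes the list from the language details of the documents, whereas removeWords removes exactly the words you pass in.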

The function supports English, Japanese, and German text. To learn how to remove stop words in other languages, see Language Considerations.

Syntax

words = stopWords
words = stopWords('Language',language)

Description

words = stopWords returns a string array of common English words that can be removed from documents before analysis.

words = stopWords('Language',language) returns a string array of stop words for the specified language.
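The two syntaxes can be compared with a short sketch like the following. The probe words "the" and "und" are chosen because they appear in the English and German lists shown in the examples on this page; the variable names are illustrative.

```matlab
% Default call: English stop words.
enWords = stopWords;

% Name-value call: German stop words.
deWords = stopWords('Language','de');

% Check which list contains each probe word.
probe = ["the" "und"];
inEnglish = ismember(probe,enWords)
inGerman = ismember(probe,deWords)
```

Each call returns a separate list, so a word that is a stop word in one language is not necessarily in the list for another.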

Examples

To remove the default list of stop words using the language details of documents, use removeStopWords.

To remove a custom list of stop words, use the removeWords function. You can use the stop word list returned by the stopWords function as a starting point.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the first few documents.

documents(1:5)
ans = 
  5x1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet

Create a list of stop words starting with the output of the stopWords function.

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

Remove the custom stop words from the documents and view the first few documents.

documents = removeWords(documents,customStopWords);
documents(1:5)
ans = 
  5x1 tokenizedDocument:

    62 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory contracted thine own bright eyes feedst lights flame selfsubstantial fuel making famine abundance lies self foe sweet self cruel art worlds fresh ornament herald gaudy spring thine own bud buriest content tender churl makst waste niggarding pity world else glutton eat worlds due grave
    61 tokens: forty winters shall besiege brow dig deep trenches beautys field youths proud livery gazed tatterd weed small worth held asked beauty lies treasure lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd beautys couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made art old blood warm feelst cold
    52 tokens: look glass tell face viewest time face form another whose fresh repair renewest beguile world unbless mother fair whose uneard womb disdains tillage husbandry fond tomb selflove stop posterity art mothers glass calls back lovely april prime windows thine age shalt despite wrinkles golden time live rememberd die single thine image dies
    52 tokens: unthrifty loveliness why spend upon self beautys legacy natures bequest gives nothing lend frank lends free beauteous niggard why abuse bounteous largess give profitless usurer why great sum sums yet canst live traffic self alone self sweet self deceive nature calls gone acceptable audit canst leave unused beauty tombed lives th executor
    59 tokens: hours gentle work frame lovely gaze every eye dwell play tyrants same unfair fairly excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
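One way to check the effect of a custom list is to compare token counts before and after removal. The sketch below does this on a short excerpt (the first eight tokens of the third sonnet above) so that it runs without the example file. The variable names are illustrative.

```matlab
% Excerpt from the third document shown above.
excerpt = tokenizedDocument("look thy glass tell face thou viewest time");

% Same custom list as in the example.
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

% Count tokens before and after removing the custom stop words.
before = doclength(excerpt);
cleaned = removeWords(excerpt,customStopWords);
after = doclength(cleaned);

removedCount = before - after
remaining = string(cleaned)
```

The same comparison applied to the full documents array gives one count per sonnet, which corresponds to the drop in the token totals shown in the two listings above.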

Get a list of English stop words using the stopWords function. For readability, reshape the output.

words = stopWords;
reshape(words,[25 9])
ans = 25x9 string array
  Columns 1 through 6

    "a"          "but"         "during"     "hows"       "it's"     "said"     
    "about"      "by"          "each"       "however"    "it’s"     "says"     
    "above"      "can"         "either"     "i"          "its"      "see"      
    "across"     "can't"       "for"        "i'd"        "let's"    "she"      
    "after"      "can’t"       "from"       "i’d"        "let’s"    "she'd"    
    "all"        "cant"        "given"      "i'll"       "lets"     "she’d"    
    "along"      "cannot"      "had"        "i’ll"       "may"      "shed"     
    "also"       "could"       "has"        "i'm"        "me"       "she'll"   
    "am"         "couldn't"    "have"       "i’m"        "more"     "she’ll"   
    "an"         "couldn’t"    "having"     "im"         "most"     "shell"    
    "and"        "couldnt"     "he"         "i've"       "much"     "should"   
    "any"        "did"         "he'd"       "i’ve"       "must"     "since"    
    "are"        "didn't"      "he’d"       "ive"        "my"       "so"       
    "aren't"     "didn’t"      "hed"        "if"         "no"       "some"     
    "aren’t"     "didnt"       "he'll"      "in"         "not"      "such"     
    "arent"      "do"          "he’ll"      "instead"    "now"      "than"     
    "as"         "does"        "her"        "into"       "of"       "that"     
    "at"         "doesn't"     "here"       "is"         "on"       "the"      
    "be"         "doesn’t"     "hers"       "isn't"      "one"      "their"    
    "because"    "doesnt"      "him"        "isn’t"      "only"     "them"     
    "been"       "doing"       "himself"    "isnt"       "or"       "then"     
    "before"     "done"        "his"        "it"         "other"    "there"    
    "being"      "don't"       "how"        "it'll"      "our"      "therefore"
    "between"    "don’t"       "how's"      "it’ll"      "out"      "these"    
    "both"       "dont"        "how’s"      "itll"       "over"     "they"     

  Columns 7 through 9

    "this"       "we’re"      "who’ve"  
    "those"      "we've"      "whove"   
    "through"    "we’ve"      "will"    
    "to"         "weve"       "with"    
    "too"        "were"       "within"  
    "towards"    "what"       "without" 
    "under"      "what's"     "won't"   
    "until"      "what’s"     "won’t"   
    "us"         "whats"      "would"   
    "use"        "when"       "wouldn't"
    "used"       "when's"     "wouldn’t"
    "uses"       "when’s"     "you"     
    "using"      "whens"      "you'd"   
    "very"       "where"      "you’d"   
    "want"       "whether"    "youd"    
    "was"        "which"      "you'll"  
    "wasn't"     "while"      "you’ll"  
    "wasn’t"     "who"        "youll"   
    "wasnt"      "who'll"     "you're"  
    "we"         "who’ll"     "you’re"  
    "we'd"       "wholl"      "youre"   
    "we’d"       "who's"      "you've"  
    "we'll"      "who’s"      "you’ve"  
    "we’ll"      "whos"       "youve"   
    "we're"      "who've"     "your"    

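The default list can also be narrowed before use. As a hedged sketch, the following keeps the negations "no" and "not" in the text by excluding them from the list passed to removeWords. The choice of words to keep and the sample sentence are assumptions for illustration only.

```matlab
words = stopWords;

% Words to keep in the documents (illustrative choice).
keep = ["no" "not"];

% Drop the kept words from the stop word list.
reducedStopWords = words(~ismember(words,keep));

% Apply the reduced list to an illustrative sentence.
documents = tokenizedDocument("this is not a good idea");
cleaned = removeWords(documents,reducedStopWords);
string(cleaned)
```

Logical indexing with ismember preserves the original order of the list, which setdiff would not.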
Get a list of Japanese stop words using the stopWords function. For readability, pad the list with empty strings so that it fills a 35-by-11 array, and then reshape it.

words = stopWords('Language','ja');
reshape([words strings(1,8)],[35 11])
ans = 35x11 string array
  Columns 1 through 7

    "あそこ"      "さらい"      "なかば"      "下"    "今"    "地"      "列"
    "あたり"      "さん"       "なに"       "字"    "部"    "員"      "事"
    "あちら"      "しかた"      "など"       "年"    "課"    "線"      "士"
    "あっち"      "しよう"      "なん"       "月"    "係"    "点"      "台"
    "あと"       "すか"       "はじめ"      "日"    "外"    "書"      "集"
    "あな"       "ずつ"       "はず"       "時"    "類"    "品"      "様"
    "あなた"      "すね"       "はるか"      "分"    "達"    "力"      "所"
    "あれ"       "すべて"      "ひと"       "秒"    "気"    "法"      "歴"
    "いくつ"      "ぜんぶ"      "ひとつ"      "週"    "室"    "感"      "器"
    "いつ"       "そう"       "ふく"       "火"    "口"    "作"      "名"
    "いま"       "そこ"       "ぶり"       "水"    "誰"    "元"      "情"
    "いや"       "そちら"      "べつ"       "木"    "用"    "手"      "連"
    "いろいろ"    "そっち"      "へん"       "金"    "界"    "数"      "毎"
    "うち"       "そで"       "ぺん"       "土"    "会"    "彼"      "式"
    "おおまか"    "それ"       "ほう"       "国"    "首"    "彼女"    "簿"
    "おまえ"      "それぞれ"    "ほか"       "都"    "男"    "子"      "回"
    "おれ"       "それなり"    "まさ"       "道"    "女"    "内"      "匹"
    "がい"       "たくさん"    "まし"       "府"    "別"    "楽"      "個"
    "かく"       "たち"       "まとも"      "県"    "話"    "喜"      "席"
    "かたち"      "たび"       "まま"       "市"    "私"    "怒"      "束"
    "かやの"      "ため"       "みたい"      "区"    "屋"    "哀"      "歳"
    "から"       "だめ"       "みつ"       "町"    "店"    "輪"      "目"
    "がら"       "ちゃ"       "みなさん"    "村"    "家"    "頃"      "通"
    "きた"       "ちゃん"      "みんな"      "各"    "場"    "化"      "面"
    "くせ"       "てん"       "もと"       "第"    "等"    "境"      "円"
    "ここ"       "とおり"      "もの"       "方"    "見"    "俺"      "玉"
    "こっち"      "とき"       "もん"       "何"    "際"    "奴"      "枚"
    "こと"       "どこ"       "やつ"       "的"    "観"    "高"      "前"
    "ごと"       "どこか"      "よう"       "度"    "段"    "校"      "後"
    "こちら"      "ところ"      "よそ"       "文"    "略"    "婦"      "左"
    "ごっちゃ"    "どちら"      "わけ"       "者"    "例"    "伸"      "右"
    "これ"       "どっか"      "わたし"      "性"    "系"    "紀"      "次"
    "これら"      "どっち"      "ハイ"       "体"    "論"    "誌"      "先"
    "ごろ"       "どれ"       "上"         "人"    "形"    "レ"      "春"
    "さまざま"    "なか"       "中"         "他"    "間"    "行"      "夏"

  Columns 8 through 11

    "秋"      "本当"     "う"       "どう" 
    "冬"      "確か"     "え"       "な"   
    "一"      "時点"     "お"       "ない" 
    "二"      "全部"     "か"       "なり" 
    "三"      "関係"     "が"       "なる" 
    "四"      "近く"     "こそ"     "に"   
    "五"      "方法"     "この"     "ね"   
    "六"      "我々"     "さ"       "の"   
    "七"      "違い"     "さえ"     "ので" 
    "八"      "多く"     "し"       "のに" 
    "九"      "扱い"     "しか"     "は"   
    "十"      "新た"     "する"     "ばかり"
    "百"      "その後"    "ず"       "へ"   
    "千"      "半ば"     "せる"     "ほど" 
    "万"      "結局"     "そして"    "ます" 
    "億"      "様々"     "その"     "ませ" 
    "兆"      "以前"     "た"       "また" 
    "下記"    "以後"     "たい"     "まで" 
    "上記"    "以降"     "ただ"     "も"   
    "時間"    "未満"     "だ"       "や"   
    "今回"    "以上"     "だけ"     "やら" 
    "前回"    "以下"     "だに"     "よ"   
    "場合"    "幾つ"     "だの"     "より" 
    "一つ"    "毎日"     "ち"       "れる" 
    "年生"    "自体"     "って"     "わ"   
    "自分"    "向こう"    "て"       "を"   
    "ヶ所"    "何人"     "で"       "ん"   
    "ヵ所"    "手段"     "でし"     ""     
    "カ所"    "同じ"     "です"     ""     
    "箇所"    "感じ"     "では"     ""     
    "ヶ月"    "あの"     "でも"     ""     
    "ヵ月"    "あり"     "でる"     ""     
    "カ月"    "ある"     "と"       ""     
    "箇月"    "い"       "とか"     ""     
    "名前"    "いる"     "とも"     ""     

Get a list of German stop words using the stopWords function. For readability, pad the list with empty strings so that it fills a 25-by-8 array, and then reshape it.

words = stopWords('Language','de');
reshape([words strings(1,7)],[25 8])
ans = 25x8 string array
  Columns 1 through 6

    "ab"         "dann"      "doch"       "hattet"     "jene"        "mein"   
    "aber"       "das"       "du"         "her"        "jenem"       "meine"  
    "alle"       "dass"      "durch"      "hin"        "jenen"       "meinem" 
    "allem"      "daß"       "ein"        "hätte"      "jener"       "meinen" 
    "allen"      "dein"      "eine"       "hättest"    "jenes"       "meiner" 
    "aller"      "deine"     "einem"      "hättet"     "kann"        "meines" 
    "alles"      "deinem"    "einen"      "ich"        "kannst"      "mich"   
    "als"        "deiner"    "einer"      "ihm"        "kein"        "mir"    
    "also"       "deines"    "eines"      "ihn"        "keine"       "mit"    
    "am"         "dem"       "er"         "ihr"        "keinem"      "muss"   
    "an"         "den"       "es"         "ihre"       "keinen"      "musst"  
    "andere"     "denn"      "euch"       "ihrem"      "keiner"      "musste" 
    "anderem"    "der"       "euer"       "ihren"      "keines"      "muß"    
    "anderen"    "derer"     "eure"       "ihrer"      "können"      "müssen" 
    "anderer"    "des"       "eurem"      "ihres"      "könnte"      "müssten"
    "anderes"    "dessen"    "euren"      "im"         "könnten"     "nach"   
    "auch"       "dich"      "eures"      "in"         "könntest"    "nicht"  
    "auf"        "die"       "für"        "ins"        "ließ"        "nichts" 
    "aus"        "dies"      "ganz"       "ist"        "man"         "noch"   
    "bei"        "diese"     "gar"        "ja"         "manche"      "nun"    
    "bin"        "diesem"    "habe"       "jede"       "manchem"     "nur"    
    "bis"        "diesen"    "haben"      "jedem"      "manchen"     "ob"     
    "bist"       "dieser"    "hat"        "jeden"      "mancher"     "oder"   
    "da"         "dieses"    "hatte"      "jeder"      "manches"     "seid"   
    "damit"      "dir"       "hattest"    "jedes"      "mehr"        "sein"   

  Columns 7 through 8

    "seine"      "welcher"
    "seinem"     "welches"
    "seinen"     "wenn"   
    "seiner"     "wer"    
    "seines"     "werde"  
    "sich"       "werden" 
    "sie"        "weshalb"
    "sind"       "wie"    
    "so"         "wieder" 
    "um"         "wieso"  
    "und"        "wir"    
    "uns"        "wirst"  
    "unter"      "wo"     
    "vom"        "während"
    "von"        "zu"     
    "vor"        "zum"    
    "war"        "zur"    
    "waren"      "über"   
    "warst"      ""       
    "warum"      ""       
    "was"        ""       
    "weil"       ""       
    "welche"     ""       
    "welchem"    ""       
    "welchen"    ""       

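The padding counts in the reshape calls above are specific to the length of each list. A sketch that computes the padding instead, so that it does not depend on the list length, might look like this. The variable names are illustrative.

```matlab
words = stopWords('Language','de');

% Choose the number of rows; the number of columns follows from the list length.
numRows = 25;
numCols = ceil(numel(words)/numRows);

% Pad with empty strings so that the element count equals numRows*numCols.
numPad = numRows*numCols - numel(words);
padded = [words strings(1,numPad)];

% Reshape the padded list into columns for display.
wordTable = reshape(padded,[numRows numCols])
```

Because reshape requires the number of elements to match the target size exactly, computing numPad avoids editing the call by hand if the list length changes.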
Input Arguments

language: Stop word language, specified as one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

For more information about language support in Text Analytics Toolbox™, see Language Considerations.
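As a small sketch, the following loops over the three supported language codes and reports the size of each list. In the release shown in the examples above, the reshape calls imply 225, 377, and 193 entries respectively; the counts might differ in other releases.

```matlab
% Supported language codes listed above.
languages = {'en','ja','de'};

% Count the stop words returned for each language.
counts = zeros(1,numel(languages));
for k = 1:numel(languages)
    counts(k) = numel(stopWords('Language',languages{k}));
    fprintf('%s: %d stop words\n',languages{k},counts(k));
end
```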

More About

Language Considerations

The stopWords and removeStopWords functions support English, Japanese, and German stop words only.

To remove stop words in other languages, use removeWords and specify your own list of stop words to remove.
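For example, a minimal sketch for French text could look like the following. The stop word list here is a deliberately short, hypothetical list for illustration, not a list provided by the toolbox, and the sketch assumes that the default tokenization splits this sentence at the spaces.

```matlab
% Hypothetical, incomplete French stop word list (for illustration only).
frenchStopWords = ["le" "la" "les" "de" "et" "un" "une"];

% Illustrative French sentence.
documents = tokenizedDocument("le chat et la souris");

% Remove the custom stop words.
cleaned = removeWords(documents,frenchStopWords);
string(cleaned)
```

In practice, a fuller list from a source you trust would replace the short list used here.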

Introduced in R2017b