stopWords

List of stop words

Syntax

words = stopWords

words = stopWords('Language',language)

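Description

Words such as "a", "and", "to", and "the" (known as stop words) can add noise to data. Use the stop word list as a basis for a custom list of words to remove before analysis.

To remove the default list of stop words from tokenized documents using the language details of the documents, use removeStopWords. To remove a custom list of words from tokenized documents, use removeWords.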
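The two removal routes described above can be sketched as follows. This is a minimal sketch, assuming Text Analytics Toolbox is installed; the sample sentence and the two-word custom list are illustrative and not part of the original page.

```matlab
% Tokenize a short illustrative sentence (assumed sample text).
documents = tokenizedDocument("an example of a short sentence");

% Route 1: remove the default stop word list for the document's language.
defaultCleaned = removeStopWords(documents);

% Route 2: remove a custom list; here just two hand-picked words.
customCleaned = removeWords(documents,["an" "of"]);

% Compare the number of tokens left by each route.
doclength(defaultCleaned)
doclength(customCleaned)
```

Use removeStopWords when the default list suits the data, and removeWords when you need control over exactly which words are dropped.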

This function returns stop word lists for English, Japanese, German, and Korean.

words = stopWords returns a string array of common English words that you can remove from documents before analysis.

words = stopWords('Language',language) specifies the language of the stop words.
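As a quick illustration of the two syntaxes, the following sketch calls the function once with no arguments and once with the 'Language' option, then tests membership in each returned list. The variable names enWords and deWords are chosen here for illustration.

```matlab
% Default call: English stop words.
enWords = stopWords;

% Name-value call: German stop words.
deWords = stopWords('Language','de');

% Both results are string arrays, so membership tests work directly.
ismember("the",enWords)
ismember("und",deWords)
```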

Examples

Remove Custom List of Stop Words from Documents

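To remove the default list of stop words using the language details of the documents, use removeStopWords.

To remove a custom list of stop words, use the removeWords function. You can use the stop word list returned by the stopWords function as a starting point.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.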

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the first few documents.

documents(1:5)

ans = 
  5×1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet

Create a list of stop words, starting with the output of the stopWords function.

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

Remove the custom stop words from the documents and view the first few documents.

documents = removeWords(documents,customStopWords);
documents(1:5)

ans = 
  5×1 tokenizedDocument:

    62 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory contracted thine own bright eyes feedst lights flame selfsubstantial fuel making famine abundance lies self foe sweet self cruel art worlds fresh ornament herald gaudy spring thine own bud buriest content tender churl makst waste niggarding pity world else glutton eat worlds due grave
    61 tokens: forty winters shall besiege brow dig deep trenches beautys field youths proud livery gazed tatterd weed small worth held asked beauty lies treasure lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd beautys couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made art old blood warm feelst cold
    52 tokens: look glass tell face viewest time face form another whose fresh repair renewest beguile world unbless mother fair whose uneard womb disdains tillage husbandry fond tomb selflove stop posterity art mothers glass calls back lovely april prime windows thine age shalt despite wrinkles golden time live rememberd die single thine image dies
    52 tokens: unthrifty loveliness why spend upon self beautys legacy natures bequest gives nothing lend frank lends free beauteous niggard why abuse bounteous largess give profitless usurer why great sum sums yet canst live traffic self alone self sweet self deceive nature calls gone acceptable audit canst leave unused beauty tombed lives th executor
    59 tokens: hours gentle work frame lovely gaze every eye dwell play tyrants same unfair fairly excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
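One way to confirm that the removal worked is to check that no custom stop word remains in the vocabulary. The sketch below is self-contained and uses a single short line of text in place of the sonnets file; the sample text and the variable names cleaned and leftover are assumptions made for illustration.

```matlab
% Short stand-in for one preprocessed sonnet line (assumed sample text).
documents = tokenizedDocument("thou art thy mothers glass");

% Same custom list as in the example: defaults plus archaic pronouns and verbs.
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

% Remove the custom stop words.
cleaned = removeWords(documents,customStopWords);

% Any custom stop word still present in the vocabulary would show up here.
leftover = intersect(cleaned.Vocabulary,customStopWords);
isempty(leftover)
```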

List of English Stop Words

Get a list of English stop words using the stopWords function. Reshape the output for readability.

words = stopWords;
reshape(words,[25 9])

ans = 25×9 string
    "a"          "but"         "during"     "hows"       "it's"     "said"         "this"       "we’re"      "who’ve"  
    "about"      "by"          "each"       "however"    "it’s"     "says"         "those"      "we've"      "whove"   
    "above"      "can"         "either"     "i"          "its"      "see"          "through"    "we’ve"      "will"    
    "across"     "can't"       "for"        "i'd"        "let's"    "she"          "to"         "weve"       "with"    
    "after"      "can’t"       "from"       "i’d"        "let’s"    "she'd"        "too"        "were"       "within"  
    "all"        "cant"        "given"      "i'll"       "lets"     "she’d"        "towards"    "what"       "without" 
    "along"      "cannot"      "had"        "i’ll"       "may"      "shed"         "under"      "what's"     "won't"   
    "also"       "could"       "has"        "i'm"        "me"       "she'll"       "until"      "what’s"     "won’t"   
    "am"         "couldn't"    "have"       "i’m"        "more"     "she’ll"       "us"         "whats"      "would"   
    "an"         "couldn’t"    "having"     "im"         "most"     "shell"        "use"        "when"       "wouldn't"
    "and"        "couldnt"     "he"         "i've"       "much"     "should"       "used"       "when's"     "wouldn’t"
    "any"        "did"         "he'd"       "i’ve"       "must"     "since"        "uses"       "when’s"     "you"     
    "are"        "didn't"      "he’d"       "ive"        "my"       "so"           "using"      "whens"      "you'd"   
    "aren't"     "didn’t"      "hed"        "if"         "no"       "some"         "very"       "where"      "you’d"   
    "aren’t"     "didnt"       "he'll"      "in"         "not"      "such"         "want"       "whether"    "youd"    
    "arent"      "do"          "he’ll"      "instead"    "now"      "than"         "was"        "which"      "you'll"  
    "as"         "does"        "her"        "into"       "of"       "that"         "wasn't"     "while"      "you’ll"  
    "at"         "doesn't"     "here"       "is"         "on"       "the"          "wasn’t"     "who"        "youll"   
    "be"         "doesn’t"     "hers"       "isn't"      "one"      "their"        "wasnt"      "who'll"     "you're"  
    "because"    "doesnt"      "him"        "isn’t"      "only"     "them"         "we"         "who’ll"     "you’re"  
    "been"       "doing"       "himself"    "isnt"       "or"       "then"         "we'd"       "wholl"      "youre"   
    "before"     "done"        "his"        "it"         "other"    "there"        "we’d"       "who's"      "you've"  
    "being"      "don't"       "how"        "it'll"      "our"      "therefore"    "we'll"      "who’s"      "you’ve"  
    "between"    "don’t"       "how's"      "it’ll"      "out"      "these"        "we’ll"      "whos"       "youve"   
    "both"       "dont"        "how’s"      "itll"       "over"     "they"         "we're"      "who've"     "your"

List of Japanese Stop Words

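Get a list of Japanese stop words using the stopWords function. Reshape the output for readability. Because reshape requires the element count to match the target size, pad the list with empty strings first.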

words = stopWords('Language','ja');
reshape([words strings(1,8)],[35 11])

ans = 35×11 string
    "あそこ"      "さらい"      "なかば"      "下"    "今"    "地"      "列"    "秋"      "本当"     "う"       "どう" 
    "あたり"      "さん"       "なに"       "字"    "部"    "員"      "事"    "冬"      "確か"     "え"       "な"   
    "あちら"      "しかた"      "など"       "年"    "課"    "線"      "士"    "一"      "時点"     "お"       "ない" 
    "あっち"      "しよう"      "なん"       "月"    "係"    "点"      "台"    "二"      "全部"     "か"       "なり" 
    "あと"       "すか"       "はじめ"      "日"    "外"    "書"      "集"    "三"      "関係"     "が"       "なる" 
    "あな"       "ずつ"       "はず"       "時"    "類"    "品"      "様"    "四"      "近く"     "こそ"     "に"   
    "あなた"      "すね"       "はるか"      "分"    "達"    "力"      "所"    "五"      "方法"     "この"     "ね"   
    "あれ"       "すべて"      "ひと"       "秒"    "気"    "法"      "歴"    "六"      "我々"     "さ"       "の"   
    "いくつ"      "ぜんぶ"      "ひとつ"      "週"    "室"    "感"      "器"    "七"      "違い"     "さえ"     "ので" 
    "いつ"       "そう"       "ふく"       "火"    "口"    "作"      "名"    "八"      "多く"     "し"       "のに" 
    "いま"       "そこ"       "ぶり"       "水"    "誰"    "元"      "情"    "九"      "扱い"     "しか"     "は"   
    "いや"       "そちら"      "べつ"       "木"    "用"    "手"      "連"    "十"      "新た"     "する"     "ばかり"
    "いろいろ"    "そっち"      "へん"       "金"    "界"    "数"      "毎"    "百"      "その後"    "ず"       "へ"   
    "うち"       "そで"       "ぺん"       "土"    "会"    "彼"      "式"    "千"      "半ば"     "せる"     "ほど" 
    "おおまか"    "それ"       "ほう"       "国"    "首"    "彼女"    "簿"    "万"      "結局"     "そして"    "ます" 
    "おまえ"      "それぞれ"    "ほか"       "都"    "男"    "子"      "回"    "億"      "様々"     "その"     "ませ" 
    "おれ"       "それなり"    "まさ"       "道"    "女"    "内"      "匹"    "兆"      "以前"     "た"       "また" 
    "がい"       "たくさん"    "まし"       "府"    "別"    "楽"      "個"    "下記"    "以後"     "たい"     "まで" 
    "かく"       "たち"       "まとも"      "県"    "話"    "喜"      "席"    "上記"    "以降"     "ただ"     "も"   
    "かたち"      "たび"       "まま"       "市"    "私"    "怒"      "束"    "時間"    "未満"     "だ"       "や"   
    "かやの"      "ため"       "みたい"      "区"    "屋"    "哀"      "歳"    "今回"    "以上"     "だけ"     "やら" 
    "から"       "だめ"       "みつ"       "町"    "店"    "輪"      "目"    "前回"    "以下"     "だに"     "よ"   
    "がら"       "ちゃ"       "みなさん"    "村"    "家"    "頃"      "通"    "場合"    "幾つ"     "だの"     "より" 
    "きた"       "ちゃん"      "みんな"      "各"    "場"    "化"      "面"    "一つ"    "毎日"     "ち"       "れる" 
    "くせ"       "てん"       "もと"       "第"    "等"    "境"      "円"    "年生"    "自体"     "って"     "わ"   
    "ここ"       "とおり"      "もの"       "方"    "見"    "俺"      "玉"    "自分"    "向こう"    "て"       "を"   
    "こっち"      "とき"       "もん"       "何"    "際"    "奴"      "枚"    "ヶ所"    "何人"     "で"       "ん"   
    "こと"       "どこ"       "やつ"       "的"    "観"    "高"      "前"    "ヵ所"    "手段"     "でし"     ""     
    "ごと"       "どこか"      "よう"       "度"    "段"    "校"      "後"    "カ所"    "同じ"     "です"     ""     
    "こちら"      "ところ"      "よそ"       "文"    "略"    "婦"      "左"    "箇所"    "感じ"     "では"     ""     
      ⋮
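The padding in the reshape call above can also be computed rather than hard-coded. The following is a minimal sketch under the assumption that the list is a row vector, as the concatenation in the example implies; numRows, numCols, numPad, padded, and wordTable are names introduced here for illustration.

```matlab
words = stopWords('Language','ja');    % row vector of stop words
numRows = 35;                          % rows to display (illustrative choice)
numCols = ceil(numel(words)/numRows);  % columns needed to hold every word
numPad = numRows*numCols - numel(words);

% Append empty strings so the element count matches numRows*numCols.
padded = [words strings(1,numPad)];
wordTable = reshape(padded,[numRows numCols]);
size(wordTable)
```

Computing the padding this way keeps the display code working if the length of the list changes between releases.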

List of German Stop Words

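Get a list of German stop words using the stopWords function. Reshape the output for readability, padding the list with empty strings so that it fills the target size.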

words = stopWords('Language','de');
reshape([words strings(1,7)],[25 8])

ans = 25×8 string
    "ab"         "dann"      "doch"       "hattet"     "jene"        "mein"       "seine"      "welcher"
    "aber"       "das"       "du"         "her"        "jenem"       "meine"      "seinem"     "welches"
    "alle"       "dass"      "durch"      "hin"        "jenen"       "meinem"     "seinen"     "wenn"   
    "allem"      "daß"       "ein"        "hätte"      "jener"       "meinen"     "seiner"     "wer"    
    "allen"      "dein"      "eine"       "hättest"    "jenes"       "meiner"     "seines"     "werde"  
    "aller"      "deine"     "einem"      "hättet"     "kann"        "meines"     "sich"       "werden" 
    "alles"      "deinem"    "einen"      "ich"        "kannst"      "mich"       "sie"        "weshalb"
    "als"        "deiner"    "einer"      "ihm"        "kein"        "mir"        "sind"       "wie"    
    "also"       "deines"    "eines"      "ihn"        "keine"       "mit"        "so"         "wieder" 
    "am"         "dem"       "er"         "ihr"        "keinem"      "muss"       "um"         "wieso"  
    "an"         "den"       "es"         "ihre"       "keinen"      "musst"      "und"        "wir"    
    "andere"     "denn"      "euch"       "ihrem"      "keiner"      "musste"     "uns"        "wirst"  
    "anderem"    "der"       "euer"       "ihren"      "keines"      "muß"        "unter"      "wo"     
    "anderen"    "derer"     "eure"       "ihrer"      "können"      "müssen"     "vom"        "während"
    "anderer"    "des"       "eurem"      "ihres"      "könnte"      "müssten"    "von"        "zu"     
    "anderes"    "dessen"    "euren"      "im"         "könnten"     "nach"       "vor"        "zum"    
    "auch"       "dich"      "eures"      "in"         "könntest"    "nicht"      "war"        "zur"    
    "auf"        "die"       "für"        "ins"        "ließ"        "nichts"     "waren"      "über"   
    "aus"        "dies"      "ganz"       "ist"        "man"         "noch"       "warst"      ""       
    "bei"        "diese"     "gar"        "ja"         "manche"      "nun"        "warum"      ""       
    "bin"        "diesem"    "habe"       "jede"       "manchem"     "nur"        "was"        ""       
    "bis"        "diesen"    "haben"      "jedem"      "manchen"     "ob"         "weil"       ""       
    "bist"       "dieser"    "hat"        "jeden"      "mancher"     "oder"       "welche"     ""       
    "da"         "dieses"    "hatte"      "jeder"      "manches"     "seid"       "welchem"    ""       
    "damit"      "dir"       "hattest"    "jedes"      "mehr"        "sein"       "welchen"    ""

Input Arguments

`language` — Language of stop words
`'en'` (default) | `'ja'` | `'de'` | `'ko'`

Language of stop words, specified as one of the following:

'en' – English
'ja' – Japanese
'de' – German
'ko' – Korean

For more information about language support in Text Analytics Toolbox™, see Language Considerations.
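To see how the language argument selects a list, the following sketch loops over the four documented language codes and reports the size of each list. It assumes a release in which all four codes are available; the variable names codes and counts are illustrative.

```matlab
% The four language codes documented for the 'Language' option.
codes = {'en','ja','de','ko'};
counts = zeros(1,numel(codes));

for k = 1:numel(codes)
    % Request the stop word list for one language and record its length.
    words = stopWords('Language',codes{k});
    counts(k) = numel(words);
    fprintf('%s: %d stop words\n',codes{k},counts(k));
end
```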

More About

Language Considerations

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
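For a language without a built-in list, the approach above might look like the following sketch. The French sentence and the four-word list frenchStopWords are hypothetical and are not supplied by the toolbox. The sketch also assumes that the default tokenization splits this simple sentence at whitespace.

```matlab
% Hypothetical French stop word list; not part of the toolbox.
frenchStopWords = ["le" "la" "est" "sur"];

% Tokenize an illustrative French sentence with the default settings.
documents = tokenizedDocument("le chat est sur la table");

% Remove the user-supplied stop words.
documents = removeWords(documents,frenchStopWords);

% Convert the single remaining document to a string array of words.
string(documents)
```

In practice, a list for another language would be far longer and would typically come from a curated source rather than being typed by hand.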

Version History

Introduced in R2017b
