Prepare Text Data for Analysis

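This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and can contain a lot of noise that negatively affects statistical analysis. For example, text data can contain the following:

Variations in case, for example "new" and "New"
Variations in word forms, for example "walk" and "walking"
Words that add noise, for example stop words such as "the" and "of"
Punctuation and special characters
HTML and XML tags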
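
As a rough illustration of how case variations, punctuation, and tags inflate a vocabulary, the sketch below counts unique whitespace-separated tokens before and after lowercasing and stripping tags and punctuation. It uses only base MATLAB string functions rather than the Text Analytics Toolbox functions used in the rest of this example, and the sentence in str is a made-up input, not a line from factoryReports.csv.

```matlab
% Hypothetical toy sentence (an assumption, not from the example data).
str = "The new mixer and the New scanner: the belt <b>stopped</b>.";

% Raw vocabulary: split on whitespace and count the unique tokens.
numRaw = numel(unique(split(str)));

% Lowercase, strip HTML/XML-style tags, then strip everything that is
% not a lowercase letter or a space.
cleaned = lower(str);
cleaned = regexprep(cleaned,"<[^>]*>","");
cleaned = regexprep(cleaned,"[^a-z ]","");
numCleaned = numel(unique(split(strtrim(cleaned))));

fprintf("Unique tokens: %d raw, %d cleaned\n",numRaw,numCleaned)
```

Word-form variations and stop words are not handled by this sketch; those need the lemmatization and stop-word steps shown later in the example.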

The following word clouds illustrate word frequency analysis applied to raw text data from factory reports and to a preprocessed version of the same text data.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the field Description and the label data from the field Category.

textData = data.Description;
labels = data.Category;
textData(1:10)

ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

Create Tokenized Documents

Create an array of tokenized documents.

cleanedDocuments = tokenizedDocument(textData);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    10 tokens: Items are occasionally getting stuck in the scanner spools .
    11 tokens: Loud rattling and banging sounds are coming from assembler pistons .
    11 tokens: There are cuts to the power when starting the plant .
     6 tokens: Fried capacitors in the assembler .
     5 tokens: Mixer tripped the fuses .
    10 tokens: Burst pipe in the constructing agent is spraying coolant .
     8 tokens: A fuse is blown in the mixer .
     9 tokens: Things continue to tumble off of the belt .
     7 tokens: Falling items from the conveyor belt .
    13 tokens: The scanner reel is split , it will soon begin to curve .

To improve lemmatization, add part-of-speech details to the documents using addPartOfSpeechDetails. Call addPartOfSpeechDetails before removing stop words and lemmatizing.

cleanedDocuments = addPartOfSpeechDetails(cleanedDocuments);

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Remove a list of stop words using the removeStopWords function. Use removeStopWords before using normalizeWords.

cleanedDocuments = removeStopWords(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    7 tokens: Items occasionally getting stuck scanner spools .
    8 tokens: Loud rattling banging sounds coming assembler pistons .
    5 tokens: cuts power starting plant .
    4 tokens: Fried capacitors assembler .
    4 tokens: Mixer tripped fuses .
    7 tokens: Burst pipe constructing agent spraying coolant .
    4 tokens: fuse blown mixer .
    6 tokens: Things continue tumble off belt .
    5 tokens: Falling items conveyor belt .
    8 tokens: scanner reel split , soon begin curve .

Lemmatize the words using normalizeWords.

cleanedDocuments = normalizeWords(cleanedDocuments,'Style','lemma');
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    7 tokens: items occasionally get stuck scanner spool .
    8 tokens: loud rattle bang sound come assembler piston .
    5 tokens: cut power start plant .
    4 tokens: fry capacitor assembler .
    4 tokens: mixer trip fuse .
    7 tokens: burst pipe constructing agent spray coolant .
    4 tokens: fuse blow mixer .
    6 tokens: thing continue tumble off belt .
    5 tokens: fall item conveyor belt .
    8 tokens: scanner reel split , soon begin curve .

Erase the punctuation from the documents.

cleanedDocuments = erasePunctuation(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
    6 tokens: burst pipe constructing agent spray coolant
    3 tokens: fuse blow mixer
    5 tokens: thing continue tumble off belt
    4 tokens: fall item conveyor belt
    6 tokens: scanner reel split soon begin curve

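Remove words with 2 or fewer characters and words with 15 or more characters. The first ten documents contain no such words, so their display does not change.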

cleanedDocuments = removeShortWords(cleanedDocuments,2);
cleanedDocuments = removeLongWords(cleanedDocuments,15);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
    6 tokens: burst pipe constructing agent spray coolant
    3 tokens: fuse blow mixer
    5 tokens: thing continue tumble off belt
    4 tokens: fall item conveyor belt
    6 tokens: scanner reel split soon begin curve
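
Both thresholds are inclusive: a word of exactly 2 characters is removed, and so is a word of exactly 15 characters. The following sketch reproduces that length rule on a plain string array with base MATLAB. It illustrates the thresholds only and is not the toolbox implementation; the words are hypothetical.

```matlab
% Hypothetical tokens (an assumption, not from the example data).
% "electromagnetic" has exactly 15 characters.
words = ["of" "fuse" "electromagnetic" "conveyor" "it"];

% Number of characters in each word.
len = strlength(words);

% Keep words with more than 2 and fewer than 15 characters, which
% mirrors removeShortWords(...,2) followed by removeLongWords(...,15).
keep = len > 2 & len < 15;
filtered = words(keep);
```

Here only "fuse" and "conveyor" survive the filter.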

Create Bag-of-Words Model

Create a bag-of-words model.

cleanedBag = bagOfWords(cleanedDocuments)

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×352 double]
      Vocabulary: [1×352 string]
        NumWords: 352
    NumDocuments: 480

Remove words that appear two times or fewer in total from the bag-of-words model.

cleanedBag = removeInfrequentWords(cleanedBag,2)

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×163 double]
      Vocabulary: [1×163 string]
        NumWords: 163
    NumDocuments: 480
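
To make the threshold concrete, here is a minimal sketch of the same rule applied to a small count matrix using base MATLAB. The vocabulary and counts are hypothetical, and the code only mimics the "at most 2 occurrences in total" criterion; it is not how removeInfrequentWords is implemented.

```matlab
% Hypothetical vocabulary and count matrix
% (rows = documents, columns = words).
vocabulary = ["fuse" "mixer" "coolant" "belt"];
counts = [1 0 1 2
          1 1 0 1
          1 0 0 1];

% Total number of occurrences of each word across all documents.
totals = sum(counts,1);

% Keep only the words that appear more than 2 times in total.
keep = totals > 2;
vocabulary = vocabulary(keep);
counts = counts(:,keep);
```

The number of documents (rows) is unchanged; only the vocabulary (columns) shrinks, which matches the change from 352 to 163 words above.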

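Some preprocessing steps, such as removeInfrequentWords, can leave empty documents in the bag-of-words model. To ensure that no empty documents remain in the bag-of-words model after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and remove the corresponding labels from labels. In this example, no documents are empty, so the number of documents stays at 480.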

[cleanedBag,idx] = removeEmptyDocuments(cleanedBag);
labels(idx) = [];
cleanedBag

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×163 double]
      Vocabulary: [1×163 string]
        NumWords: 163
    NumDocuments: 480
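
Because no documents were removed here, the effect of labels(idx) = [] is not visible. The sketch below shows, with base MATLAB and made-up data, why the labels must be deleted with the same indices: otherwise the rows of the count matrix and the labels fall out of alignment. The counts and label values are assumptions for illustration.

```matlab
% Hypothetical count matrix in which document 2 has no remaining words.
counts = [2 1
          0 0
          1 3];
% Hypothetical labels, one per document.
labels = ["Leak"; "Mechanical Failure"; "Electronic Failure"];

% Indices of the empty documents (rows whose counts sum to zero).
idx = find(sum(counts,2) == 0);

% Remove the empty documents and the matching labels together.
counts(idx,:) = [];
labels(idx) = [];
```

After this step, counts has two rows and labels has two entries, so row i of counts still corresponds to labels(i).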

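Create a Preprocessing Function

Creating a function that performs preprocessing is useful because it lets you prepare different collections of text data in the same way. For example, you can use one function to preprocess new data using the same steps as the training data.

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessText performs the following steps:

Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessText to prepare new text data.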

newText = "The sorting machine is making lots of loud noises.";
newDocuments = preprocessText(newText)

newDocuments = 
  tokenizedDocument:

   6 tokens: sorting machine make lot loud noise

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)

rawBag = 
  bagOfWords with properties:

          Counts: [480×555 double]
      Vocabulary: [1×555 string]
        NumWords: 555
    NumDocuments: 480

Calculate the reduction in vocabulary size (the number of unique words).

numWordsCleaned = cleanedBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsCleaned/numWordsRaw

reduction = 0.7063

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanedBag);
title("Cleaned Data")

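Preprocessing Function

The function preprocessText performs the following steps in order:

Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.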

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
