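Create Simple Text Model for Classification

This example shows how to train a simple text classifier on word frequency counts using a bag-of-words model.

You can create a simple classification model that uses word frequency counts as predictors. This example trains such a model to predict the category of factory reports from their text descriptions.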

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each report.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
head(data)

ans=8×5 table
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Convert the labels in the Category column of the table to categorical and view the distribution of the classes in the data using a histogram.

data.Category = categorical(data.Category);
figure
histogram(data.Category)
xlabel("Class")
ylabel("Frequency")
title("Class Distribution")
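
If a numeric summary is more convenient than a plot, the class counts can also be tabulated. The following is a minimal sketch that uses hypothetical labels (toyLabels is not read from factoryReports.csv); in the example, data.Category would take its place.

```matlab
% Hypothetical labels, for illustration only (not from factoryReports.csv).
toyLabels = categorical(["Leak";"Leak";"Software Failure";"Mechanical Failure"]);

% categories returns the class names; countcats returns the number of
% observations per class, in the same order as the class names.
classNames = categories(toyLabels);
classCounts = countcats(toyLabels);

% Show names and counts side by side.
table(classNames,classCounts)
```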

Partition the data into a training partition and a held-out test set. Specify the holdout percentage to be 10%.

cvp = cvpartition(data.Category,'Holdout',0.1);
dataTrain = data(cvp.training,:);
dataTest = data(cvp.test,:);
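
To see what the holdout partition produces, the sketch below applies cvpartition to hypothetical labels (toyLabels and cvpToy are illustrative names, not part of the example). The training and test indices are logical vectors that together cover every observation, with roughly 10% of the observations held out.

```matlab
rng(0)  % fix the random seed so the toy partition is reproducible

% Hypothetical imbalanced labels: 80 of class "A" and 20 of class "B".
toyLabels = categorical([repmat("A",80,1); repmat("B",20,1)]);

% Hold out about 10% of the observations for testing.
cvpToy = cvpartition(toyLabels,'Holdout',0.1);

% training and test return logical index vectors over the observations.
numTrain = nnz(training(cvpToy))
numTest = nnz(test(cvpToy))

% Class counts inside the held-out set.
countcats(toyLabels(test(cvpToy)))
```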

Extract the text data and labels from the tables.

textDataTrain = dataTrain.Description;
textDataTest = dataTest.Description;
YTrain = dataTrain.Category;
YTest = dataTest.Category;

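Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessText, listed in the Example Preprocessing Function section, performs the following steps in order:

Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails to improve lemmatization.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessText to prepare the text data.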

documents = preprocessText(textDataTrain);
documents(1:5)

ans = 
  5×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse

Create a bag-of-words model from the tokenized documents.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [432×336 double]
      Vocabulary: [1×336 string]
        NumWords: 336
    NumDocuments: 432
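
The Counts property is a sparse matrix with one row per document and one column per vocabulary word. As a rough illustration, the sketch below builds a bag-of-words model from two hypothetical, already preprocessed documents (toyDocs and toyBag are illustrative names).

```matlab
% Two hypothetical documents, already lowercased and lemmatized.
toyDocs = tokenizedDocument([
    "mixer trip fuse"
    "fuse blow mixer fuse"]);

toyBag = bagOfWords(toyDocs);

% Unique words found across the documents.
toyBag.Vocabulary

% Convert the sparse counts to a full matrix for display. Entry (i,j) is
% the number of times word j occurs in document i.
toyCounts = full(toyBag.Counts)
```

Here the second document contains "fuse" twice, so its row holds a 2 in the column for "fuse".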

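Remove words from the bag-of-words model that do not appear more than two times in total (that is, words that appear two times or fewer). Remove any documents containing no words from the bag-of-words model, and remove the corresponding entries in the labels.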

bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
bag

bag = 
  bagOfWords with properties:

          Counts: [432×155 double]
      Vocabulary: [1×155 string]
        NumWords: 155
    NumDocuments: 432
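
The threshold passed to removeInfrequentWords is inclusive: words whose total count is at most that value are dropped. The sketch below shows this on hypothetical documents (toyDocs, toyBag, and toyLabels are illustrative names), together with how the indices returned by removeEmptyDocuments keep the labels aligned.

```matlab
% Hypothetical documents: "fuse" occurs 3 times, "mixer" 2 times, "pipe" once.
toyDocs = tokenizedDocument([
    "fuse fuse fuse"
    "mixer mixer"
    "pipe"]);
toyLabels = categorical(["Electronic Failure";"Mechanical Failure";"Leak"]);

toyBag = bagOfWords(toyDocs);

% Drop words that appear 2 times or fewer in total ("mixer" and "pipe").
toyBag = removeInfrequentWords(toyBag,2);

% Documents 2 and 3 are now empty. Remove them and the matching labels.
[toyBag,emptyIdx] = removeEmptyDocuments(toyBag);
toyLabels(emptyIdx) = [];

toyBag.Vocabulary
toyLabels
```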

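Train Supervised Classifier

Train a supervised classification model using the word frequency counts from the bag-of-words model and the labels.

Train a multiclass linear classification model using fitcecoc. Specify the Counts property of the bag-of-words model as the predictors and the category labels as the response. Specify the learners to be linear. These learners support sparse data input.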

XTrain = bag.Counts;
mdl = fitcecoc(XTrain,YTrain,'Learners','linear')

mdl = 
  CompactClassificationECOC
      ResponseName: 'Y'
        ClassNames: [Electronic Failure    Leak    Mechanical Failure    Software Failure]
    ScoreTransform: 'none'
    BinaryLearners: {6×1 cell}
      CodingMatrix: [4×6 double]


  Properties, Methods

For a better fit, you can try specifying different parameters of the linear learners. For more information on linear classification learner templates, see templateLinear.
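
As one possible starting point, the sketch below passes a templateLinear object to fitcecoc instead of the 'linear' shorthand. The data is synthetic (X, Y, and toyMdl are hypothetical names), and the learner type, regularization, and Lambda value are arbitrary choices for illustration, not tuned recommendations. In the example, XTrain and YTrain would take the place of X and Y.

```matlab
rng(0)  % reproducible synthetic data

% Synthetic sparse "word counts": each of 3 classes uses its own 2 columns.
numPerClass = 30;
X = zeros(3*numPerClass,6);
for k = 1:3
    rows = (k-1)*numPerClass + (1:numPerClass);
    X(rows,2*k-1:2*k) = randi([1 3],numPerClass,2);
end
X = sparse(X);
Y = categorical(repelem(["A";"B";"C"],numPerClass));

% Linear learner template: logistic regression with a ridge penalty.
% These settings are assumptions chosen only to show the syntax.
t = templateLinear('Learner','logistic', ...
    'Regularization','ridge', ...
    'Lambda',1e-3);

% Train the ECOC model with the customized linear learners.
toyMdl = fitcecoc(X,Y,'Learners',t);

% Accuracy on the training data (optimistic; use held-out data to compare).
trainAcc = mean(predict(toyMdl,X) == Y)
```

Comparing such variants on held-out data, rather than on the training data, gives a fairer picture of which settings generalize.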

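Test Classifier

Evaluate the trained model on the test data by predicting labels and calculating the classification accuracy. The classification accuracy is the proportion of the labels that the model predicts correctly.

Preprocess the test data using the same preprocessing steps as the training data. Encode the resulting test documents as a matrix of word frequency counts according to the bag-of-words model.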

documentsTest = preprocessText(textDataTest);
XTest = encode(bag,documentsTest);
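
Encoding against an existing bag-of-words model always yields one column per vocabulary word, so the test matrix has the same columns as the training matrix. The sketch below uses hypothetical documents (toyBag, newDoc, and xNew are illustrative names) to show that words outside the vocabulary contribute no counts.

```matlab
% Vocabulary learned from two hypothetical training documents.
toyBag = bagOfWords(tokenizedDocument([
    "mixer trip fuse"
    "pipe leak coolant"]));

% New document: "sorter" and "blow" are not in the vocabulary.
newDoc = tokenizedDocument("sorter blow fuse fuse");

% encode returns a sparse row of counts aligned with toyBag.Vocabulary.
xNew = full(encode(toyBag,newDoc))
```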

Predict the labels of the test data using the trained model and calculate the classification accuracy.

YPred = predict(mdl,XTest);
acc = sum(YPred == YTest)/numel(YTest)

acc = 0.8542
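
A single accuracy figure does not show which classes are confused with each other. As a small sketch with hypothetical labels (yTrue and yHat are illustrative and are not the example's YTest and YPred), a confusion matrix can be computed alongside the accuracy:

```matlab
% Hypothetical true and predicted labels, for illustration only.
yTrue = categorical(["Leak";"Leak";"Mechanical Failure";"Electronic Failure"]);
yHat  = categorical(["Leak";"Mechanical Failure";"Mechanical Failure";"Electronic Failure"]);

% Accuracy: proportion of predictions that match the true labels.
accToy = mean(yHat == yTrue)

% Confusion matrix: rows are true classes, columns are predicted classes,
% both ordered as listed in classOrder.
[C,classOrder] = confusionmat(yTrue,yHat)
```

The diagonal of C counts the correct predictions per class; off-diagonal entries show which true class was mistaken for which predicted class.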

Predict Using New Data

Classify the category of new factory reports. Create a string array containing the new factory reports.

str = [
    "Coolant is pooling underneath sorter."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];
documentsNew = preprocessText(str);
XNew = encode(bag,documentsNew);
labelsNew = predict(mdl,XNew)

labelsNew = 3×1 categorical
     Leak 
     Electronic Failure 
     Mechanical Failure

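Example Preprocessing Function

The function preprocessText performs the following steps in order:

Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails to improve lemmatization.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.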

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
