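Train a Sentiment Classifier

This example shows how to train a classifier for sentiment analysis using an annotated list of positive and negative sentiment words and a pretrained word embedding.

The pretrained word embedding plays two roles in this workflow: it converts words into numeric vectors, and those vectors form the basis for the classifier. You can then use the classifier to predict the sentiment of other words from their vector representations, and use those classifications to calculate the sentiment of a piece of text. Training and using the sentiment classifier takes four steps:

Load a pretrained word embedding.
Load an opinion lexicon listing positive and negative words.
Train a sentiment classifier using the word vectors of the positive and negative words.
Calculate the mean sentiment score of the words in a piece of text.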

Load Pretrained Word Embedding

A word embedding maps the words in a vocabulary to numeric vectors. These embeddings can capture semantic details of the words, so that similar words have similar vectors. They also model relationships between words through vector arithmetic. For example, the relationship "Rome is to Italy as Paris is to France" is described by the equation $\mathrm{Rome} - \mathrm{Italy} + \mathrm{France} \approx \mathrm{Paris}$.

Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;
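
As a rough illustration of the vector arithmetic described above, the following sketch asks the embedding for the words closest to Rome - Italy + France. It assumes emb is the embedding loaded in the previous step. The variable names and the choice of five neighbors are illustrative only, and the words returned depend on the embedding.

```matlab
% Assumes emb was loaded with fastTextWordEmbedding (see above).
rome   = word2vec(emb,"Rome");
italy  = word2vec(emb,"Italy");
france = word2vec(emb,"France");

% Shift "Rome" by the Italy-to-France offset.
query = rome - italy + france;

% List the five vocabulary words whose vectors are closest to the query.
nearest = vec2word(emb,query,5)
```

If the embedding captures the capital-country relationship, "Paris" should appear among the nearest words, although the query words themselves can also rank highly.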

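Load Opinion Lexicon

Load the positive and negative words from the opinion lexicon (also known as a sentiment lexicon) available at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html [1]. First, extract the files from the .rar file into a folder named opinion-lexicon-English, and then import the text.

Load the data using the function readLexicon, listed at the end of this example. The output data is a table with the variable Word, which contains the words, and the variable Label, which contains a categorical sentiment label (Positive or Negative).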

data = readLexicon;

View the first few words labeled as positive.

idx = data.Label == "Positive";
head(data(idx,:))

ans=8×2 table
        Word         Label  
    ____________    ________

    "a+"            Positive
    "abound"        Positive
    "abounds"       Positive
    "abundance"     Positive
    "abundant"      Positive
    "accessable"    Positive
    "accessible"    Positive
    "acclaim"       Positive

View the first few words labeled as negative.

idx = data.Label == "Negative";
head(data(idx,:))

ans=8×2 table
        Word          Label  
    _____________    ________

    "2-faced"        Negative
    "2-faces"        Negative
    "abnormal"       Negative
    "abolish"        Negative
    "abominable"     Negative
    "abominably"     Negative
    "abominate"      Negative
    "abomination"    Negative

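Prepare Data for Training

To train the sentiment classifier, convert the words to word vectors using the pretrained word embedding emb. First, remove the words that do not appear in the word embedding emb.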

idx = ~isVocabularyWord(emb,data.Word);
data(idx,:) = [];

Set aside a randomly chosen 10% of the words for testing.

numWords = size(data,1);
cvp = cvpartition(numWords,'HoldOut',0.1);
dataTrain = data(training(cvp),:);
dataTest = data(test(cvp),:);
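
Because the holdout partition is drawn at random, the split changes from run to run. One option, sketched here on a hypothetical 100-observation partition, is to seed the random number generator before calling cvpartition. The seed value 0 and the names cvpDemo, nTrain, and nTest are arbitrary choices for illustration.

```matlab
% Fix the random seed so the partition is reproducible (seed is arbitrary).
rng(0);

% Hold out roughly 10% of 100 hypothetical observations.
cvpDemo = cvpartition(100,'HoldOut',0.1);

% training and test return logical index vectors over the observations.
nTrain = sum(training(cvpDemo));
nTest  = sum(test(cvpDemo));
fprintf("train: %d, test: %d\n",nTrain,nTest);
```

Every observation lands in exactly one of the two sets, so nTrain + nTest equals the number of observations.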

Convert the words in the training data to word vectors using word2vec.

wordsTrain = dataTrain.Word;
XTrain = word2vec(emb,wordsTrain);
YTrain = dataTrain.Label;
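
A quick sanity check on the training matrix can catch problems before training. The sketch below assumes emb, wordsTrain, and XTrain from the steps above; hasMissing is an illustrative name. word2vec returns a row of NaN values for any word that is not in the embedding vocabulary, so after the filtering step no NaN values should remain.

```matlab
% One row per training word, one column per embedding dimension.
assert(size(XTrain,1) == numel(wordsTrain))
assert(size(XTrain,2) == emb.Dimension)

% Out-of-vocabulary words would show up as rows of NaN.
hasMissing = any(isnan(XTrain),'all');
disp(hasMissing)
```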

Train Sentiment Classifier

Train a support vector machine (SVM) classifier that classifies word vectors into positive and negative categories.

mdl = fitcsvm(XTrain,YTrain);

Test Classifier

Convert the words in the test data to word vectors using word2vec.

wordsTest = dataTest.Word;
XTest = word2vec(emb,wordsTest);
YTest = dataTest.Label;

Predict the sentiment labels of the test word vectors.

[YPred,scores] = predict(mdl,XTest);
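
The scores output has one column per class, in the order given by mdl.ClassNames. For categorical labels that order should follow the category order of the labels, which depends on how the labels were built. The sketch below uses three hypothetical labels to contrast the construction used in readLexicon with direct conversion from strings; orderAssigned and orderSorted are illustrative names.

```matlab
% Labels built the way readLexicon builds them: categories are added
% in the order in which they are first assigned.
labels = categorical(nan(3,1));
labels(1:2) = "Positive";
labels(3) = "Negative";
orderAssigned = categories(labels);

% Labels converted directly from strings: categories are sorted.
orderSorted = categories(categorical(["Positive";"Positive";"Negative"]));

disp(orderAssigned')
disp(orderSorted')
```

Since the two constructions order the classes differently, it is worth inspecting mdl.ClassNames before treating scores(:,1) as the Positive score.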

Visualize the classification accuracy in a confusion matrix.

figure
confusionchart(YTest,YPred);
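
A single accuracy figure can complement the chart. The following minimal sketch uses four hypothetical labels (YTestDemo and YPredDemo are made-up stand-ins, not results from this example) to show the calculation.

```matlab
% Hypothetical true and predicted labels, for illustration only.
YTestDemo = categorical(["Positive";"Negative";"Positive";"Negative"]);
YPredDemo = categorical(["Positive";"Negative";"Negative";"Negative"]);

% Fraction of predictions that match the true labels.
accuracy = mean(YPredDemo == YTestDemo);

% Counts behind the chart: rows are true classes, columns are predictions.
C = confusionmat(YTestDemo,YPredDemo);
disp(accuracy)
```

With the variables from this example, the same calculation is mean(YPred == YTest).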

Visualize the classifications in word clouds. Plot the words with positive and negative sentiments in word clouds, with word sizes corresponding to the prediction scores.

figure
subplot(1,2,1)
idx = YPred == "Positive";
wordcloud(wordsTest(idx),scores(idx,1));
title("Predicted Positive Sentiment")

subplot(1,2,2)
wordcloud(wordsTest(~idx),scores(~idx,2));
title("Predicted Negative Sentiment")

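Calculate Sentiment of Collections of Text

To calculate the sentiment of a piece of text, for example a social media update, predict the sentiment score of each word in the text and take the mean of those scores.

Read the text data from the file weekendUpdates.xlsx and view the first 10 updates.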

filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;
textData(1:10)

ans = 10×1 string array
    "Happy anniversary! ❤ Next stop: Paris! ✈ #vacation"
    "Haha, BBQ on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
    "getting ready for Saturday night 🍕 #yum #weekend 😎"
    "Say it with me - I NEED A #VACATION!!! ☹"
    "😎 Chilling 😎 at home for the first time in ages…This is the life! 👍 #weekend"
    "My last #weekend before the exam 😢 👎."
    "can’t believe my #vacation is over 😢 so unfair"
    "Can’t wait for tennis this #weekend 🎾🍓🥂 😀"
    "I had so much fun! 😀😀😀 Best trip EVER! 😀😀😀 #vacation #weekend"
    "Hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"

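Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps in order:

Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Convert to lowercase using lower.

Use the preprocessing function preprocessText to prepare the text data. This step can take a few minutes to run.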

documents = preprocessText(textData);

Remove the words from the documents that do not appear in the word embedding emb.

idx = ~isVocabularyWord(emb,documents.Vocabulary);
documents = removeWords(documents,idx);

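To visualize how well the sentiment classifier generalizes to the new text, classify the sentiment of the words that occur in the text but not in the training data, and visualize them in word clouds. Use the word clouds to manually check that the classifier behaves as expected.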

words = documents.Vocabulary;
words(ismember(words,wordsTrain)) = [];

vec = word2vec(emb,words);
[YPred,scores] = predict(mdl,vec);

figure
subplot(1,2,1)
idx = YPred == "Positive";
wordcloud(words(idx),scores(idx,1));
title("Predicted Positive Sentiment")

subplot(1,2,2)
wordcloud(words(~idx),scores(~idx,2));
title("Predicted Negative Sentiment")

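Calculate the mean sentiment score of each update. For each document, convert the words to word vectors, predict the sentiment score of each word vector, and then take the mean of the scores. The loop below averages the raw SVM scores in the first column of scores; it does not apply a score-to-posterior transform.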

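% Preallocate one score per document (row vector).
sentimentScore = zeros(1,numel(documents));
for i = 1:numel(documents)
    % Words of document i that remain after preprocessing.
    words = string(documents(i));
    vec = word2vec(emb,words);
    % Column 1 of scores corresponds to the first class in mdl.ClassNames.
    [~,scores] = predict(mdl,vec);
    sentimentScore(i) = mean(scores(:,1));
end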
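
The averaging step can be seen in isolation with a few hypothetical per-word scores. The sketch also shows an edge case: if preprocessing removes every word from an update, the mean is taken over an empty set and the result is NaN. The names wordScores, docScore, and emptyScore are illustrative.

```matlab
% Hypothetical per-word scores for one update.
wordScores = [1.2; -0.4; 0.5];
docScore = mean(wordScores);

% An update with no remaining words has no scores to average.
emptyScore = mean(zeros(0,1));

disp(docScore)
disp(isnan(emptyScore))
```

Checking for NaN with isnan before interpreting sentimentScore avoids mistaking such updates for neutral ones.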

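View the predicted sentiment scores with the text data. Scores greater than 0 correspond to positive sentiment, scores less than 0 correspond to negative sentiment, and scores close to 0 correspond to neutral sentiment.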

table(sentimentScore', textData)

ans=50×2 table
       Var1                                                                textData                                                          
    __________    ___________________________________________________________________________________________________________________________

        1.8382    "Happy anniversary! ❤ Next stop: Paris! ✈ #vacation"                                                                       
         1.294    "Haha, BBQ on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"                                                           
        1.0922    "getting ready for Saturday night 🍕 #yum #weekend 😎"                                                                     
      0.094709    "Say it with me - I NEED A #VACATION!!! ☹"                                                                                 
        1.4073    "😎 Chilling 😎 at home for the first time in ages…This is the life! 👍 #weekend"                                          
       -0.8356    "My last #weekend before the exam 😢 👎."                                                                                  
       -1.3556    "can’t believe my #vacation is over 😢 so unfair"                                                                          
        1.4312    "Can’t wait for tennis this #weekend 🎾🍓🥂 😀"                                                                            
        3.0458    "I had so much fun! 😀😀😀 Best trip EVER! 😀😀😀 #vacation #weekend"                                                      
      -0.39243    "Hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"                                                      
        0.8028    "🎉 Check the out-of-office crew, we are officially ON #VACATION!! 😎"                                                     
       0.38217    "Well that wasn’t how I expected this #weekend to go 👎 Total washout!! 😢"                                                
          3.03    "So excited for my bestie to visit this #weekend! 😀 ❤ 😀"                                                                 
        2.3849    "Who needs a #vacation when the weather is this good ☀ 😎"                                                                 
    -0.0006176    "I love meetings in summer that run into the weekend! Wait that was sarcasm. Bring on the aircon apocalypse! 👎 ☹ #weekend"
       0.52992    "You know we all worked hard for this! We totes deserve this 🎉 #vacation 🎉 Ibiza ain’t gonna know what hit em 😎"        
      ⋮
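
The scores above are raw SVM scores, so their scale is unbounded. If probabilities are preferred, one possible approach is to fit a score-to-posterior transform with fitPosterior. The sketch below is not part of the original workflow: it uses synthetic two-dimensional data, and the names X, Y, mdlDemo, mdlPost, and pPositive are hypothetical.

```matlab
% Synthetic, overlapping two-class data (illustration only).
rng(0);
X = [randn(50,2) + 1; randn(50,2) - 1];
Y = categorical([repmat("Positive",50,1); repmat("Negative",50,1)]);

% Train an SVM, then fit the score-to-posterior transform.
mdlDemo = fitcsvm(X,Y);
mdlPost = fitPosterior(mdlDemo);

% After the transform, predict returns posterior probabilities.
[~,post] = predict(mdlPost,[2 2; -2 -2]);

% Select the Positive column by name instead of by position.
posCol = mdlPost.ClassNames == "Positive";
pPositive = post(:,posCol)
```

With posterior probabilities, the neutral point is 0.5 rather than 0, so the interpretation of the averaged scores changes accordingly.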

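Sentiment Lexicon Reading Function

This function reads the positive and negative words from the sentiment lexicon and returns a table. The table contains the variables Word and Label, where Label contains the categorical values Positive and Negative corresponding to the sentiment of each word.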

function data = readLexicon

% Read positive words (lines starting with ';' are comments)
fidPositive = fopen(fullfile('opinion-lexicon-English','positive-words.txt'));
C = textscan(fidPositive,'%s','CommentStyle',';');
fclose(fidPositive);
wordsPositive = string(C{1});

% Read negative words
fidNegative = fopen(fullfile('opinion-lexicon-English','negative-words.txt'));
C = textscan(fidNegative,'%s','CommentStyle',';');
fclose(fidNegative);
wordsNegative = string(C{1});

% Create table of labeled words
words = [wordsPositive;wordsNegative];
labels = categorical(nan(numel(words),1));
labels(1:numel(wordsPositive)) = "Positive";
labels(numel(wordsPositive)+1:end) = "Negative";

data = table(words,labels,'VariableNames',{'Word','Label'});

end

Preprocessing Function

The function preprocessText performs the following steps in order:

Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Convert to lowercase using lower.

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Convert to lowercase.
documents = lower(documents);

end

References

[1] Hu, Minqing, and Bing Liu. "Mining and summarizing customer reviews." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168-177. ACM, 2004.
