bagOfWords

Bag-of-words model

Description

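A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.

bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.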

Creation

Syntax

bag = bagOfWords

bag = bagOfWords(documents)

bag = bagOfWords(uniqueWords,counts)

Description

bag = bagOfWords creates an empty bag-of-words model.

bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.

bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.

Input Arguments

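`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.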
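
The two accepted input forms can be illustrated with a minimal sketch. The variable names and the sample sentences below are illustrative only and assume Text Analytics Toolbox is available.

```matlab
% A single document given as a row vector of words (string array).
words = ["an" "example" "of" "a" "short" "sentence"];
bagSingle = bagOfWords(words);

% Multiple documents must be passed as a tokenizedDocument array.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bagMulti = bagOfWords(documents);

% Compare how many documents each model has seen.
disp(bagSingle.NumDocuments)
disp(bagMulti.NumDocuments)
```

The row-vector form is convenient for a quick test with one document; anything involving several documents goes through tokenizedDocument first.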

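`uniqueWords` — Unique word list
string vector | cell array of character vectors

Unique word list, specified as a string vector or a cell array of character vectors. If uniqueWords contains <missing>, then the function ignores the missing values. The size of uniqueWords must be 1-by-V, where V is the number of columns of counts.

Example: ["an" "example" "list"]

Data Types: string | cell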

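`counts` — Frequency counts of words
matrix of nonnegative integers

Frequency counts of words corresponding to uniqueWords, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the word uniqueWords(j) appears in the ith document.

counts must have numel(uniqueWords) columns.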

Properties

`Counts` — Word counts per document
sparse matrix

Word counts per document, specified as a sparse matrix.

`NumDocuments` — Number of documents seen
nonnegative integer

Number of documents seen, specified as a nonnegative integer.

`NumWords` — Number of unique words in model
nonnegative integer

Number of unique words in the model, specified as a nonnegative integer.

`Vocabulary` — Unique words in model
string vector

Unique words in the model, specified as a string vector.

Data Types: string
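
As a rough sketch of how these properties relate to each other, the snippet below builds a small model and reads the properties back. The sample sentences and variable names are illustrative assumptions, not part of the reference page.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% Counts has one row per document and one column per vocabulary word.
sz = size(bag.Counts);
disp(isequal(sz,[bag.NumDocuments bag.NumWords]))

% Sum each column to get the total count of every word across documents.
totalCounts = full(sum(bag.Counts,1));

% Pair each word with its total count in a table.
tbl = table(bag.Vocabulary',totalCounts', ...
    'VariableNames',{'Word','Count'});
disp(tbl)
```

Because Counts is stored as a sparse matrix, wrap it in full before passing it to code that expects a dense array.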

Object Functions

`encode`	Encode documents as matrix of word or n-gram counts
`tfidf`	Term frequency-inverse document frequency (tf-idf) matrix
`topkwords`	Most important words in bag-of-words model or LDA topic
`addDocument`	Add documents to bag-of-words or bag-of-n-grams model
`removeDocument`	Remove documents from bag-of-words or bag-of-n-grams model
`removeEmptyDocuments`	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
`removeWords`	Remove selected words from documents or bag-of-words model
`removeInfrequentWords`	Remove words with low counts from bag-of-words model
`join`	Combine multiple bag-of-words or bag-of-n-grams models
`wordcloud`	Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
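
Two of the listed functions that do not appear in the examples on this page, removeInfrequentWords and encode, can be sketched as follows. The word list, the counts, and the new document are made-up illustration data.

```matlab
uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts);

% Remove words that appear two times or fewer in total across all documents.
newBag = removeInfrequentWords(bag,2);
disp(newBag.Vocabulary)

% Encode a new document as a row of counts over the reduced vocabulary.
newDocument = tokenizedDocument("another example sentence");
M = encode(newBag,newDocument);
disp(full(M))
```

encode returns one row per input document and one column per word in the model vocabulary, so the result lines up with the Counts property of the same model.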

Examples

Create Bag-of-Words Model

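Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.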

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

View the top 10 words and their total counts.

tbl = topkwords(bag,10)

tbl=10×2 table
     Word      Count
    _______    _____

    "thy"       281 
    "thou"      234 
    "love"      162 
    "thee"      161 
    "doth"       88 
    "mine"       63 
    "shall"      59 
    "eyes"       56 
    "sweet"      55 
    "time"       53

Create Bag-of-Words Model from Unique Words and Counts

Create a bag-of-words model using a string array of unique words and a matrix of word counts.

uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts)

bag = 
  bagOfWords with properties:

        NumWords: 7
          Counts: [4×7 double]
      Vocabulary: ["a"    "an"    "another"    "example"    "final"    "sentence"    "third"]
    NumDocuments: 4
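
To make the correspondence between counts(i,j) and the model concrete, here is a small sketch that reads one count back from the model built from the same inputs. The lookup variables docIdx, wordIdx, and n are hypothetical names introduced for illustration.

```matlab
uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts);

% Look up how often "sentence" appears in the second document.
docIdx = 2;
wordIdx = find(bag.Vocabulary == "sentence");
n = full(bag.Counts(docIdx,wordIdx));

% The value should match the entry supplied in the counts matrix.
disp(isequal(n,counts(docIdx,uniqueWords == "sentence")))
```

Looking the column up by word rather than by a fixed index keeps the code correct even if the vocabulary order differs from the order of uniqueWords.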

Import Text from Multiple Files Using a File Datastore

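If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.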

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords

bag = 
  bagOfWords with properties:

        NumWords: 0
          Counts: []
      Vocabulary: [1×0 string]
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4

Remove Stop Words from Bag-of-Words Model

Remove the stop words from a bag-of-words model by passing a list of stop words to removeWords. Stop words are words such as "a", "the", and "in" that are commonly removed from text before analysis.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)

newBag = 
  bagOfWords with properties:

        NumWords: 4
          Counts: [2×4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
    NumDocuments: 2
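
removeWords also accepts a word list of your own. The sketch below assumes a hypothetical custom list, customWords, applied after the built-in stop word list.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% First remove the built-in English stop words.
newBag = removeWords(bag,stopWords);

% Then remove a custom list of words (illustrative choice).
customWords = ["short" "second"];
newBag = removeWords(newBag,customWords);

disp(newBag.Vocabulary)
```

Applying the two lists in separate calls keeps the built-in list and the domain-specific list easy to adjust independently.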

Most Frequent Words of Bag-of-Words Model

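Create a table of the most frequent words of a bag-of-words model. First, load the example data, split it into documents at newline characters, and tokenize the documents.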

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

Find the top five words. When you do not specify the number of words, topkwords returns five.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)

T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

Create tf-idf Matrix

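Create a term frequency-inverse document frequency (tf-idf) matrix from a bag-of-words model. First, load the example data, split it into documents at newline characters, and tokenize the documents.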

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))

ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0
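
The values above are consistent with a weighting in which each entry is the raw word count multiplied by log(N/n), where N is the number of documents and n is the number of documents containing the word. The sketch below recomputes the matrix by hand on a small made-up model under that assumption; it relies on the default tfidf options and is not a statement about every weighting the function supports.

```matlab
uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts);

% tf-idf matrix from the toolbox function, using default options.
M = tfidf(bag);

% Manual computation, assuming raw counts and a natural-log idf.
C = full(bag.Counts);              % term frequencies (raw counts)
N = bag.NumDocuments;              % total number of documents
docFreq = sum(C > 0,1);            % documents containing each word
idf = log(N ./ docFreq);           % inverse document frequency
manual = C .* idf;                 % scale each column by its idf

% Largest absolute difference between the two results.
maxDiff = max(abs(full(M(:)) - manual(:)));
disp(maxDiff)
```

Note that a word appearing in every document, such as "example" here, gets an idf of zero and therefore contributes nothing to the matrix.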

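Create Word Cloud from Bag-of-Words Model

Load the example data, split it into documents at newline characters, and tokenize the documents.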

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Visualize the bag-of-words model using a word cloud.

figure
wordcloud(bag);

Figure contains an object of type wordcloud.

Create Bag-of-Words Model in Parallel

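If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel; otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a list of file names. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.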

filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];

Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model, and then loop over the files and create a bag-of-words model for each file.

bag = bagOfWords;

numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end

Combine the bag-of-words models using join.

bag = join(bag)

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4

Tips

If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords. Otherwise, the bag-of-words model may bias your analysis.
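
One way to follow this tip is sketched below: split the documents first, build the model from the training documents only, and then encode the held-out documents against that model. The sample text, the split size, and the variable names are assumptions made for illustration.

```matlab
textData = [
    "an example of a short sentence"
    "a second short sentence"
    "a third sentence with other words"
    "one final example"];
documents = tokenizedDocument(textData);

% Partition the documents before building the model.
rng(0)                                   % fixed seed so the split is repeatable
idx = randperm(numel(documents));
numTest = 1;
documentsTest = documents(idx(1:numTest));
documentsTrain = documents(idx(numTest+1:end));

% Build the vocabulary from the training documents only.
bag = bagOfWords(documentsTrain);
XTrain = bag.Counts;

% Encode the held-out documents using the training vocabulary.
XTest = encode(bag,documentsTest);
disp(size(XTest))
```

Words that occur only in the held-out documents are not part of the training vocabulary, so they do not get a column in XTest. That is the intended effect of partitioning first.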

Version History

Introduced in R2017b
