addDocument

Add documents to bag-of-words or bag-of-n-grams model

Syntax

newBag = addDocument(bag,documents)

Description

newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.

Examples

Add Documents to Bag-of-Words Model

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 7
          Counts: [2×7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
    NumDocuments: 2

Create another array of tokenized documents and add it to the same bag-of-words model.

documents = tokenizedDocument([ 
    "a third example of a short sentence" 
    "another short sentence"]);
newBag = addDocument(bag,documents)

newBag = 
  bagOfWords with properties:

        NumWords: 9
          Counts: [4×9 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"    "third"    "another"]
    NumDocuments: 4
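
As a rough check on the result above, the following sketch rebuilds both models and inspects the updated count matrix. The variable names moreDocuments, countsSize, sameLeadingWords, and newRows are illustrative and not part of the original example. It assumes, as the displayed output suggests, that the existing vocabulary is kept as the leading entries of the updated vocabulary.

```matlab
% Rebuild the two models from the example above.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% moreDocuments is an illustrative name for the second array.
moreDocuments = tokenizedDocument([
    "a third example of a short sentence"
    "another short sentence"]);
newBag = addDocument(bag,moreDocuments);

% The count matrix has one row per document and one column per word.
countsSize = size(newBag.Counts)

% Compare the leading entries of the updated vocabulary
% with the original vocabulary.
sameLeadingWords = isequal(newBag.Vocabulary(1:bag.NumWords),bag.Vocabulary)

% Word counts for the two added documents (rows 3 and 4), as a full matrix.
newRows = full(newBag.Counts(3:4,:))
```

The rows of Counts for the added documents follow the rows of the original documents, so rows 3 and 4 correspond to the two new sentences.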

Import Text from Multiple Files Using a File Datastore

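If your text data is contained in multiple files in a single folder, you can import it into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names of the form "exampleSonnetN.txt", where N is the number of the sonnet. Specify extractFileText as the read function.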

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords

bag = 
  bagOfWords with properties:

        NumWords: 0
          Counts: []
      Vocabulary: [1×0 string]
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4

Input Arguments

`bag` — Input bag-of-words or bag-of-n-grams model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

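`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.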
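
A minimal sketch of the non-tokenizedDocument input forms described above, assuming the words have already been split by hand. The variable names words and cellWords are illustrative only. Each call adds exactly one document, because a row vector of words represents a single document.

```matlab
% Start from an empty model.
bag = bagOfWords;

% A string row vector: each element is one word of a single document.
words = ["an" "example" "of" "a" "short" "sentence"];
bag = addDocument(bag,words);

% A cell array of character vectors, also a row vector of words.
cellWords = {'a','second','short','sentence'};
bag = addDocument(bag,cellWords);

% Inspect how many documents and distinct words the model now holds.
numDocs = bag.NumDocuments
numWords = bag.NumWords
```

To add several documents in one call, wrap the text in a tokenizedDocument array instead.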

Output Arguments

`newBag` — Output model
`bagOfWords` object | `bagOfNgrams` object

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.
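
The statement that newBag has the same type as bag can be illustrated with a bag-of-n-grams model. This is a sketch under the assumption that bagOfNgrams is created with its default settings. The sentences are reused from the examples on this page, and newDocuments and sameType are illustrative names.

```matlab
% Create a bag-of-n-grams model from one tokenized document.
documents = tokenizedDocument("an example of a short sentence");
bag = bagOfNgrams(documents);

% Add a second tokenized document to the same model.
newDocuments = tokenizedDocument("a second short sentence");
newBag = addDocument(bag,newDocuments);

% The output model has the same class as the input model.
sameType = strcmp(class(newBag),class(bag))
numDocs = newBag.NumDocuments
```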

Version History

Introduced in R2017b

See Also

bagOfWords | bagOfNgrams | removeDocument | removeEmptyDocuments | tokenizedDocument
