Word-By-Word Text Generation Using Deep Learning

This example shows how to train a deep learning LSTM network to generate text word by word.

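To train a deep learning network for word-by-word text generation, train a sequence-to-sequence LSTM network to predict the next word in a sequence of words. To train the network to predict the next word, specify the responses to be the input sequences shifted by one time step.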
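
As a minimal sketch of what "shifted by one time step" means, the following toy example pairs each input token with the token the network should predict next. The word list is hypothetical, and the token names "startOfText" and "EndOfText" are taken from the strings used later in this example. How documentGenerationDatastore builds its sequences internally is defined in the supporting file, so treat this as an illustration of the idea only.

```matlab
% Toy document (hypothetical tokens, not taken from the training data).
words = ["Alice" "was" "beginning" "to" "get" "very" "tired"];

% Assumed token names, following the "startOfText" and "EndOfText"
% strings that appear later in this example.
predictors = ["startOfText" words];   % network input at each time step
responses  = [words "EndOfText"];     % target at the same time step

% Show the input/target pairs side by side, one time step per row.
pairs = [predictors; responses].'
```

Each row of pairs holds the input at one time step and the word that follows it, so both sequences have the same length.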

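This example reads text from a website. It reads and parses the HTML code to extract the relevant text, then uses a custom mini-batch datastore documentGenerationDatastore to input the documents to the network as mini-batches of sequence data. The datastore converts documents to sequences of numeric word indices. The deep learning network is an LSTM network that contains a word embedding layer.

A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications. Use mini-batch datastores to read out-of-memory data or to perform specific preprocessing operations when reading batches of data.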

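You can adapt the custom mini-batch datastore specified by documentGenerationDatastore.m to your data by customizing its functions. This file is attached to this example as a supporting file. To access this file, open the example as a live script. For an example showing how to create your own custom mini-batch datastore, see Develop Custom Mini-Batch Datastore.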

Load Training Data

Load the training data. Read the HTML code from Alice's Adventures in Wonderland by Lewis Carroll from Project Gutenberg.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

Parse HTML Code

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using htmlTree and then finding all the elements with element name "p".

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);

Extract the text data from the HTML subtrees using extractHTMLText and view the first 10 paragraphs.

textData = extractHTMLText(subtrees);
textData(1:10)
ans = 10×1 string
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”"
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."
    "In another moment down went Alice after it, never once considering how in the world she was to get out again."
    "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."
    "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled “ORANGE MARMALADE”, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody underneath, so managed to put it into one of the cupboards as she fell past it."
    "“Well!” thought Alice to herself, “after such a fall as this, I shall think nothing of tumbling down stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (Which was very likely true.)"
    "Down, down, down. Would the fall never come to an end? “I wonder how many miles I’ve fallen by this time?” she said aloud. “I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think-” (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) “-yes, that’s about the right distance-but then I wonder what Latitude or Longitude I’ve got to?” (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)"
    "Presently she began again. “I wonder if I shall fall right through the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think-” (she was rather glad there was no one listening, this time, as it didn’t sound at all the right word) “-but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?” (and she tried to curtsey as she spoke-fancy curtseying as you’re falling through the air! Do you think you could manage it?) “And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.”"
    "Down, down, down. There was nothing else to do, so Alice soon began talking again. “Dinah’ll miss me very much to-night, I should think!” (Dinah was the cat.) “I hope they’ll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I’m afraid, but you might catch a bat, and that’s very like a mouse, you know. But do cats eat bats, I wonder?” And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, “Do cats eat bats? Do cats eat bats?” and sometimes, “Do bats eat cats?” for, you see, as she couldn’t answer either question, it didn’t much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, “Now, Dinah, tell me the truth: did you ever eat a bat?” when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."

Remove the empty paragraphs and view the first 10 remaining paragraphs.

textData(textData == "") = [];
textData(1:10)
ans = 10×1 string
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”"
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."
    "In another moment down went Alice after it, never once considering how in the world she was to get out again."
    "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."
    "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled “ORANGE MARMALADE”, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody underneath, so managed to put it into one of the cupboards as she fell past it."
    "“Well!” thought Alice to herself, “after such a fall as this, I shall think nothing of tumbling down stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (Which was very likely true.)"
    "Down, down, down. Would the fall never come to an end? “I wonder how many miles I’ve fallen by this time?” she said aloud. “I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think-” (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) “-yes, that’s about the right distance-but then I wonder what Latitude or Longitude I’ve got to?” (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)"
    "Presently she began again. “I wonder if I shall fall right through the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think-” (she was rather glad there was no one listening, this time, as it didn’t sound at all the right word) “-but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?” (and she tried to curtsey as she spoke-fancy curtseying as you’re falling through the air! Do you think you could manage it?) “And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.”"
    "Down, down, down. There was nothing else to do, so Alice soon began talking again. “Dinah’ll miss me very much to-night, I should think!” (Dinah was the cat.) “I hope they’ll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I’m afraid, but you might catch a bat, and that’s very like a mouse, you know. But do cats eat bats, I wonder?” And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, “Do cats eat bats? Do cats eat bats?” and sometimes, “Do bats eat cats?” for, you see, as she couldn’t answer either question, it didn’t much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, “Now, Dinah, tell me the truth: did you ever eat a bat?” when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."

Visualize the text data in a word cloud.

figure
wordcloud(textData);
title("Alice's Adventures in Wonderland")
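
The parsing steps in this section can also be tried without a network connection. The following sketch applies the same sequence of calls (htmlTree, findElement, extractHTMLText, then removal of empty strings) to a small inline HTML string. The HTML content is hypothetical, and the sketch assumes that an empty <p> element yields an empty string, as the filtering step in this example suggests.

```matlab
% Inline HTML (hypothetical content) with a heading, two paragraphs,
% and one empty paragraph.
code = "<html><body><h1>Title</h1><p>First paragraph.</p><p></p><p>Second paragraph.</p></body></html>";

% Parse the HTML and select every <p> element.
tree = htmlTree(code);
subtrees = findElement(tree,"p");

% Extract the text of each selected element.
textData = extractHTMLText(subtrees);

% Drop entries that contain no text.
textData(textData == "") = [];
```

The heading text is not selected because only elements named "p" are passed to extractHTMLText.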

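Prepare Data for Training

Create a datastore that contains the data for training using documentGenerationDatastore. For the predictors, this datastore converts the documents into sequences of word indices using a word encoding. The first word index for each document corresponds to a "start of text" token. The "start of text" token is given by the string "startOfText". For the responses, the datastore returns categorical sequences of the words shifted by one.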

Tokenize the text data using tokenizedDocument.

documents = tokenizedDocument(textData);

Create a document generation datastore using the tokenized documents.

ds = documentGenerationDatastore(documents);
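
To make the conversion from documents to word indices more concrete, here is a small sketch that uses the Text Analytics Toolbox functions wordEncoding, word2ind, ind2word, and doc2sequence on two hypothetical documents. Whether documentGenerationDatastore calls these exact functions is an assumption; the datastore also adds the "startOfText" token, which this sketch omits.

```matlab
% Two short hypothetical documents.
documents = tokenizedDocument(["the cat sat" "the dog sat down"]);

% Build a word encoding that maps each vocabulary word to an index.
enc = wordEncoding(documents);

% Look up the indices of individual words, then map them back to words.
idx = word2ind(enc,["the" "sat"]);
words = ind2word(enc,idx);

% Convert each document to a sequence of word indices, without padding.
sequences = doc2sequence(enc,documents,'PaddingDirection','none');
```

Each element of sequences is a numeric vector with one index per token, which is the form of input a network with a word embedding layer expects.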

To reduce the amount of padding added to the sequences, sort the documents in the datastore by sequence length.

ds = sort(ds);
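
The effect of sorting on padding can be checked with a little arithmetic. The sketch below uses hypothetical sequence lengths and assumes that every mini-batch is padded to the length of its longest sequence. It compares the total number of padding values with and without sorting.

```matlab
% Hypothetical sequence lengths for six documents.
sequenceLengths = [3 12 4 11 5 10];
miniBatchSize = 2;

% Arrange the lengths so that each column is one mini-batch.
% Assumes the number of documents is a multiple of the mini-batch size.
batchLengths = @(L) reshape(L,miniBatchSize,[]);

% Total padding when each mini-batch is padded to its longest sequence.
totalPadding = @(L) sum(max(batchLengths(L),[],1) - batchLengths(L),'all');

paddingUnsorted = totalPadding(sequenceLengths)        % 21
paddingSorted   = totalPadding(sort(sequenceLengths))  % 7
```

With sorted lengths, sequences of similar size share a mini-batch, so far fewer padding values are needed.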

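Create and Train LSTM Network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to 1. Next, include a word embedding layer of dimension 100 with the same number of words as the word encoding. Next, include an LSTM layer and specify the hidden size to be 100, followed by a dropout layer with probability 0.2. Finally, add a fully connected layer with the same size as the number of classes, a softmax layer, and a classification layer. The number of classes is the number of words in the vocabulary plus an extra class for the "end of text" class.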

inputSize = 1;
embeddingDimension = 100;
numWords = numel(ds.Encoding.Vocabulary);
numClasses = numWords + 1;

layers = [ 
    sequenceInputLayer(inputSize)
    wordEmbeddingLayer(embeddingDimension,numWords)
    lstmLayer(100)
    dropoutLayer(0.2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

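Specify the training options. Specify the solver to be 'adam'. Train for 300 epochs with a learning rate of 0.01. Set the mini-batch size to 32. To keep the data sorted by sequence length, set the 'Shuffle' option to 'never'. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.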

options = trainingOptions('adam', ...
    'MaxEpochs',300, ...
    'InitialLearnRate',0.01, ...
    'MiniBatchSize',32, ...
    'Shuffle','never', ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the network using trainNetwork.

net = trainNetwork(ds,layers,options);

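Generate New Text

Generate the first word of the text by sampling a word from a probability distribution based on the first words of the text in the training data. Generate the remaining words by using the trained LSTM network to predict the next time step from the current sequence of generated text. Keep generating words one by one until the network predicts the "end of text" word.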

To make the first prediction using the network, input the index that represents the "start of text" token. Find the index by using the word2ind function with the word encoding used by the document datastore.

enc = ds.Encoding;
wordIndex = word2ind(enc,"startOfText")
wordIndex = 1

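For the remaining predictions, sample the next word according to the prediction scores of the network. The prediction scores represent the probability distribution of the next word. Sample the words from the vocabulary given by the class names of the output layer of the network.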

vocabulary = string(net.Layers(end).Classes);
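
Sampling a word according to the prediction scores amounts to drawing from a discrete probability distribution. The generation loop in this example uses datasample from Statistics and Machine Learning Toolbox for this. As a sketch of the same idea using only base MATLAB functions, the following code applies inverse-CDF sampling to a hypothetical three-word vocabulary with made-up scores.

```matlab
% Hypothetical vocabulary and prediction scores (probabilities).
vocabulary = ["cat" "hat" "EndOfText"];
wordScores = [0.5 0.3 0.2];

% Inverse-CDF sampling: draw u and pick the first word whose
% cumulative probability reaches u.
cdf = cumsum(wordScores);
u = rand*cdf(end);                 % scale by the total to absorb rounding error
idx = find(u <= cdf,1,'first');
newWord = vocabulary(idx);

% Draw many samples at once to check that the empirical frequencies
% approach the scores.
numSamples = 1e4;
u = rand(numSamples,1)*cdf(end);
idx = sum(u > cdf,2) + 1;          % index of the first cdf entry >= u
frequencies = accumarray(idx,1,[numel(vocabulary) 1])/numSamples;
```

Because words are sampled rather than chosen by maximum score, repeated runs of the generation loop can produce different text from the same trained network.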

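Make predictions word by word using predictAndUpdateState. For each prediction, input the index of the previous word. Stop predicting when the network predicts the end of text word or when the generated text is 500 characters long. For large collections of data, long sequences, or large networks, predictions on the GPU are usually faster to compute than predictions on the CPU. Otherwise, predictions on the CPU are usually faster to compute. For single time step predictions, use the CPU. To use the CPU for prediction, set the 'ExecutionEnvironment' option of predictAndUpdateState to 'cpu'.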

generatedText = "";
maxLength = 500;
while strlength(generatedText) < maxLength
    % Predict the next word scores.
    [net,wordScores] = predictAndUpdateState(net,wordIndex,'ExecutionEnvironment','cpu');
    
    % Sample the next word.
    newWord = datasample(vocabulary,1,'Weights',wordScores);
    
    % Stop predicting at the end of text.
    if newWord == "EndOfText"
        break
    end
    
    % Add the word to the generated text.
    generatedText = generatedText + " " + newWord;
    
    % Find the word index for the next input.
    wordIndex = word2ind(enc,newWord);
end

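The generation process introduces whitespace characters between each prediction, which means that some punctuation characters appear with unnecessary spaces before and after. Reconstruct the generated text by removing the spaces before and after the appropriate punctuation characters.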
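
As a minimal sketch of this cleanup, the following code applies the same kind of replace calls to a short hypothetical string in which every token is preceded by a space. The variable names closing and opening are introduced here for illustration only, and the final strip call is an extra step that the example itself does not perform.

```matlab
% Hypothetical text with a space in front of every token.
str = " Oh dear , said Alice ( very quietly ) .";

% Remove the space before closing punctuation.
closing = ["." "," ")" "?" "!"];
str = replace(str," " + closing,closing);

% Remove the space after opening punctuation.
opening = "(";
str = replace(str,opening + " ",opening);

% Trim the leading space left in front of the first token.
str = strip(str)
```

The result is "Oh dear, said Alice (very quietly)." because each pattern pairs a punctuation character with its adjacent space and replaces the pair with the character alone.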

Remove the spaces that appear before the specified punctuation characters.

punctuationCharacters = ["." "," "’" ")" ":" "?" "!"];
generatedText = replace(generatedText," " + punctuationCharacters,punctuationCharacters);

Remove the spaces that appear after the specified punctuation characters.

punctuationCharacters = ["(" "‘"];
generatedText = replace(generatedText,punctuationCharacters + " ",punctuationCharacters)
generatedText = 
" “ Just about as much right, ” said the Duchess, “ and that’s all the least, ” said the Hatter. “ Fetch me to my witness at the shepherd heart of him."

To generate multiple pieces of text, reset the network state between generations using resetState.

net = resetState(net);
