Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

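Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from common file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numeric representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.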

Text Analytics Toolbox supports English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Considerations.
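The workflow described above (preprocess raw text, extract words, convert to a numeric representation) can be sketched in a few lines. This is a minimal illustration and not the toolbox's recommended pipeline: the sample sentences are made up, and the order of the preprocessing steps is one reasonable choice among several. In practice the input text would typically come from a file, for example through `extractFileText`.

```matlab
% Hypothetical sample text (an assumption for illustration only).
str = [
    "The pump reported a pressure fault during startup."
    "Operators restarted the pump after the pressure fault."
    "The survey responses mention slow startup times."];

% Tokenize the raw text, then clean the tokens.
documents = tokenizedDocument(str);
documents = lower(documents);              % convert to lowercase
documents = erasePunctuation(documents);   % drop punctuation tokens
documents = removeStopWords(documents);    % drop words such as "the" and "a"

% Convert the documents to numeric representations.
bag = bagOfWords(documents);               % word counts per document
M = tfidf(bag);                            % tf-idf matrix, one row per document
```

Each row of `M` corresponds to one document and each column to one word in `bag.Vocabulary`, so `M` can be passed to functions that expect a numeric predictor matrix.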

Live Editor Tasks

Preprocess Text Data

Preprocess and clean up text data for analysis (Since R2023a)

Functions

Import and Export

`extractFileText`	Read text from PDF, Microsoft Word, HTML, and plain text files
`extractHTMLText`	Extract text from HTML
`readPDFFormData`	Read data from PDF forms
`pdfinfo`	PDF file information (Since R2023a)
`writeTextDocument`	Write documents to text file

HTML Parsing

`htmlTree`	Parsed HTML tree
`findElement`	Find elements in HTML tree
`getAttribute`	Read HTML attribute of root node of HTML tree
`ismissing`	Find HTML trees without values
`string`	Convert parsed HTML tree to string

Document Preprocessing

`tokenizedDocument`	Array of tokenized documents for text analysis
`erasePunctuation`	Erase punctuation from text and documents
`eraseTags`	Erase HTML and XML tags from text
`eraseURLs`	Erase HTTP and HTTPS URLs from text
`removeStopWords`	Remove stop words from documents
`removeShortWords`	Remove short words from documents or bag-of-words model
`removeLongWords`	Remove long words from documents or bag-of-words model
`removeWords`	Remove selected words from documents or bag-of-words model
`normalizeWords`	Stem or lemmatize words
`replaceWords`	Replace words in documents
`replaceNgrams`	Replace n-grams in documents
`splitSentences`	Split text into sentences
`splitParagraphs`	Split text into paragraphs (Since R2023a)
`stopWords`	List of stop words
`decodeHTMLEntities`	Convert HTML and XML entities into characters
`lower`	Convert documents to lowercase
`upper`	Convert documents to uppercase

Token Details

`context`	Search documents for word or n-gram occurrences in context
`tokenDetails`	Details of tokens in tokenized document array
`addSentenceDetails`	Add sentence numbers to documents
`addPartOfSpeechDetails`	Add part-of-speech tags to documents
`addLemmaDetails`	Add lemma forms of tokens to documents
`addLanguageDetails`	Add language identifiers to documents
`addEntityDetails`	Add entity tags to documents
`addDependencyDetails`	Add grammatical dependency details to documents (Since R2022b)
`addTypeDetails`	Add token type details to documents
`splitSentences`	Split text into sentences
`splitParagraphs`	Split text into paragraphs (Since R2023a)
`corpusLanguage`	Detect language of text
`abbreviations`	Table of common abbreviations
`topLevelDomains`	List of top-level domains

Word and N-Gram Counting

`bagOfWords`	Bag-of-words model
`bagOfNgrams`	Bag-of-n-grams model
`addDocument`	Add documents to bag-of-words or bag-of-n-grams model
`removeDocument`	Remove documents from bag-of-words or bag-of-n-grams model
`removeInfrequentWords`	Remove words with low counts from bag-of-words model
`removeInfrequentNgrams`	Remove infrequently seen n-grams from bag-of-n-grams model
`removeNgrams`	Remove n-grams from bag-of-n-grams model
`removeEmptyDocuments`	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
`topkwords`	Most important words in bag-of-words model or LDA topic
`topkngrams`	Most frequent n-grams
`encode`	Encode documents as matrix of word or n-gram counts
`tfidf`	Term frequency-inverse document frequency (tf-idf) matrix
`join`	Combine multiple bag-of-words or bag-of-n-grams models

Spelling Correction and Edit Distance

`correctSpelling`	Correct spelling of words (Since R2020a)
`editDistance`	Find edit distance between two strings or documents
`editDistanceSearcher`	Edit distance nearest neighbor searcher
`knnsearch`	Find nearest neighbors by edit distance
`rangesearch`	Find nearest neighbors by edit distance range
`splitGraphemes`	Split string into graphemes
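
As a rough illustration of how the edit distance functions listed above fit together, the following sketch computes the distance between two words and then looks up the closest entry in a small vocabulary. The vocabulary and the misspelled query are hypothetical, and the maximum distance of 2 is an arbitrary choice for this example.

```matlab
% Edit distance between two strings. With default options this counts
% single-character insertions, deletions, and substitutions.
d = editDistance("kitten", "sitting");

% Build a searcher over a hypothetical vocabulary, allowing matches
% within an edit distance of 2.
vocabulary = ["analysis" "analytics" "matlab" "toolbox"];
eds = editDistanceSearcher(vocabulary, 2);

% Find the vocabulary entry nearest to a misspelled word.
idx = knnsearch(eds, "analytcs");
nearest = vocabulary(idx);
```

Here `idx` is an index into `vocabulary`, so the same pattern can serve as the lookup step of a simple spelling corrector.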

Document Manipulation and Conversion

`docfun`	Apply function to words in documents
`containsWords`	Check if word is member of documents (Since R2022b)
`containsNgrams`	Check if n-gram is member of documents (Since R2022a)
`contains`	Check if pattern is substring in documents (Since R2022b)
`plus`	Append documents
`replace`	Replace substrings in documents
`regexprep`	Replace text in words of documents using regular expression
`doclength`	Length of documents in document array
`doc2cell`	Convert documents to cell array of string vectors
`joinWords`	Convert documents to string by joining words
`string`	Convert scalar document to string vector

Unicode

`textanalytics.unicode.nfc`	Unicode composed normalized form (NFC) (Since R2022b)
`textanalytics.unicode.nfd`	Unicode decomposed normalized form (NFD) (Since R2021a)
`textanalytics.unicode.nfkc`	Unicode compatibility composed normalized form (NFKC) (Since R2022b)
`textanalytics.unicode.nfkd`	Unicode compatibility decomposed normalized form (NFKD) (Since R2022b)
`textanalytics.unicode.UTF32`	Unicode UTF-32 string representation (Since R2021a)
`characterCategories`	Unicode character categories (Since R2021a)
`hex`	Convert UTF-32 representation to hexadecimal values (Since R2021a)
`string`	Convert UTF-32 representation to string (Since R2021a)

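Topics

Import

Extract Text Data from Files
This example shows how to extract text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Parse HTML and Extract Text Content
This example shows how to parse HTML code and extract the text content from particular elements.
Data Sets for Text Analytics
Discover data sets for various text analytics tasks.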

Preprocessing

Preprocess Text Data in Live Editor
Explore text preprocessing techniques using the Preprocess Text Data Live Editor task.
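Prepare Text Data for Analysis
This example shows how to create a function that cleans and preprocesses text data for analysis.
Analyze Text Data Containing Emojis
This example shows how to analyze text data containing emojis.
Correct Spelling in Documents
This example shows how to correct spelling in documents using Hunspell.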
Create Extension Dictionary for Spelling Correction
This example shows how to create a Hunspell extension dictionary for spelling correction.
Create Custom Spelling Correction Function Using Edit Distance Searchers
This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.
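Analyze Sentence Structure Using Grammatical Dependency Parsing
This example shows how to extract information from a sentence using grammatical dependency parsing.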

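Language Support

Language Considerations
Information on using Text Analytics Toolbox features for other languages.
Japanese Language Support
Information on Japanese support in Text Analytics Toolbox.
Analyze Japanese Text Data
This example shows how to import, prepare, and analyze Japanese text data using a topic model.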
German Language Support
Information on German support in Text Analytics Toolbox.
Analyze German Text Data
This example shows how to import, prepare, and analyze German text data using a topic model.