extractFileText

Read text from PDF, Microsoft Word, HTML, and plain text files

Syntax

str = extractFileText(filename)

str = extractFileText(filename,Name,Value)

Description

str = extractFileText(filename) reads the text data from a file as a string.

str = extractFileText(filename,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Extract Text Data from Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

str = extractFileText("sonnets.txt");

View the first sonnet.

i = strfind(str,"I");
ii = strfind(str,"II");
start = i(1);
fin = ii(1);
extractBetween(str,start,fin-1)

ans = 
    "I
     
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
       "

Extract Text Data from PDF

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF file.

str = extractFileText("exampleSonnets.pdf");

View the second sonnet.

ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
extractBetween(str,start,fin-1)

ans = 
    "II 
      
       When forty winters shall besiege thy brow, 
       And dig deep trenches in thy beauty's field, 
       Thy youth's proud livery so gazed on now, 
       Will be a tatter'd weed of small worth held: 
       Then being asked, where all thy beauty lies, 
       Where all the treasure of thy lusty days; 
       To say, within thine own deep sunken eyes, 
       Were an all-eating shame, and thriftless praise. 
       How much more praise deserv'd thy beauty's use, 
       If thou couldst answer 'This fair child of mine 
       Shall sum my count, and make my old excuse,' 
       Proving his beauty by succession thine! 
         This were to be new made when thou art old, 
         And see thy blood warm when thou feel'st it cold. 
      
       "

Extract the text from pages 3, 5, and 7 of the PDF file.

pages = [3 5 7];
str = extractFileText("exampleSonnets.pdf", ...
    'Pages',pages);

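View the tenth sonnet. Note that strfind matches substrings, so the first match for "X" is the one inside the heading "IX"; the extracted range therefore begins with the ninth sonnet.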

x = strfind(str,"X");
xi = strfind(str,"XI");
start = x(1);
fin = xi(1);
extractBetween(str,start,fin-1)

ans = 
    "X 
      
       Is it for fear to wet a widow's eye, 
       That thou consum'st thy self in single life? 
       Ah! if thou issueless shalt hap to die, 
       The world will wail thee like a makeless wife; 
       The world will be thy widow and still weep 
       That thou no form of thee hast left behind, 
       When every private widow well may keep 
       By children's eyes, her husband's shape in mind: 
       Look! what an unthrift in the world doth spend 
       Shifts but his place, for still the world enjoys it; 
       But beauty's waste hath in the world an end, 
       And kept unused the user so destroys it. 
         No love toward others in that bosom sits 
         That on himself such murd'rous shame commits. 
      
       X 
      
       For shame! deny that thou bear'st love to any, 
       Who for thy self art so unprovident. 
       Grant, if thou wilt, thou art belov'd of many, 
       But that thou none lov'st is most evident: 
       For thou art so possess'd with murderous hate, 
       That 'gainst thy self thou stick'st not to conspire, 
       Seeking that beauteous roof to ruinate 
       Which to repair should be thy chief desire. 
     
     
      
       "

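Import Text from Multiple Files Using a File Datastore

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify extractFileText as the read function.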

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords

bag = 
  bagOfWords with properties:

        NumWords: 0
          Counts: []
      Vocabulary: [1×0 string]
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4
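
As an alternative to the loop above, the files can be read in a single call with readall and passed to bagOfWords together. The following is a minimal sketch under stated assumptions, not the method used in this example: the folder, the file names demoDocN.txt, and their contents are hypothetical and are created only so that the snippet is self-contained.

```matlab
% Create two small text files in a temporary folder (hypothetical demo data).
folder = tempname;
mkdir(folder);
texts = ["the quick brown fox", "the lazy dog"];
for k = 1:numel(texts)
    fid = fopen(fullfile(folder,"demoDoc" + k + ".txt"),'w');
    fprintf(fid,'%s',texts(k));
    fclose(fid);
end

% Create a file datastore that reads each file with extractFileText.
fds = fileDatastore(fullfile(folder,"demoDoc*.txt"),'ReadFcn',@extractFileText);

% readall returns a cell array with one string per file.
data = readall(fds);
str = vertcat(data{:});

% Tokenize all documents at once and build the bag-of-words model.
documents = tokenizedDocument(str);
bag = bagOfWords(documents);

% Clean up the temporary folder.
rmdir(folder,'s');
```

Reading everything at once is convenient for small collections. The loop shown in the example keeps only one file in memory at a time, which may be preferable for large collections.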

Extract Text from HTML

To extract text data directly from HTML code, use extractHTMLText and specify the HTML code as a string.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)

str = 
    "THE SONNETS
     
     by William Shakespeare"

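Input Arguments

`filename` — Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector

Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.

Data Types: string | char | cell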

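Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Pages',[1 3 5] specifies to read pages 1, 3, and 5 from the PDF file.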
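
The two ways of writing a name-value argument can be compared side by side. The snippet below is an illustrative sketch: it writes a temporary plain text file (a hypothetical file created only for this demonstration) and reads it back with both syntaxes. The Name=Value form assumes R2021a or later.

```matlab
% Write a small UTF-8 text file to a temporary location (demo data only).
filename = [tempname '.txt'];
fid = fopen(filename,'w','n','UTF-8');
fprintf(fid,'%s',"Shall I compare thee to a summer's day?");
fclose(fid);

% Name=Value syntax (R2021a and later).
strA = extractFileText(filename,Encoding="UTF-8");

% Comma-separated syntax with the name in quotes (also works before R2021a).
strB = extractFileText(filename,'Encoding','UTF-8');

% Clean up the temporary file.
delete(filename);
```

Both calls pass the same option, so strA and strB should hold the same text.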

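`Encoding` — Character encoding
`'auto'` (default) | `'UTF-8'` | `'ISO-8859-1'` | `'windows-1251'` | `'windows-1252'` | ...

Character encoding to use, specified as the comma-separated pair consisting of 'Encoding' and a character vector or a string scalar. The character vector or string scalar must contain a standard character encoding scheme name such as one of the following.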

`"Big5"`	`"ISO-8859-1"`	`"windows-874"`
`"Big5-HKSCS"`	`"ISO-8859-2"`	`"windows-949"`
`"CP949"`	`"ISO-8859-3"`	`"windows-1250"`
`"EUC-KR"`	`"ISO-8859-4"`	`"windows-1251"`
`"EUC-JP"`	`"ISO-8859-5"`	`"windows-1252"`
`"EUC-TW"`	`"ISO-8859-6"`	`"windows-1253"`
`"GB18030"`	`"ISO-8859-7"`	`"windows-1254"`
`"GB2312"`	`"ISO-8859-8"`	`"windows-1255"`
`"GBK"`	`"ISO-8859-9"`	`"windows-1256"`
`"IBM866"`	`"ISO-8859-11"`	`"windows-1257"`
`"KOI8-R"`	`"ISO-8859-13"`	`"windows-1258"`
`"KOI8-U"`	`"ISO-8859-15"`	`"US-ASCII"`
	`"Macintosh"`	`"UTF-8"`
	`"Shift_JIS"`

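If you do not specify an encoding scheme, then the function performs heuristic auto-detection of the encoding to use. These heuristics depend on your locale. If the heuristics fail, then you must specify the encoding explicitly.

This option applies only when the input is a plain text file.

Data Types: char | string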
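
When the encoding of a file is known, specifying it explicitly avoids relying on auto-detection. The following sketch assumes a plain text file saved as ISO-8859-1; the file and its contents are hypothetical and are created here only to keep the snippet self-contained.

```matlab
% Write a text file containing non-ASCII characters in ISO-8859-1 (demo data only).
filename = [tempname '.txt'];
fid = fopen(filename,'w','n','ISO-8859-1');
fprintf(fid,'%s',"Café Zürich");
fclose(fid);

% Read the file back, stating the encoding explicitly instead of using 'auto'.
str = extractFileText(filename,'Encoding','ISO-8859-1');

% Clean up the temporary file.
delete(filename);
```

If the accented characters come back garbled when 'Encoding' is omitted, that suggests auto-detection chose a different scheme for the file.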

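`ExtractionMethod` — Extraction method
`'tree'` (default) | `'article'` | `'all-text'`

Extraction method, specified as the comma-separated pair consisting of 'ExtractionMethod' and one of the following:

Option	Description
`'tree'`	Analyze the DOM tree and text contents, then extract blocks of paragraphs.
`'article'`	Detect article text and extract blocks of paragraphs.
`'all-text'`	Extract all text in the HTML body, except for scripts and CSS styles.

This option supports HTML file input only.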
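
As an illustration of the 'all-text' option, the sketch below writes a small HTML file and extracts its body text. The file name and the HTML content are hypothetical and exist only for this demonstration.

```matlab
% Write a small HTML file containing a style block and a script (demo data only).
filename = [tempname '.html'];
code = "<html><head><style>p {color: red;}</style></head>" + ...
    "<body><h1>THE SONNETS</h1><p>by William Shakespeare</p>" + ...
    "<script>var x = 1;</script></body></html>";
fid = fopen(filename,'w','n','UTF-8');
fprintf(fid,'%s',code);
fclose(fid);

% Extract all text in the HTML body, excluding scripts and CSS styles.
str = extractFileText(filename,'ExtractionMethod','all-text');

% Clean up the temporary file.
delete(filename);
```

According to the option table above, the script and style contents should not appear in str, while the heading and paragraph text should.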

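`Password` — Password to open PDF file
character vector | string scalar

Password to open the PDF file, specified as the comma-separated pair consisting of 'Password' and a character vector or a string scalar. This option applies only when the input file is a PDF.

Example: 'Password','skroWhtaM'

Data Types: char | string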

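`Pages` — Pages to read from PDF file
vector of positive integers

Pages to read from the PDF file, specified as the comma-separated pair consisting of 'Pages' and a vector of positive integers. This option applies only when the input file is a PDF file. By default, the function reads all pages of the PDF file.

Example: 'Pages',[1 3 5]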

Tips

To read text directly from HTML code, use extractHTMLText.
To read the lines of a text file into a string array, use readlines.
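
The difference between the two readers can be sketched as follows. The temporary file and its three lines are hypothetical demo data; readlines assumes R2020b or later.

```matlab
% Write a three-line text file with no trailing newline (demo data only).
filename = [tempname '.txt'];
fid = fopen(filename,'w','n','UTF-8');
fprintf(fid,"line one\nline two\nline three");
fclose(fid);

% extractFileText returns the whole file as a single string scalar.
str = extractFileText(filename);

% readlines returns a string array with one element per line.
lines = readlines(filename);

% Clean up the temporary file.
delete(filename);
```

Use the string array from readlines when each line is a separate record, and the single string from extractFileText when the file is one document.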

Version History

Introduced in R2017b

R2020b: `extractFileText` does not support extracting text from Microsoft Word 97–2003 binary DOC files

Support for extracting text from Microsoft® Word 97–2003 binary DOC files using the extractFileText function has been removed. Microsoft Word DOCX files remain supported.

To extract text data from a Microsoft Word 97–2003 binary DOC file, first save the file as a PDF, Microsoft Word DOCX, HTML, or plain text file, and then use the extractFileText function.

See Also

pdfinfo | extractHTMLText | readPDFFormData | writeTextDocument | tokenizedDocument
