Extract Text Data from Files
This example shows how to extract text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.
Text Files
Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.
filename = "sonnets.txt";
str = extractFileText(filename);
View the first sonnet by extracting the text between the two titles "I" and "II".
start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = " From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee. "
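The marker-based extraction above can be tried without the example file. The following is a minimal sketch on a made-up two-section string (the text and the variable name section1 are assumptions for illustration, not part of the sonnets file). It shows why the start marker includes the newline: the marker then only matches a title line that ends right after the numeral.

```matlab
% Hypothetical miniature of the sonnets file: two titled sections.
str = " I" + newline + "First poem, line one." + newline + newline + ...
      " II" + newline + "Second poem, line one.";

% The start marker ends with a newline, so it matches the title line " I"
% and not a longer numeral such as " II".
start = " I" + newline;
fin = " II";
section1 = extractBetween(str,start,fin);

% Trim the blank lines and spaces left around the extracted text.
section1 = strip(section1)
```

Calling strip afterwards is optional. It removes the leading and trailing whitespace that remains between the markers and the text.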
For text files containing multiple documents separated by newline characters, use the readlines function.
filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
"From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
"When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
"Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."
Microsoft Word Documents
Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.
filename = "exampleSonnets.docx";
str = extractFileText(filename);
View the second sonnet by extracting the text between the two titles "II" and "III".
start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = " When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold. "
The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.
sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = " When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold. "
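A single call to replace makes one pass over the text, so a run of three or more newline characters can still leave a doubled newline behind. The following sketch, on a made-up string, repeats the replacement until no doubled newline remains. The variable names raw, collapsed, and parts are assumptions for illustration.

```matlab
% Made-up text with a double and a triple newline between lines.
raw = "line one" + newline + newline + "line two" + ...
      newline + newline + newline + "line three";

% Repeat the replacement until no doubled newline is left.
collapsed = raw;
while contains(collapsed,[newline newline])
    collapsed = replace(collapsed,[newline newline],newline);
end

% Split into lines to confirm that no empty lines remain.
parts = splitlines(collapsed)
```

For the example document, where lines are separated by exactly two newline characters, the single call shown above is enough.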
PDF Files
Extract text from PDF documents and data from PDF forms.
PDF Documents
Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.
filename = "exampleSonnets.pdf";
str = extractFileText(filename);
View the third sonnet by extracting the text between the two titles "III" and "IV". In this PDF, a space precedes each newline character.
start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = " Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee. "
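Instead of building the trailing space into the start marker, one option is to normalize the text first by removing the space before each newline. The following is a minimal sketch on a made-up string that imitates the PDF text. The string and the variable names pdfText, normalized, and section3 are assumptions for illustration.

```matlab
% Made-up stand-in for PDF text: every line ends with a space before the newline.
pdfText = " III " + newline + "Look in thy glass " + newline + ...
          "Now is the time " + newline + "IV";

% Remove the space that precedes each newline character.
normalized = replace(pdfText," " + newline,newline);

% The same style of marker used for the other file types now works.
section3 = extractBetween(normalized," III" + newline,"IV")
```

After normalization, the same markers can be reused across text extracted from different file formats.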
PDF Forms
To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.
filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
event_type: "Thunderstorm Wind"
event_narrative: "Large tree down between Plantersville and Nettleton."
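Once the form data is in a struct, each field can be accessed by name. The following sketch uses a hand-built struct as a stand-in for the output shown above, so it runs without the PDF file. It assumes that every field holds a string scalar, which may not be true for other forms, and the variable names are illustrative.

```matlab
% Stand-in for the struct returned by readPDFFormData (values copied from
% the output above; assumes every field holds a string scalar).
data = struct('event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Collect field names and values using dynamic field references.
names = fieldnames(data);
values = strings(numel(names),1);
for k = 1:numel(names)
    values(k) = data.(names{k});
end

% Arrange the result as a table with one row per form field.
formTable = table(string(names),values,'VariableNames',["Field" "Value"])
```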
HTML
Extract text from HTML files, HTML code, and the web.
HTML Files
To extract text data from a saved HTML file, use extractFileText.
filename = "exampleSonnets.html";
str = extractFileText(filename);
View the fourth sonnet by extracting the text between the two titles "IV" and "V".
start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = " Unthrifty loveliness, why dost thou spend Upon thy self thy beauty's legacy? Nature's bequest gives nothing, but doth lend, And being frank she lends to those are free: Then, beauteous niggard, why dost thou abuse The bounteous largess given thee to give? Profitless usurer, why dost thou use So great a sum of sums, yet canst not live? For having traffic with thy self alone, Thou of thy self thy sweet self dost deceive: Then how when nature calls thee to be gone, What acceptable audit canst thou leave? Thy unused beauty must be tombed with thee, Which, used, lives th' executor to be. "
HTML Code
To extract text data from a string containing HTML code, use extractHTMLText.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"
From the Web
To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 'Text Analytics Toolbox Analyze and model text data Release Notes PDF Documentation Release Notes PDF Documentation Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling. Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models. Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data. Get Started Learn the basics of Text Analytics Toolbox Text Data Preparation Import text data into MATLAB® and preprocess it for analysis Modeling and Prediction Develop predictive models using topic models and word embeddings Display and Presentation Visualize text data and models using word clouds and text scatter plots Language Support Information on language support in Text Analytics Toolbox'
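Web requests can fail or time out, so it can help to guard the call. The following sketch adds a timeout through weboptions and falls back to an empty string on failure. The 15-second timeout and the fallback value are arbitrary choices for illustration, not recommendations from this example.

```matlab
url = "https://www.mathworks.com/help/textanalytics";

% Request the page as text and give up after 15 seconds (arbitrary choice).
options = weboptions('Timeout',15,'ContentType','text');

try
    code = webread(url,options);
    str = extractHTMLText(code);
catch err
    % Network problems end up here; continue with an empty result.
    warning("Could not read %s: %s",url,err.message);
    str = "";
end
```

The text returned for a live page changes over time, so the result may differ from the output shown above.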
Parse HTML Code
To find particular elements of HTML code, parse the code using htmlTree and then use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with the element name "A".
tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);
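The same parse-and-select steps can be tried on a small inline document, which makes the result easier to inspect than a full web page. The following is a minimal sketch. The HTML snippet and the variable names snippet, snippetTree, links, numLinks, and linkText are made up for illustration.

```matlab
% Made-up HTML snippet containing two hyperlinks.
snippet = "<html><body><p><a href=""/a"">First</a> and " + ...
          "<a href=""/b"">Second</a></p></body></html>";

% Parse the code and select every element named "A".
snippetTree = htmlTree(snippet);
links = findElement(snippetTree,"A");

% Count the matches and extract the text of each one.
numLinks = numel(links)
linkText = extractHTMLText(links)
```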
View the first 10 subtrees and extract the text using extractHTMLText.
subtrees(1:10)
ans = 10×1 htmlTree: <A class="skip_link sr-only" href="#skip_link_anchor">Skip to content</A> <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A> <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A> <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A> <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A> <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A> <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A> <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A> <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A> <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
str = extractHTMLText(subtrees);
View the extracted text of the first 10 hyperlinks.
str(1:10)
ans = 10×1 string
"Skip to content"
""
"Products"
"Solutions"
"Academia"
"Support"
"Community"
"Events"
"Get MATLAB"
""
To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.
attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
"#skip_link_anchor"
"https://www.mathworks.com?s_tid=gn_logo"
"https://www.mathworks.com/products.html?s_tid=gn_ps"
"https://www.mathworks.com/solutions.html?s_tid=gn_sol"
"https://www.mathworks.com/academia.html?s_tid=gn_acad"
"https://www.mathworks.com/support.html?s_tid=gn_supp"
"https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
"https://www.mathworks.com/company/events.html?s_tid=gn_ev"
"https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
"https://www.mathworks.com?s_tid=gn_logo"
CSV and Microsoft Excel Files
To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.
Extract the table data from factoryReports.csv using the readtable function and view the first few rows of the table.
T = readtable('factoryReports.csv','TextType','string');
head(T)
Description                                                              Category                Urgency     Resolution              Cost
_____________________________________________________________________    ____________________    ________    ____________________    _____
"Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
"Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
"There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
"Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
"Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
"Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
"A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
"Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38
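The 'TextType','string' option controls how readtable stores text columns. The following sketch writes a small temporary CSV file (the file name exampleReports.csv and its two rows are made up for illustration) and compares the column type with and without the option.

```matlab
% Write a small, made-up CSV file to the temporary folder.
filename = fullfile(tempdir,"exampleReports.csv");
fid = fopen(filename,'w');
fprintf(fid,'Description,Category,Cost\n');
fprintf(fid,'"Mixer tripped the fuses.",Electronic Failure,55\n');
fprintf(fid,'"A fuse is blown in the mixer.",Electronic Failure,441\n');
fclose(fid);

% With 'TextType','string', text columns are imported as string arrays.
reports = readtable(filename,'TextType','string');
classWithOption = class(reports.Description)

% By default, text columns are imported as cell arrays of character vectors.
reportsDefault = readtable(filename);
classDefault = class(reportsDefault.Description)

% Remove the temporary file.
delete(filename)
```

String arrays are convenient here because functions such as extractBetween and replace, used earlier in this example, operate on them directly.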
Extract the text data from the Description column and view the first few strings.
str = T.Description;
str(1:10)
ans = 10×1 string
"Items are occasionally getting stuck in the scanner spools."
"Loud rattling and banging sounds are coming from assembler pistons."
"There are cuts to the power when starting the plant."
"Fried capacitors in the assembler."
"Mixer tripped the fuses."
"Burst pipe in the constructing agent is spraying coolant."
"A fuse is blown in the mixer."
"Things continue to tumble off of the belt."
"Falling items from the conveyor belt."
"The scanner reel is split, it will soon begin to curve."
Extract Text from Multiple Files
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify extractFileText as the read function, pass it to fileDatastore as a function handle.
location = "exampleSonnet*.txt";
fds = fileDatastore(location,'ReadFcn',@extractFileText);
Loop over the files in the datastore and read each text file.
str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end
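As an alternative to the explicit loop, readall reads every file in the datastore in one call and returns the results in a cell array. The following sketch shows this on a temporary folder of made-up files. The folder, the file names exampleNoteN.txt, and the variable names noteStore, noteData, and notes are assumptions for illustration and are separate from the sonnet files.

```matlab
% Create a temporary folder containing three small, made-up text files.
folder = tempname;
mkdir(folder);
for n = 1:3
    fid = fopen(fullfile(folder,"exampleNote" + n + ".txt"),'w');
    fprintf(fid,'Text of note %d.',n);
    fclose(fid);
end

% Build the datastore with a wildcard, as for the sonnet files.
noteStore = fileDatastore(fullfile(folder,"exampleNote*.txt"), ...
    'ReadFcn',@extractFileText);

% Read all files at once; each cell holds the text of one file.
noteData = readall(noteStore);

% Stack the string scalars into a column vector.
notes = vertcat(noteData{:});

% Remove the temporary folder and its files.
rmdir(folder,'s')
```

The loop form is still useful when the files are too large to hold in memory together or when each file needs processing before the next one is read.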
View the extracted text.
str
str = 4×1 string
" From fairest creatures we desire increase,↵ That thereby beauty's rose might never die,↵ But as the riper should by time decease,↵ His tender heir might bear his memory:↵ But thou, contracted to thine own bright eyes,↵ Feed'st thy light's flame with self-substantial fuel,↵ Making a famine where abundance lies,↵ Thy self thy foe, to thy sweet self too cruel:↵ Thou that art now the world's fresh ornament,↵ And only herald to the gaudy spring,↵ Within thine own bud buriest thy content,↵ And tender churl mak'st waste in niggarding:↵ Pity the world, or else this glutton be,↵ To eat the world's due, by the grave and thee."
" When forty winters shall besiege thy brow,↵ And dig deep trenches in thy beauty's field,↵ Thy youth's proud livery so gazed on now,↵ Will be a tatter'd weed of small worth held:↵ Then being asked, where all thy beauty lies,↵ Where all the treasure of thy lusty days;↵ To say, within thine own deep sunken eyes,↵ Were an all-eating shame, and thriftless praise.↵ How much more praise deserv'd thy beauty's use,↵ If thou couldst answer 'This fair child of mine↵ Shall sum my count, and make my old excuse,'↵ Proving his beauty by succession thine!↵ This were to be new made when thou art old,↵ And see thy blood warm when thou feel'st it cold."
" Look in thy glass and tell the face thou viewest↵ Now is the time that face should form another;↵ Whose fresh repair if now thou not renewest,↵ Thou dost beguile the world, unbless some mother.↵ For where is she so fair whose unear'd womb↵ Disdains the tillage of thy husbandry?↵ Or who is he so fond will be the tomb,↵ Of his self-love to stop posterity?↵ Thou art thy mother's glass and she in thee↵ Calls back the lovely April of her prime;↵ So thou through windows of thine age shalt see,↵ Despite of wrinkles this thy golden time.↵ But if thou live, remember'd not to be,↵ Die single and thine image dies with thee."
" Unthrifty loveliness, why dost thou spend↵ Upon thy self thy beauty's legacy?↵ Nature's bequest gives nothing, but doth lend,↵ And being frank she lends to those are free:↵ Then, beauteous niggard, why dost thou abuse↵ The bounteous largess given thee to give?↵ Profitless usurer, why dost thou use↵ So great a sum of sums, yet canst not live?↵ For having traffic with thy self alone,↵ Thou of thy self thy sweet self dost deceive:↵ Then how when nature calls thee to be gone,↵ What acceptable audit canst thou leave?↵ Thy unused beauty must be tombed with thee,↵ Which, used, lives th' executor to be."
See Also
pdfinfo | extractFileText | readPDFFormData | extractHTMLText | tokenizedDocument