Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.

Text Files

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
    "
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
      "
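The role of newline in the start marker can be seen on a small in-memory string. The following is a minimal sketch, not part of the original example: the text in str is made up for illustration and stands in for the contents of a file. Because the start marker ends with a newline, it matches the title " I" but not the first two characters of " II".

```matlab
% Hypothetical text standing in for the contents of a file.
str = " I" + newline + "first poem" + newline + " II" + newline + "second poem";

% The trailing newline keeps " I" from matching inside " II".
start = " I" + newline;
fin = " II";
first = extractBetween(str,start,fin);

% Remove the leading and trailing whitespace, including newlines.
first = strip(first);
```

Calling strip afterward is optional. It only trims the whitespace that surrounds the extracted block.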

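For text files containing multiple documents separated by newline characters, use the readlines function, which returns each line of the file as one element of a string array.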

filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
    "From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
    "When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
    "Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."
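To try readlines without the example file, you can write a small file first. This is a sketch under stated assumptions: the documents and the temporary file name are made up, and writelines requires a MATLAB release that provides it (R2022a or later). The "EmptyLineRule" option is set to "skip" so that any empty line, for example one produced by a trailing newline at the end of the file, does not become an empty document.

```matlab
% Hypothetical documents, one per line of the file.
docs = ["first document"; "second document"; "third document"];

% Write the documents to a temporary file (assumed file name).
tmp = tempname + ".txt";
writelines(docs,tmp);

% Read one document per line and skip empty lines.
str = readlines(tmp,"EmptyLineRule","skip");

% Clean up the temporary file.
delete(tmp);
```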

Microsoft Word Documents

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
     
       And dig deep trenches in thy beauty's field,
     
       Thy youth's proud livery so gazed on now,
     
       Will be a tatter'd weed of small worth held:
     
       Then being asked, where all thy beauty lies,
     
       Where all the treasure of thy lusty days;
     
       To say, within thine own deep sunken eyes,
     
       Were an all-eating shame, and thriftless praise.
     
       How much more praise deserv'd thy beauty's use,
     
       If thou couldst answer 'This fair child of mine
     
       Shall sum my count, and make my old excuse,'
     
       Proving his beauty by succession thine!
     
         This were to be new made when thou art old,
     
         And see thy blood warm when thou feel'st it cold.
     
      "

The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "
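The replace call above handles exactly two consecutive newline characters. If a document mixes runs of two, three, or more newlines, one possible alternative is a regular expression. This is a sketch with made-up input text, not the method used in the original example.

```matlab
% Hypothetical text with runs of two and three newline characters.
txt = "line one" + newline + newline + "line two" + ...
    newline + newline + newline + "line three";

% Collapse every run of two or more newlines into a single newline.
collapsed = regexprep(txt,"\n{2,}","\n");
```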

PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Documents

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". In this PDF, a space precedes each newline character.

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
    " 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
     
     
      
       "

PDF Forms

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."

HTML

Extract text from HTML files, HTML code, and the web.

HTML Files

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
    "
     Unthrifty loveliness, why dost thou spend
     Upon thy self thy beauty's legacy?
     Nature's bequest gives nothing, but doth lend,
     And being frank she lends to those are free:
     Then, beauteous niggard, why dost thou abuse
     The bounteous largess given thee to give?
     Profitless usurer, why dost thou use
     So great a sum of sums, yet canst not live?
     For having traffic with thy self alone,
     Thou of thy self thy sweet self dost deceive:
     Then how when nature calls thee to be gone,
     What acceptable audit canst thou leave?
     Thy unused beauty must be tombed with thee,
     Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox
     
     Analyze and model text data 
     
     Release Notes
     
     PDF Documentation
     
     Release Notes
     
     PDF Documentation
     
     Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
     
     Get Started
     
     Learn the basics of Text Analytics Toolbox
     
     Text Data Preparation
     
     Import text data into MATLAB® and preprocess it for analysis
     
     Modeling and Prediction
     
     Develop predictive models using topic models and word embeddings
     
     Display and Presentation
     
     Visualize text data and models using word clouds and text scatter plots
     
     Language Support
     
     Information on language support in Text Analytics Toolbox'

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and then use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with the element name "A".

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="skip_link sr-only" href="#skip_link_anchor">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>

str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string
    "Skip to content"
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Get MATLAB"
    ""

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
    "#skip_link_anchor"
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
    "https://www.mathworks.com?s_tid=gn_logo"
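The same steps also work without a network connection on a string of HTML code. The following is a minimal sketch: the HTML snippet and its URLs are made up for illustration, and the functions require Text Analytics Toolbox. It pairs the text of each hyperlink with its target in a table.

```matlab
% Hypothetical HTML code with two hyperlinks.
code = "<html><body><p>See <a href=""https://example.com/a"">first</a> " + ...
    "and <a href=""/b"">second</a>.</p></body></html>";

% Parse the code and find the hyperlink nodes.
tree = htmlTree(code);
links = findElement(tree,"A");

% Extract the link text and the link targets.
txt = extractHTMLText(links);
href = getAttribute(links,"href");

% Collect text and target side by side.
T = table(txt,href,'VariableNames',["Text","Target"]);
```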

CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.

Extract the table data from factoryReports.csv using the readtable function and view the first few rows of the table.

T = readtable('factoryReports.csv','TextType','string');
head(T)
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the Description column and view the first few strings.

str = T.Description;
str(1:10)
ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."
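To experiment with this workflow without factoryReports.csv, you can write a small table to a temporary CSV file and read it back. This is a sketch with made-up data: the rows and the temporary file name are assumptions for illustration. Setting 'TextType' to 'string' makes readtable return the text column as a string array instead of a cell array of character vectors.

```matlab
% Hypothetical report data with one text column and one numeric column.
T0 = table(["Pump is leaking."; "Fuse blown."],[120; 45], ...
    'VariableNames',["Description","Cost"]);

% Write the table to a temporary CSV file (assumed file name).
tmp = tempname + ".csv";
writetable(T0,tmp);

% Import the file with text columns as strings.
T = readtable(tmp,'TextType','string');
str = T.Description;

% Clean up the temporary file.
delete(tmp);
```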

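Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify the read function to be extractFileText, input this function to fileDatastore using a function handle.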

location = "exampleSonnet*.txt";
fds = fileDatastore(location,'ReadFcn',@extractFileText);

Loop over the files in the datastore and read each text file.

str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end

View the extracted text.

str
str = 4×1 string
    "  From fairest creatures we desire increase,↵  That thereby beauty's rose might never die,↵  But as the riper should by time decease,↵  His tender heir might bear his memory:↵  But thou, contracted to thine own bright eyes,↵  Feed'st thy light's flame with self-substantial fuel,↵  Making a famine where abundance lies,↵  Thy self thy foe, to thy sweet self too cruel:↵  Thou that art now the world's fresh ornament,↵  And only herald to the gaudy spring,↵  Within thine own bud buriest thy content,↵  And tender churl mak'st waste in niggarding:↵    Pity the world, or else this glutton be,↵    To eat the world's due, by the grave and thee."
    "  When forty winters shall besiege thy brow,↵  And dig deep trenches in thy beauty's field,↵  Thy youth's proud livery so gazed on now,↵  Will be a tatter'd weed of small worth held:↵  Then being asked, where all thy beauty lies,↵  Where all the treasure of thy lusty days;↵  To say, within thine own deep sunken eyes,↵  Were an all-eating shame, and thriftless praise.↵  How much more praise deserv'd thy beauty's use,↵  If thou couldst answer 'This fair child of mine↵  Shall sum my count, and make my old excuse,'↵  Proving his beauty by succession thine!↵    This were to be new made when thou art old,↵    And see thy blood warm when thou feel'st it cold."
    "  Look in thy glass and tell the face thou viewest↵  Now is the time that face should form another;↵  Whose fresh repair if now thou not renewest,↵  Thou dost beguile the world, unbless some mother.↵  For where is she so fair whose unear'd womb↵  Disdains the tillage of thy husbandry?↵  Or who is he so fond will be the tomb,↵  Of his self-love to stop posterity?↵  Thou art thy mother's glass and she in thee↵  Calls back the lovely April of her prime;↵  So thou through windows of thine age shalt see,↵  Despite of wrinkles this thy golden time.↵    But if thou live, remember'd not to be,↵    Die single and thine image dies with thee."
    "  Unthrifty loveliness, why dost thou spend↵  Upon thy self thy beauty's legacy?↵  Nature's bequest gives nothing, but doth lend,↵  And being frank she lends to those are free:↵  Then, beauteous niggard, why dost thou abuse↵  The bounteous largess given thee to give?↵  Profitless usurer, why dost thou use↵  So great a sum of sums, yet canst not live?↵  For having traffic with thy self alone,↵  Thou of thy self thy sweet self dost deceive:↵  Then how when nature calls thee to be gone,↵  What acceptable audit canst thou leave?↵    Thy unused beauty must be tombed with thee,↵    Which, used, lives th' executor to be."
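As an alternative to the explicit loop, fileDatastore can concatenate the results itself when you set 'UniformRead' to true and call readall. The following is a sketch under stated assumptions: the folder, file names, and file contents are made up for illustration, writelines requires R2022a or later, and extractFileText requires Text Analytics Toolbox. With 'UniformRead' set to true, the read function must return outputs that can be vertically concatenated, which holds here because extractFileText returns a string scalar for each file.

```matlab
% Create a temporary folder with two hypothetical text files.
folder = tempname;
mkdir(folder);
writelines("first file",fullfile(folder,"exampleSonnet1.txt"));
writelines("second file",fullfile(folder,"exampleSonnet2.txt"));

% Create the datastore and request vertically concatenated output.
location = fullfile(folder,"exampleSonnet*.txt");
fds = fileDatastore(location,'ReadFcn',@extractFileText,'UniformRead',true);

% Read all files at once into a string array, one element per file.
str = readall(fds);

% Clean up the temporary folder and its contents.
rmdir(folder,"s");
```

The loop in the original example remains the better fit when each file needs individual processing before it is stored.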
