htmlTree

Parsed HTML tree

Description

An htmlTree object represents a parsed HTML element or node. Extract the parts of interest using the findElement function or the Children property, and extract text using the extractHTMLText function.

Creation

Syntax

tree = htmlTree(code)

Description

tree = htmlTree(code) parses the HTML code in the string code and returns the resulting tree structure.
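As a minimal sketch of this syntax, the following parses a small literal HTML snippet rather than a downloaded page. The snippet and the variable names are illustrative assumptions and do not come from the examples later on this page.

```matlab
% Parse a small, hypothetical HTML snippet held in a string scalar.
code = "<html><body><h1>Title</h1><p>Some text.</p></body></html>";
tree = htmlTree(code);

% Extract the visible text of the parsed tree.
str = extractHTMLText(tree);
disp(str)
```

Parsing a short literal like this can be a convenient way to try out the object functions before working with a full web page.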

Tip

To parse XML code, use the readstruct function.

Input Arguments

`code` — HTML code
string array | character vector | cell array of character vectors

HTML code, specified as a string array, a character vector, or a cell array of character vectors.

Tips

To read HTML code from a web page, use webread.
To extract text from an HTML file, use extractFileText.

Example: "<a href='https://www.mathworks.com'>MathWorks</a>"

Data Types: char | string | cell
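The two routes mentioned in the tips can be sketched for a local file as follows. This is an illustration under stated assumptions: the temporary file and its contents are hypothetical, and fileread is used here (it is not named in the tips) to obtain the raw HTML code for htmlTree.

```matlab
% Write a small, hypothetical HTML file to a temporary location.
filename = [tempname '.html'];
fid = fopen(filename,'w');
fprintf(fid,'%s','<html><body><p>Hello from a file.</p></body></html>');
fclose(fid);

% Option 1: read the raw HTML code, then parse it into a tree.
code = fileread(filename);
tree = htmlTree(code);

% Option 2: extract the text directly, without building a tree.
str = extractFileText(filename);

% Clean up the temporary file.
delete(filename)
```

Option 1 is the one to use when the structure of the page matters (for example, to call findElement afterward); option 2 is shorter when only the text is needed.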

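Properties

`Children` — Direct descendants of element
`htmlTree` array

Direct descendants of the element, specified as an htmlTree array.

`Parent` — Parent node
`htmlTree` object

Parent node in the tree, specified as an htmlTree object.

If the HTML tree is a root node, then the value of Parent is missing.

`Name` — HTML element name
string scalar

HTML element name, specified as a string scalar.

For more information, see HTML Elements.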
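The three properties can be explored together on a small tree. The following is a sketch only: the HTML snippet and variable names are hypothetical, and the element names that are printed depend on how the software structures the tree in your release.

```matlab
% Parse a small, hypothetical document.
code = "<html><body><p>First</p><p>Second</p></body></html>";
tree = htmlTree(code);

% The root node has no parent, so its Parent value is missing.
rootHasNoParent = ismissing(tree.Parent);

% Walk the direct descendants of the root and print their element names.
children = tree.Children;
for k = 1:numel(children)
    disp(children(k).Name)
end

% Find the BODY element and step back up to its parent.
body = findElement(tree,"BODY");
parentName = body.Parent.Name;
```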

Object Functions

`findElement`	Find elements in HTML tree
`getAttribute`	Read HTML attribute of root node of HTML tree
`extractHTMLText`	Extract text from HTML
`ismissing`	Find HTML trees without values

Examples

Parse HTML Code

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

View the element name of the root node of the tree.

tree.Name

ans = 
"HTML"

View the child elements of the root node.

tree.Children

ans = 
  4×1 htmlTree:

    " "
    <HEAD><TITLE>Text Analytics Toolbox Documentation</TITLE><META charset="utf-8"/><META content="width=device-width, initial-scale=1.0" name="viewport"/><META content="IE=edge" http-equiv="X-UA-Compatible"/><LINK href="/includes_content/responsive/css/bootstrap/bootstrap.min.css" rel="stylesheet" type="text/css"/><LINK href="/includes_content/responsive/css/site6.css?201903" rel="stylesheet" type="text/css"/><LINK href="/includes_content/responsive/css/site6_lg.css?201903" media="screen and (min-width: 1200px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_md.css?201903" media="screen and (min-width: 992px) and (max-width: 1199px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_sm+xs.css?201903" media="screen and (max-width: 991px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_sm.css?201903" media="screen and (min-width: 768px) and (max-width: 991px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_…
    " "
    <BODY id="responsive_offcanvas"><!-- Mobile TopNav: Start --><DIV class="header visible-xs visible-sm" id="header_mobile" translate="no"><NAV class="navbar navbar-default" role="navigation"><DIV class="container-fluid"><DIV class="row"><DIV class="col-xs-12"><DIV class="navbar-header"><BUTTON class="navbar-toggle topnav_toggle" data-target="#topnav_collapse" data-toggle="collapse" type="button"><SPAN class="sr-only">Toggle Main Navigation</SPAN><SPAN class="icon-menu"/></BUTTON><A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A></DIV></DIV></DIV><DIV class="row visible-xs visible-sm"><DIV class="col-xs-12"><DIV class="navbar-collapse collapse" id="topnav_collapse"><UL class="nav navbar-nav" id="topnav"><LI class="headernav_login"><A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign…

Extract the text from the HTML tree using extractHTMLText.

str = extractHTMLText(tree)

str = 
    "Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data."

Find Elements in HTML Tree

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using the webread function.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

Find all the hyperlinks in the HTML tree using findElement. The hyperlinks are the nodes with element name "A".

selector = "A";
subtrees = findElement(tree,selector);

View the first few subtrees.

subtrees(1:10)

ans = 
  10×1 htmlTree:

    <A class="skip_link sr-only" href="#content_container">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>

Extract the text from the subtrees using extractHTMLText. The result contains the link text extracted from each link on the page.

str = extractHTMLText(subtrees);
str(1:10)

ans = 10×1 string
    "Skip to content"
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Get MATLAB"
    ""

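The empty strings in the output above belong to links that contain only an image. As a small sketch, such entries can be dropped with a logical index. The inline HTML below is a hypothetical stand-in for the downloaded page so that the snippet runs without network access.

```matlab
% Hypothetical page fragment: two text links and one image-only link.
code = "<p><a href='/a'>Products</a>" + ...
       "<a href='/logo'><img src='logo.svg' alt='Logo'/></a>" + ...
       "<a href='/b'>Support</a></p>";
subtrees = findElement(htmlTree(code),"A");

% Extract the link text, one string per hyperlink.
str = extractHTMLText(subtrees);

% Keep only the links that have non-empty text.
linkText = str(str ~= "");
disp(linkText)
```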
Get Attributes of HTML Tags

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

Find all the hyperlinks in the HTML tree using findElement. The hyperlinks are the nodes with element name "A".

selector = "A";
subtrees = findElement(tree,selector);
subtrees(1:10)

ans = 
  10×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>

Get the hyperlink references using getAttribute. Specify the attribute name "href".

attr = "href";
str = getAttribute(subtrees,attr);
str(1:10)

ans = 10×1 string array
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus"
    "https://www.mathworks.com/store?s_cid=store_top_nav&s_tid=gn_store"

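Once the references are available as a string array, ordinary string functions can filter them. The following sketch keeps only absolute HTTPS links. The inline HTML is hypothetical, and the handling of links without an href attribute (treated here as either missing or empty) is an assumption rather than documented behavior from this page.

```matlab
% Hypothetical fragment: an absolute link, an in-page link, and an
% anchor without an href attribute.
code = "<p><a href='https://www.mathworks.com/products.html'>Products</a>" + ...
       "<a href='#top'>Back to top</a>" + ...
       "<a name='anchor'>No reference</a></p>";
subtrees = findElement(htmlTree(code),"A");
hrefs = getAttribute(subtrees,"href");

% Drop entries with no usable href (assumed to be missing or empty).
hrefs = hrefs(~ismissing(hrefs) & hrefs ~= "");

% Keep only absolute HTTPS links.
external = hrefs(startsWith(hrefs,"https://"));
disp(external)
```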
Convert Parsed HTML Code to String

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using the webread function.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using the htmlTree function.

tree = htmlTree(code);

Find all the paragraphs in the HTML tree using the findElement function. The paragraphs are the nodes with element name "P".

subtrees = findElement(tree,"P");

Convert the subtrees to strings using the string function.

str = string(subtrees)

str = 18×1 string
    "<P class="h1">↵  <A href="../index.html">Help Center</A>↵</P>"
    "<P>Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.</P>"
    "<P>Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.</P>"
    "<P>Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.</P>"
    "<P class="category_desc">Learn the basics of Text Analytics Toolbox</P>"
    "<P class="category_desc">Import text data into MATLAB<SUP>®</SUP> and preprocess it for analysis</P>"
    "<P class="category_desc">Develop predictive models using topic models and word embeddings</P>"
    "<P class="category_desc">Visualize text data and models using word clouds and text scatter plots</P>"
    "<P class="category_desc">Information on language support in Text Analytics Toolbox</P>"
    "<P>You clicked a link that corresponds to this MATLAB command:</P>"
    "<P>Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.</P>"
    "<P class="h1 icon-globe icon_color_secondary" id="country-unselected-title">Select a Web Site</P>"
    "<P>Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: <STRONG class="recommended-country"/>.</P>"
    "<P>You can also select a web site from the following list:</P>"
    "<P>Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.</P>"
    "<P class="text-center">↵  <A href="#" class="worldwide_link">Contact your local office</A>↵</P>"
    "<P class="copyright" translate="no">© 1994-2024 The MathWorks, Inc.</P>"
    "<P>↵  <EM>Join the conversation</EM>↵</P>"

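More About

HTML Elements

A typical HTML element contains these components:

Element name – Name of the HTML tag. The element name corresponds to the Name property of the HTML tree.
Attributes – Additional information about the tag. HTML attributes have the form name="value", where name and value denote the attribute name and the attribute value, respectively. Attributes appear inside the opening HTML tag. To get attribute values from an HTML tree, use getAttribute.
Content – Content of the element. The content appears between the opening and closing HTML tags. The content can be text data or nested HTML elements. To extract the text from an htmlTree object, use extractHTMLText. To get the nested HTML elements of an htmlTree object, use the Children property.

For example, the HTML element <a href="https://www.mathworks.com">Home</a> consists of these components:

Component		Value	Description
Element name		`a`	Element is a hyperlink
Attribute	Attribute name	`href`	Hyperlink reference
Attribute	Attribute value	`"https://www.mathworks.com"`	Hyperlink reference value
Content		`Home`	Text to display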
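The components in this table can be read back from a parsed tree. The following sketch uses the same example element; the variable names are illustrative. Because the parsed tree may wrap the element in additional nodes, the hyperlink is located with findElement rather than taken from the root.

```matlab
% Parse the example element from the table above.
code = '<a href="https://www.mathworks.com">Home</a>';
tree = htmlTree(code);

% Locate the hyperlink element inside the parsed tree.
link = findElement(tree,"A");

% Read each component of the element.
name = link.Name;                    % element name
href = getAttribute(link,"href");    % attribute value
linkText = extractHTMLText(link);    % content (text to display)
```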

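Version History

Introduced in R2018b

R2021a: `htmlTree` uses a different algorithm to restructure malformed HTML

When you create an htmlTree object, the software automatically restructures malformed input HTML code so that it has a valid structure. The restructuring process includes adding, removing, and editing elements, and rearranging the tree structure. Starting in R2021a, the software uses an updated algorithm to restructure malformed HTML. Because of this change, htmlTree objects created in R2021a or later can differ in size, structure, and content from those created in previous releases.

In R2021a and later, when you load an htmlTree object from a MAT file created in R2020b or earlier, the software automatically restructures the htmlTree object using the same algorithm that it uses to create htmlTree objects. When you load an htmlTree object from a MAT file created in R2021a or later, the software does not restructure the htmlTree object.

This table highlights some of the key steps of the restructuring process.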

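Step	Change in Behavior
Automatically add head and title elements.	Starting in R2021a, when you create an `htmlTree` object from HTML code, the software automatically inserts missing `<HEAD>`, `<TITLE>`, and other elements. In previous versions, the `htmlTree` object contained these elements only if they were present in the input code. When you load an `htmlTree` object from a MAT file created in a previous release, the software automatically inserts the `<HEAD>` and `<TITLE>` elements. When you load an `htmlTree` object from a MAT file created in R2021a or later, the software does not automatically insert these elements.
Automatically add missing elements.	Starting in R2021a, when you create an `htmlTree` object from HTML code in which parent and child elements are inconsistent, the software automatically inserts the missing elements. For example, if an `<li>` (list item) element has no parent `<ul>` (unordered list) or `<ol>` (ordered list) element, then the software automatically adds a `<ul>` element to make the HTML valid. As a result, the output can differ from that of previous releases. When you load an `htmlTree` object from a MAT file created in a previous release, the software automatically inserts the missing elements. When you load an `htmlTree` object from a MAT file created in R2021a or later, the software does not automatically insert missing elements.
Discard parts of malformed code.	When you create an `htmlTree` object from malformed HTML code, the software might discard some of the text. For example, if the input code is the string `"<div>a</"`, then the software discards the text `"a</"`.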
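The steps in this table can be observed directly on small malformed inputs. The following is a sketch under the assumption that you are running R2021a or later. The inputs are taken from the table, and no expected output is shown because the restructured tree depends on the release.

```matlab
% A list item with no enclosing list, and no HEAD or TITLE element.
tree = htmlTree("<li>item</li>");

% Show the restructured markup and count the elements of interest.
disp(string(tree))
fprintf("HEAD elements: %d\n",numel(findElement(tree,"HEAD")));
fprintf("UL elements:   %d\n",numel(findElement(tree,"UL")));

% Truncated code: part of the text might be discarded.
truncated = htmlTree("<div>a</");
disp(extractHTMLText(truncated))
```

Comparing the printed markup with the input is a quick way to check how a given release restructures code before relying on a particular tree shape.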
