How to access itemprop = "name" from within a data structure in HTML code using Matlab?

Question

1 vote

HTML code

<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>
<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>

I want to extract only the text of the element with itemprop="name".

Using the selector feature of Text Analytics Toolbox, I can write

selector = "DIV.itemHeader"

itemHeader is the class of the element that contains both of those divs, so the text of both divs is extracted.

I only want the text from the div with itemprop="name".

How do I go about doing that?

3 Comments

N/A on 26 Mar 2019

Yup, that's correct

Walter Roberson on 26 Mar 2019

Unfortunately I do not have that toolbox to test with.

My own implementation would probably be to use regexp with named tokens and the 'names' option.


Answer 1

TADA on 26 Mar 2019

Edited: TADA on 27 Mar 2019

0 votes

I don't have the toolbox you mentioned, but it most likely uses XPath to parse the HTML...

I think the best options are XPath or regular expressions.

As far as I know, to use XPath in MATLAB you have to use Java classes, but regular expressions are built into MATLAB and they are very convenient.

The regex pattern could be something like this:

str = ['<div class="itemName largestFont" itemprop="name"> Information which I want to extract </div>'...
'<div class="itemCategory largeFont"><a href="/somerandomwebsitelink"> Information which I dont need </a></div>'];
match = regexp(str, '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>', 'names')
match = 
  struct with fields:
    data: ' Information which I want to extract '
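
If the page contains more than one matching element, the same pattern can be reused. The following is only a minimal sketch: the HTML string is a made-up sample, not the original page, and it assumes the surrounding whitespace should be trimmed from each result.

```matlab
% Hypothetical sample input: two itemprop="name" divs and one unrelated div.
html = ['<div class="itemName" itemprop="name"> First item </div>' ...
        '<div class="itemCategory"><a href="/x"> Not needed </a></div>' ...
        '<div itemprop="name" class="itemName"> Second item </div>'];

% Same pattern as in the answer above.
pattern = ['<div\s+(\w+="[^"]*"\s+)*itemprop="name"' ...
           '(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>'];

% With several matches, 'names' returns a 1-by-N struct array.
matches = regexp(html, pattern, 'names');

% Collect the data field of every match and trim surrounding whitespace.
names = strtrim({matches.data});
disp(names)
```

Because the pattern allows other attributes on either side of itemprop, it matches regardless of where itemprop appears in the tag, but it still assumes the div contains plain text only (no nested tags).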

11 Comments

N/A on 28 Mar 2019

Edited: N/A on 28 Mar 2019

NOTE: When the functions were run, the statements did not end with semicolons. Please ignore the semicolons when reading the outputs below.

When I run this

function [name] = getTitle(tree)
    selector = "DIV.itemHeader";
    nameSection = findElement(tree, selector);
    name = extractHTMLText(nameSection);
end

I get this in the command window

name = 
     Information I want
     
     Information I don't want

When I run this

selector = "DIV.itemHeader.itemName";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);

I get this in the command window

name =
  0×1 empty double column vector

When I run this

selector = "DIV.itemHeader[itemprop=""name""]";
nameSection = findElement(tree, selector);
name = extractHTMLText(nameSection);

I get this in the command window

Error using htmlTree/findElement (line 99)
Attribute selector 'itemprop="name"' is not supported.

When I run this

function name = getTitle(tree)
    selector = "DIV.itemHeader";
    nameSection = findElement(tree, selector);
    html = extractHTMLText(nameSection);
    
    regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
    name = regexp(html, regexPattern, 'names');
end
     

I get this in the command window

name = 
  0×0 empty struct array with fields:
    data
    

I want the output of

name = regexp(title, regexPattern, 'names');

to give me this in the command window

name = 
     Information I want
     

TADA on 28 Mar 2019

This was a real long shot:

selector = "DIV.itemHeader[itemprop=""name""]";

The regex doesn't work because extractHTMLText returns the extracted text as an array of strings, not the HTML source, so there are no tags left for the pattern to match.
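
A small sketch of that difference, using only regexp: the variable textOnly is an assumed stand-in for what extractHTMLText returns, and rawHtml is a shortened copy of the div from the question.

```matlab
% Same pattern as in the answer above.
pattern = ['<div\s+(\w+="[^"]*"\s+)*itemprop="name"' ...
           '(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>'];

% HTML source: the tags are still present, so the pattern can match.
rawHtml = '<div class="itemName largestFont" itemprop="name"> Information I want </div>';

% Assumed shape of the extracted text: the tags have been stripped.
textOnly = ' Information I want ';

fromHtml = regexp(rawHtml, pattern, 'names');   % struct with field data
fromText = regexp(textOnly, pattern, 'names');  % empty struct array

disp(strtrim(fromHtml.data))
disp(isempty(fromText))
```

The second call finds nothing because the pattern starts with a literal "<div", which explains the empty struct array reported above.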

Can you post your HTML document so I can at least try the CSS selectors?

Also, I made a mistake with the selector earlier.

Try this instead:

% this CSS selector should be valid if I got the structure of your HTML right
% and if MATLAB handles CSS selectors correctly
selector = "DIV.itemHeader .itemName";

or this (probably won't work either):

selector = "DIV.itemHeader [itemprop=""name""]"

or maybe this (I'm not sure, since htmlTree is only available starting in R2018b and I don't have it):

function name = getTitle(tree)
    selector = "DIV.itemHeader";
    nameSection = findElement(tree, selector);
    html = nameSection.Content; % hopefully this will return the inner HTML
    
    regexPattern = '<div\s+(\w+="[^"]*"\s+)*itemprop="name"(\s+\w+="[^"]*")*\s*>(?<data>[^<]*)</div>';
    match = regexp(html, regexPattern, 'names');
    
    name = match.data;
end

N/A on 28 Mar 2019

HALLELUJAH! :D

TADA on 28 Mar 2019

Cheers

Answer 2

Sean de Wolski on 28 Mar 2019

Edited: Sean de Wolski on 28 Mar 2019

0 votes

Using htmlTree, this is trivial:

tree = htmlTree(fileread('yourfile.html'))
div = tree.findElement('div')
item = div.getAttribute("itemprop")
names = item == "name"
div(names).extractHTMLText

4 Comments

TADA on 28 Mar 2019

Neither Walter Roberson nor I (as far as I know) work for MathWorks... I'd gladly take that raise though :)

Sean de Wolski on 29 Mar 2019

@TADA, we're always hiring into MathWorks and have a distributor in Israel who may or may not be looking for MATLAB users.

@Shivam, this returns exactly what you want from your comment above:

s = string(webread("https://beta.trollandtoad.com/yugioh/invasion-of-chaos-ioc-unlimited-singles/manticore-of-darkness-ioc-067-ultra-rare-unlimited/1155511", weboptions('Timeout', 15)));
%%
tree = htmlTree(s)
%%
div = tree.findElement('div')
%%
item = div.getAttribute("itemprop")
%%
names = item == "name"
%%
div(names).extractHTMLText
ans = 
    "Manticore of Darkness - IOC-067 - Ultra Rare Unlimited"
