HTML Page source info

Question

Hello, one often comes across a series of numbered webpages such as

basePage.html?page=2
basePage.html?page=3

and so forth, wherein there are several fields identified by their labels:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>

and so on.

How can the "textOfInterest" of one particular parameter, say, Parameter2, of all the Name*, of all the pages,

basePage.html?page=1toInf

be taken (outputted/exported) into one text file, say, Parameter2.txt?

The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.

Thanks.

6 Comments (4 older comments not shown)

b on 1 Dec 2020

Initially, I was hesitant to download this file because I thought it was religious or something of the sort. But I am glad I downloaded it. It is immensely useful and 'on the money' for this thread.

My interest is in the function button_Callback in BibleDownloader.m. The webpage is saved in the variable called 'data'. Since finding <div class="pagination"> is right in the ballpark of my initial query, I was excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, execution doesn't seem to reach this branch, since I was unable to retrieve either 'data' or the indices idx*. All of these indices (idx, idx2 and idx3) would be useful for me. How can I get to this part of the code and access them?

Also, perhaps you can suggest one regexp line to pull out 'textOfInterest' from

<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

and better still, if you already have something like the BibleDownloader m-file that uses regexp to extract the text between a <div class> and </div> pair, that would be great.

Rik on 1 Dec 2020

Edited: Rik on 1 Dec 2020

The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.

Did you try adapting any of the code? I'll post some code as an answer.

Answer 1

Rik on 1 Dec 2020

One possibility with strfind:

% d is a char vector containing the HTML source of one page
close_div=strfind(d,'</div>');%positions of all closing div tags
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %first closing tag at or after the start (>= so an empty field still works)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end
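
To cover the "all pages into one text file" part of the question, the same extraction can be repeated per page and the results collected before writing. This is only a sketch: it assumes the page sources have already been downloaded into a cell array, here called pages (fetching them, for example with webread, is not shown and depends on the site). The sample contents, the variable names and the output location in tempdir are placeholders.

```matlab
% Placeholder page sources; in practice each cell would hold one downloaded page.
pages={['<label>Parameter2 : </label> <div class="category-related">a!@#</div>'...
        '<label>Parameter2 : </label> <div class="category-related">b$%</div>'],...
        '<label>Parameter2 : </label> <div class="category-related">c</div>'};
param=2;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
all_texts={};
for p=1:numel(pages)
    d=pages{p};
    close_div=strfind(d,'</div>');
    position=strfind(d,pat)+numel(pat);   % start of each text of interest
    for n=1:numel(position)
        end_of_text=close_div(close_div>=position(n));
        all_texts{end+1}=d(position(n):end_of_text(1)-1);
    end
end
% Write one entry per line. The text is passed as data to '%s', not used as
% the format itself, so characters such as % and \ are written literally.
out_file=fullfile(tempdir,sprintf('Parameter%d.txt',param)); % placeholder location
fid=fopen(out_file,'w');
for k=1:numel(all_texts)
    fprintf(fid,'%s\n',all_texts{k});
end
fclose(fid);
```

Passing the extracted text as an argument to '%s' matters here because the fields may contain % characters, which would otherwise be read as format specifiers.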

Or with a regexp:

d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(',... % use parentheses to capture a token
    '[^<]*',... % this matches any number of characters other than <
    ')',...
    '</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)

You can also adapt the expression to use a lookahead for </div>, i.e. (?=</div>), so you can use a lazy .*? instead of [^<]*. Note that a greedy .* would run on to the last </div> in the text.
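
A minimal sketch of that lookahead variant, assuming the same layout as in d above. The sample string and variable names are made up for illustration; the second field deliberately contains a < character, which [^<]* would not get past.

```matlab
% Hypothetical sample: the second value contains '<', which [^<]* cannot cross.
d=['<label>Parameter1 : </label> <div class="category-related">abc!@#</div>'...
   '<label>Parameter2 : </label> <div class="category-related">a<b$%</div>'];
RE=['<label>Parameter\d',...
    ' : </label> <div class="category-related">',...
    '(.*?)',...        % lazy: take as few characters as possible
    '(?=</div>)'];     % lookahead: stop right before the closing tag
t=regexp(d,RE,'tokens');
t=[t{:}];              % flatten the nested cells to a 1x2 cell of char
disp(t)
```

The trade-off is that the lazy version stops at the first </div> after the label, so it still assumes the text of interest does not itself contain a nested div.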

8 Comments (6 older comments not shown)

b on 1 Dec 2020

Thank you.

But I have run into a problem with the following part:

I am trying to extract two parameters at the same time: Parameter1 and Parameter2. It often happens that Parameter1 is present but Parameter2 is missing. That is, the structure is like this:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

The same problem occurs if I try to extract all three parameters.

When all three parameters are extracted, the objective is to get ' ' (no value) where a parameter is missing, rather than skipping it completely. Skipping it would cause a mismatch between names and values; with a blank placeholder, the corresponding entry in the exported text file is simply blank.

In the first (strfind) code, I tried to replicate the 'for loop' three times for the three parameters, but quickly ran into problems.

b on 2 Dec 2020

Thanks for the link.

I downloaded readfile from GitHub. 'elements' seems promising, except: what are those ->->-> arrows in front of all the fields of interest?! Anyway, I am glad it has brought me to this point.

But the situation is the same with all three approaches: when the mail field is missing, how do I write 'NULL' in the output file and continue with the loop?

Name1    mail1
Name2    missing
Name3    mail3
Name4    mail4

The strfind and regexp approaches give

Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'

and

Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'

How can I bypass the 'for loop' and at the same time print 'NULL' in the corresponding Excel row-column entry? In this example, (row=2,col=2) should be 'NULL', and (row=3,col=2) should be Parameter{2}.

It is not a question of 'skipping if not found', because numel(position) has already been evaluated: it is 4 here for the Name field and 3 for the Parameter. So it seems to be hardcoded.

Rik on 2 Dec 2020

Those arrows are probably newline characters. What release are you using?

I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
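
A minimal sketch of parsing each element separately, assuming every entry starts with an <h2 class="category-heading"> tag as in the question. The sample data, the variable names and the 'NULL' placeholder are illustrative choices, not part of the earlier code.

```matlab
% Hypothetical sample in which Name2 has no Parameter2 field.
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">p1a</div>'...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">p1b</div>'...
   '<h2 class="category-heading">Name3</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'];
% Cut the page into one chunk per heading: chunk n runs up to the next heading.
start=strfind(d,'<h2 class="category-heading">');
stop=[start(2:end)-1 numel(d)];
N=numel(start);
Name=cell(N,1);Parameter=cell(N,1);
RE_name='<h2 class="category-heading">([^<]*)</h2>';
RE_par='<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
for n=1:N
    chunk=d(start(n):stop(n));
    tok=regexp(chunk,RE_name,'tokens','once');
    Name{n}=tok{1};
    tok=regexp(chunk,RE_par,'tokens','once');
    if isempty(tok)
        Parameter{n}='NULL';   % field missing for this entry
    else
        Parameter{n}=tok{1};
    end
end
disp([Name Parameter])
```

Because Name and Parameter are filled inside the same loop iteration, row n of one always belongs to row n of the other, even when a field is absent.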

Answer 2

b on 3 Dec 2020

That is exactly how I am doing it. By parsing it separately, there is no way to correlate which Name-field has the corresponding Mail-field missing. It parses all the Name-fields, then it parses all the mail-fields, as a sequential process.

What modification should be made to the code so that it prints 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail-fields?
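
One way to get those index values while still parsing the two fields in separate passes is to compare match positions: each Mail-field belongs to the last Name-field that starts before it. The following is a sketch under that assumption (and assuming no Mail-field appears before the first heading); the sample data and variable names are made up to mirror the Name1 to Name4 example.

```matlab
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<h2 class="category-heading">Name3</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'...
   '<h2 class="category-heading">Name4</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail4</div>'];
RE_par='<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
name_pos=strfind(d,'<h2 class="category-heading">');
[tok,par_pos]=regexp(d,RE_par,'tokens','start');
tok=[tok{:}];                        % flatten to a 1x3 cell of char
% owner(k) = number of headings that start before match k = index of its Name
owner=arrayfun(@(p) sum(name_pos<p),par_pos);
Parameter=repmat({'Not Found'},1,numel(name_pos));
Parameter(owner)=tok;                % put each value in its owner's slot
missing_idx=setdiff(1:numel(name_pos),owner);
```

Here missing_idx lists the Name entries without a Mail-field, and Parameter has one slot per Name, so it can be written out next to the names without any offset.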

3 Comments (1 older comment not shown)

b on 3 Dec 2020

I am overwhelmed by the way you have patiently worked with me on this thread. I think I will close this elaborate thread here, but not before posting this limerick:

There was once a man named Rik, 
Who wrote matlab codes so quick, 
To the topic, they were relevant
The codes themselves so elegant, 
His m-files, sir, were completely sick!

Enjoy your freedom from this thread.

Rik on 3 Dec 2020

You're welcome (and thanks for the limerick XD).

If you have a follow-up question, feel free to post a link to it here.
