MATLAB Answers


I want to extract the page buttons/widgets in a website using URLREAD.

Asked by Ajpaezm on 13 Sep 2017
Latest activity: edited by Cedric Wannaz on 14 Sep 2017
I want to learn the common expression for the buttons/widgets that contain the page numbers of a catalog, e.g. like on this website. In this capture you'll see the numbers I'd like to get using the urlread command.
Do you know how to do this? You'd help me A LOT if you can. I already tried printing everything into a .txt file, but I couldn't write the whole HTML code into it. My plan was to look for the common expression manually, but the full output of urlread never made it into the .txt file.
Thanks a lot,
Aquiles
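One possible reason the full HTML did not reach the .txt file (an assumption, since the code that wrote the file is not shown) is that the HTML was passed to fprintf as the format string, so any % in the markup is read as the start of a format specifier instead of being written literally. A minimal sketch that writes the text verbatim, using a short made-up string in place of the urlread output and a temporary file name:

```matlab
% Made-up HTML fragment containing '%' characters (stands in for urlread output).
html = '<td width="50%">100% of the markup</td>' ;

% Write the text as data ('%s'), not as the format string itself.
outFile = [tempname, '.txt'] ;
fid = fopen( outFile, 'w' ) ;
fprintf( fid, '%s', html ) ;
fclose( fid ) ;

% Read the file back to check that nothing was lost.
roundTrip = fileread( outFile ) ;
```

If the goal is only to save a page to disk, urlwrite( url, filename ) does that in one call.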

3 Comments

Walter Roberson
14 Sep 2017
The HTML for that section is
<ul class='pagination'>
<li class='disabled'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=0'> < </a></li>
<li class='active'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=6'>6</a></li>
<li class='disabled'><span>...</span></li>
<li><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=83'>83</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'> > </a></li>
</ul>
So you want to look for index.php and page=\d+
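A minimal sketch of that search in MATLAB, run on a shortened, hand-copied fragment of the markup above rather than on the live page (the names sample, pageIds and lastPage are illustrative, and the query string is abbreviated):

```matlab
% Shortened copy of three pagination links (query string abbreviated).
sample = [ ...
    '<li class=''active''><a href=''/en/index.php?f=2222&limit=100&page=1''>1</a></li>', ...
    '<li class=''''><a href=''/en/index.php?f=2222&limit=100&page=2''>2</a></li>', ...
    '<li><a href=''/en/index.php?f=2222&limit=100&page=83''>83</a></li>' ] ;

% Capture the digits after "page=" in every href that points to index.php.
tokens = regexp( sample, 'index\.php[^'']*[?&]page=(\d+)', 'tokens' ) ;

% Each match is a 1x1 cell holding the captured text; convert to numbers.
pageIds = cellfun( @(t) str2double( t{1} ), tokens ) ;
lastPage = max( pageIds ) ;
```

On the full pagination block this would also pick up page=0 and the repeated page=2 from the arrow links, so taking the maximum is what gives the number of pages.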
Ajpaezm
14 Sep 2017
THANK YOU!
While I was writing "How did you do it?", I remembered Google Chrome had a source code viewer. It was that easy.
Thanks anyways for your time and help! :)
Walter Roberson
14 Sep 2017
Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.


1 Answer

Answer by Cedric Wannaz on 14 Sep 2017
Edited by Cedric Wannaz on 14 Sep 2017
Accepted Answer

Once you start clicking through the pages, the page ID appears in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
where it is the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
end
Then you probably want to parse the HTML to get the table data, and you can use regular expressions for this. Practicing on page 1:
pageId = 1 ;
url = sprintf( '%s%d', urlBase, pageId ) ;
html = urlread( url ) ;
% Named tokens: each (?<name>...) group becomes a field of the output struct.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
           '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
           '</td>\s*<td>(?<currency>[^<]+)'] ;
data = regexp( html, pattern, 'names' ) ;
With that you get:
>> data
data =
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
>> data(1)
ans =
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
html_ext = urlread( data(1).externalUrl ) ;
pattern_ext = '...' ;
data_ext = regexp( html_ext, pattern_ext, ... ) ;
I'll let you develop that part though! Putting everything together, you get a crawler/parser for the whole thing:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
           '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
           '</td>\s*<td>(?<currency>[^<]+)'] ;
pattern_ext = '...' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
        html_ext = urlread( data(productId).externalUrl ) ;
        data_ext = regexp( html_ext, pattern_ext, ... ) ;
        % Do something.
    end
end
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
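One way the "Do something" step might look is collecting the per-page results into a single struct array. The sketch below is only an illustration under stated assumptions: fetchPage is a hypothetical stand-in for urlread that returns made-up markup, so it runs offline, and the simplified pattern is not the one for the real site.

```matlab
% Hypothetical stand-in for urlread: returns made-up markup for a page ID.
fetchPage = @(pageId) sprintf( '<td>SYM%d</td><td>USD</td>', pageId ) ;

% Simplified pattern for the made-up markup (not the real site's table).
pattern = '<td>(?<symbol>[^<]+)</td><td>(?<currency>[^<]+)</td>' ;

nPages = 3 ;
perPage = cell( 1, nPages ) ;              % one struct array per page
for pageId = 1 : nPages
    html = fetchPage( pageId ) ;
    perPage{pageId} = regexp( html, pattern, 'names' ) ;
end

% Concatenate the per-page struct arrays into one 1xN struct array.
allData = [perPage{:}] ;
```

Storing each page in a cell and concatenating once at the end avoids growing the struct array inside the loop.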
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
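As a small offline illustration of the named-token syntax, the pattern from the answer can be tried on two hand-written table rows. The rows, the example.com URLs and the second company are made up for this sketch and only mimic the shape that the pattern expects; they are not copied from the site.

```matlab
% Two made-up table rows shaped like the markup that the pattern expects.
rows = [ ...
    '<tr><td>AT</td> <td><a href="javascript:NewWindow(''https://example.com/at'',''Details'');">', ...
    'ATLANTIC POWER CORP</a></td> <td>AT</td> <td>USD</td></tr>', ...
    '<tr><td>EX</td> <td><a href="javascript:NewWindow(''https://example.com/ex'',''Details'');">', ...
    'EXAMPLE CORP</a></td> <td>EX</td> <td>USD</td></tr>' ] ;

% Pattern from the answer: every (?<name>...) group becomes a struct field.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
           '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
           '</td>\s*<td>(?<currency>[^<]+)'] ;

data = regexp( rows, pattern, 'names' ) ;   % one struct element per matched row
symbols = { data.symbol } ;                 % gather one field across all rows
```

Each matched row becomes one element of data, and the { data.field } syntax gathers a single field across all rows.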
