I want to extract the page buttons/widgets in a website using URLREAD.

Ajpaezm on 13 Sep 2017
Edited: Cedric on 14 Sep 2017
I want to learn the common expression (pattern) for the buttons/widgets that contain the page numbers of a catalog, e.g. as on this website. The capture shows the numbers I'd like to get using the URLREAD command.
Do you know how to do this? You'd help me A LOT if you can. I already tried printing everything into a .txt file, but I couldn't write the whole HTML code into it. My plan was to look for the common expression manually, but I couldn't print the whole output of URLREAD into the .txt file.
Thanks a lot,
Aquiles
3 Comments
Ajpaezm on 14 Sep 2017
THANK YOU!
While I was writing "How did you do it?", I remembered Google Chrome had a source code viewer. It was that easy.
Thanks anyways for your time and help! :)
Walter Roberson on 14 Sep 2017
Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.


Accepted Answer

Cedric on 14 Sep 2017
Edited: Cedric on 14 Sep 2017
When you start clicking on pages, the page ID is in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
You can see it as the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
end
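Rather than hard-coding the last page (83 in the loop above), the page links themselves can be scanned for the `page=` parameter. This is only a minimal sketch: it assumes the pagination links in the HTML carry the same `page=N` parameter as the URL, and the `html` snippet below is a made-up stand-in for the URLREAD output, not the real page source.

```matlab
% Hypothetical stand-in for html = urlread( url ) ; the real markup may differ.
html = ['<a href="index.php?f=2222&exch=amex&page=1">1</a>', ...
        '<a href="index.php?f=2222&exch=amex&page=2">2</a>', ...
        '<a href="index.php?f=2222&exch=amex&page=83">83</a>'] ;

% Capture the digits that follow "page=" in every link.
tokens  = regexp( html, 'page=(\d+)', 'tokens' ) ;
pageIds = cellfun( @(t) str2double( t{1} ), tokens ) ;
nPages  = max( pageIds ) ;

% Build one URL per page, as in the loop above.
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
urls    = arrayfun( @(k) sprintf( '%s%d', urlBase, k ), 1 : nPages, 'UniformOutput', false ) ;
```

If the real links are formatted differently, only the `'page=(\d+)'` expression should need adjusting.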
Next, you probably want to parse the HTML to get the table data, and you can use regular expressions for this. Practicing on page 1:
pageId = 1 ;
url = sprintf( '%s%d', urlBase, pageId ) ;
html = urlread( url ) ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
'(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
'</td>\s*<td>(?<currency>[^<]+)'] ;
data = regexp( html, pattern, 'names' ) ;
With that you get:
>> data
data =
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
>> data(1)
ans =
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
html_ext = urlread( data(1).externalUrl ) ;
pattern_ext = '...' ;
data_ext = regexp( html_ext, pattern_ext, ... ) ;
I'll let you develop that part, though! Putting everything together, you get a crawler/parser for the whole thing:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
'(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
'</td>\s*<td>(?<currency>[^<]+)'] ;
pattern_ext = '...' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
        html_ext = urlread( data(productId).externalUrl ) ;
        data_ext = regexp( html_ext, pattern_ext, ... ) ;
        % Do something.
    end
end
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
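One possible way to fill in the "Do something" step is to collect the rows of every page into a single struct array. The sketch below is not part of the original answer and makes several assumptions: `fetchPage` and `sampleRow` are hypothetical stand-ins so that it runs offline (the sample row only imitates the table markup the pattern expects), and the page count and pause length are arbitrary.

```matlab
% fetchPage is a hypothetical stand-in so the sketch runs offline;
% for the real site use: fetchPage = @(url) urlread( url ) ;
sampleRow = ['<td>AT</td><td><a href="javascript:NewWindow(''https://example.com/at'',''Details'');">', ...
             'ATLANTIC POWER CORP</a></td><td>AT</td><td>USD</td>'] ;
fetchPage = @(url) sampleRow ;

urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
           '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
           '</td>\s*<td>(?<currency>[^<]+)'] ;

nPages  = 3 ;                          % assumed small here; 83 for the full catalog
perPage = cell( 1, nPages ) ;
for pageId = 1 : nPages
    url = sprintf( '%s%d', urlBase, pageId ) ;
    try
        html = fetchPage( url ) ;
    catch err
        % Skip pages that fail to download instead of aborting the crawl.
        warning( 'Page %d failed: %s', pageId, err.message ) ;
        continue ;
    end
    perPage{pageId} = regexp( html, pattern, 'names' ) ;
    pause( 0.1 ) ;                     % short delay between requests
end

% Drop pages that failed or had no matches, then merge into one struct array.
perPage = perPage( ~cellfun( @isempty, perPage ) ) ;
allData = [perPage{:}] ;
```

With the stand-in fetcher, `allData` holds one entry per page; with the real site it would hold up to 100 entries per page.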
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
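As a quick illustration of the named-token syntax used in the pattern above, here is a much smaller example. The input string is made up for the purpose; it only imitates two table cells.

```matlab
str = '<td>AT</td><td>USD</td>' ;

% (?<text>expr) stores whatever expr matches in a field called "text".
% [^<]+ means "one or more characters that are not <", i.e. the cell content.
cells = regexp( str, '<td>(?<text>[^<]+)</td>', 'names' ) ;

% cells is a struct array with one element per match.
cellTexts = { cells.text } ;
```

Each additional `(?<fieldName>...)` group in an expression adds another field to the output struct, which is how the longer pattern produces `ibSymbol`, `externalUrl`, `name`, `symbol` and `currency`.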
