I want to extract the page buttons/widgets in a website using URLREAD.
I want to learn what the common expression is for the buttons/widgets that contain the page numbers of a catalog, e.g. like on this website . The capture shows the numbers I'd like to get using the URLREAD command.
Do you know how to do this? It would help me a lot. My plan was to look for the common expression manually, so I tried printing the output of URLREAD into a .txt file, but I couldn't get the whole HTML code written into it.
Thanks a lot,
Aquiles
3 Comments
Walter Roberson
14 Sep 2017
Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.
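On dumping the HTML to a .txt file from within MATLAB: one possible cause of a truncated file (only a guess, since the original code was not posted) is passing the HTML itself as the FPRINTF format string, in which case any % or \ in the page is interpreted as formatting. A minimal sketch that writes the text verbatim instead, using a made-up HTML string in place of the URLREAD output and a hypothetical file name in the temp folder:

```matlab
% Stand-in for the output of URLREAD. The '%' and '\' characters are the
% kind of content that gets misread when the text is used as a format string.
html = sprintf( '<td width="50%%">AT</td>\n<a href="a\\b.html">x</a>\n' ) ;

fileName = fullfile( tempdir, 'page_dump.txt' ) ;   % hypothetical file name
fid = fopen( fileName, 'w' ) ;
assert( fid ~= -1, 'Could not open %s for writing.', fileName ) ;
fprintf( fid, '%s', html ) ;    % '%s' writes the text as-is
fclose( fid ) ;

% Read the file back to check that nothing was lost.
htmlCheck = fileread( fileName ) ;
```

With a real page, replace the first assignment with html = urlread( url ).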
Accepted Answer
Cedric
14 Sep 2017
Edited: Cedric
14 Sep 2017
When you start clicking on pages, the page ID appears in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
where it is the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
end
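If you would rather read the number of pages from the pager buttons than hard-code 83, the same page=N parameter can be pulled out of the links. This is only a sketch: the HTML fragment below is made up, and it assumes that the pager links on the real page carry a page=N parameter in their href, which should be checked against the actual source.

```matlab
% Hypothetical fragment of pager markup; the real page may differ.
html = [ '<a href="index.php?f=2222&exch=amex&limit=100&page=1">1</a>', ...
         '<a href="index.php?f=2222&exch=amex&limit=100&page=2">2</a>', ...
         '<a href="index.php?f=2222&exch=amex&limit=100&page=83">83</a>' ] ;

% Capture the digits that follow "page=" in every link. The \< anchor
% requires "page" to start a word, so a parameter like "homepage=" is skipped.
tokens  = regexp( html, '\<page=(\d+)', 'tokens' ) ;
pageIds = cellfun( @(t) str2double( t{1} ), tokens ) ;
nPages  = max( pageIds ) ;
```

nPages can then replace the literal 83 as the upper bound of the loop.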
Next, you may want to parse the HTML to get the table data, and you can use regular expressions for this. Developing the pattern on page 1:
pageId = 1 ;
url = sprintf( '%s%d', urlBase, pageId ) ;
html = urlread( url ) ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
'(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
'</td>\s*<td>(?<currency>[^<]+)'] ;
data = regexp( html, pattern, 'names' ) ;
With that you get:
>> data
data =
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
>> data(1)
ans =
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
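html_ext = urlread( data(1).externalUrl ) ;
pattern_ext = '...' ; % Placeholder: the pattern for the product page goes here.
data_ext = regexp( html_ext, pattern_ext, ... ) ; % Placeholder: pick the REGEXP outputs you need.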
I'll let you develop that part, though! Putting everything together, you get a crawler/parser for the whole thing:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
pattern_ext = '...' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
        html_ext = urlread( data(productId).externalUrl ) ;
        data_ext = regexp( html_ext, pattern_ext, ... ) ;
        % Do something.
    end
end
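One way to fill in the "Do something" for the outer loop is to keep the rows of every page and concatenate them at the end. The sketch below is an illustration rather than part of the crawler above: fetchPage is a hypothetical stand-in for URLREAD that returns one made-up table row per page (so it runs without network access), and the symbols and example.com URLs are invented.

```matlab
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;

% Hypothetical stand-in for URLREAD: each fake page holds a single row.
rowFmt = [ '<td>%s</td> <td><a href="javascript:NewWindow(''http://example.com/%s'',', ...
    '''Details'');">%s CORP</a></td> <td>%s</td> <td>USD</td>' ] ;
symbols = { 'AAA', 'BBB', 'CCC' } ;            % invented symbols
fetchPage = @(pageId) sprintf( rowFmt, symbols{pageId}, symbols{pageId}, ...
    symbols{pageId}, symbols{pageId} ) ;

nPages  = numel( symbols ) ;
perPage = cell( 1, nPages ) ;                  % one struct array per page
for pageId = 1 : nPages
    try
        html = fetchPage( pageId ) ;
        perPage{pageId} = regexp( html, pattern, 'names' ) ;
    catch err
        % Leave the cell empty and move on if a page cannot be processed.
        warning( 'Page %d skipped: %s', pageId, err.message ) ;
    end
    % pause( 1 ) ;                             % consider a delay on a real server
end
allData = [ perPage{:} ] ;                     % 1-by-N struct array over all pages
```

For the real site, html = urlread( sprintf( '%s%d', urlBase, pageId ) ) takes the place of the fetchPage call. Collecting into a cell array first avoids growing a struct array inside the loop, and pages that failed simply contribute nothing to the concatenation.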
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
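As a small warm-up for the named-token syntax used in the patterns above, here is a minimal sketch on a made-up two-row table fragment. The HTML and the field names symbol and currency are chosen for illustration only and are much simpler than the real page.

```matlab
% Hypothetical two-row table fragment.
html = [ '<tr><td>AT</td><td>USD</td></tr>', ...
         '<tr><td>BMO</td><td>CAD</td></tr>' ] ;

% (?<name>...) stores what the group matched in a field called "name";
% [^<]+ means "one or more characters that are not <".
pattern = '<td>(?<symbol>[^<]+)</td>\s*<td>(?<currency>[^<]+)</td>' ;
rows = regexp( html, pattern, 'names' ) ;

% One struct per match, with one field per named token.
for k = 1 : numel( rows )
    fprintf( '%s is quoted in %s\n', rows(k).symbol, rows(k).currency ) ;
end
```

The [^<]+ idiom is what keeps each token from running past the end of its table cell, which is also how the longer pattern above isolates ibSymbol, name, symbol and currency.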