MATLAB Answers

Help with REGEXP: extracting info from a fragment of URL inside the HTML code.

Ajpaezm on 3 Apr 2019
Commented: Ajpaezm on 4 May 2019
Hey guys, I have used webread/urlread to get info from this site. The output is huge, but I'm only interested in these lines:
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1'> < </a></li>
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
<li class='disabled'><span>...</span></li>
<li><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22'>22</a></li>
As you can see, a 'segment' of the main URL is included in this part of the HTML code (this one: /en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5). From this, I'd like to get the numbers at the very end of this fragment, or the numbers between the >< symbols (like 1, 2, 3, 4, 5 and 22).
I tried this, foolishly thinking it was going to help:
[a1, a2]=regexp(url, pattern,'match');
But it didn't work. Do you have any suggestions for this one? I previously tried '<li[^>]*><a[^>]*>(.*?)</a></li>' with the 'tokens' option; although it captures these values, it also captures a lot of stuff I don't want.
Thanks for your help!

  2 Comments

Walter Roberson on 3 Apr 2019
The Text Analytics Toolbox might have suitable tools.
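As a rough sketch of what that could look like, assuming the Text Analytics Toolbox (R2018b or later) is installed: parse the page with htmlTree, select the links inside list items, keep those whose href contains 'page=', and convert their text to numbers. The sample HTML and the variable names below are illustrative and not from the original post.

```matlab
% Sample input: a few of the pagination lines from the question,
% wrapped in a <ul> so the snippet is a complete fragment (assumption).
code = "<ul>" + ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1'> < </a></li>" + ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>" + ...
    "<li class=''><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>" + ...
    "<li><a href='/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22'>22</a></li>" + ...
    "</ul>";

tree  = htmlTree(code);               % parse the HTML
links = findElement(tree, "li a");    % <a> elements nested inside <li>

% Keep only the pagination links, identified by 'page=' in the href.
hrefs = getAttribute(links, "href");
links = links(contains(hrefs, "page="));

% The link text is the page number; non-numeric text (the '<' link) becomes NaN.
pages = str2double(extractHTMLText(links));
pages = pages(~isnan(pages));
disp(pages)
```

On the real page the 'page=' filter may need to be tightened if other links carry the same query parameter.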



per isakson on 3 May 2019
Edited: per isakson on 4 May 2019
"Keep in mind that regular expressions are not a robust or neat way to parse HTML:" Anyhow, this can serve as an exercise in regular expressions.
>> cssm('h:\m\cssm\cssm.txt')
ans =
1 2 3 4 5 22
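function num = cssm( ffs )
    % Read the saved HTML into one character vector.
    str = fileread( ffs );
    % Literal URL prefix that precedes every page number.
    xpr = '/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=';
    % Escape regexp metacharacters such as '.' and '?'.
    xpr = regexptranslate( 'escape', xpr );
    % Match the digits before '<' that are preceded by: prefix, digits, '>
    xpr = ['(?<=',xpr,'\d+''>)\d+(?=<)'];
    cac = regexp( str, xpr, 'match' );
    num = str2double( cac );
end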
where h:\m\cssm\cssm.txt contains the HTML code from the question.
Because it contains the expression '\d+', the look-behind has variable length, which may hamper performance.
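One way to sidestep the variable-length look-behind is to capture the digits as a token instead. The following is a minimal sketch, not a replacement for the function above: it assumes the link text always follows the closing quote of the href directly, and the shortened pattern ('limit=100&page=' rather than the full path) and the variable names are illustrative.

```matlab
% Sample input: three of the pagination lines from the question.
str = [ ...
    '<li class=''''><a href=''/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=-1''> < </a></li>', newline, ...
    '<li class=''''><a href=''/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=5''>5</a></li>', newline, ...
    '<li><a href=''/en/index.php?f=2222&exch=IBIS&showcategories=STK&p=&cc=&limit=100&page=22''>22</a></li>'];

% Capture the digits after 'page=' as a token; the trailing '>\d+< requires
% numeric link text, so the 'page=-1' (' < ') link is skipped.
xpr = 'limit=100&page=(\d+)''>\d+<';
tok = regexp(str, xpr, 'tokens');   % cell array of 1x1 cells

% Flatten the nested cells and convert to numbers.
num = str2double([tok{:}]);
disp(num)                           % displays 5 and 22
```

Here the fixed text is matched as part of the pattern rather than inside a look-behind, so only the parenthesised digits are returned.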

  3 Comments

Ajpaezm on 4 May 2019
Thanks a lot Per!
Truly. Do you know any great tutorials on regex for MATLAB? Any source? Regular expressions are always a difficult topic for me. Some consider them an art in themselves, and most of us would agree, haha.
Ajpaezm on 4 May 2019
Oh I remember that post, I went back there a couple of times for other things, but for this one I couldn't find a solution with those resources. Actually I gave up this route and tried something else until your reply on this post today.
I was trying to do it with regexp directly, using that piece of the string that always seems to repeat itself in that part of the HTML code. But I failed :/
I'll analyze this approach you used for HTML parsing. :)


More Answers (0)
