Web scraping with regular expressions: getting rid of HTML tags

pietro on 3 Jun 2017
Edited: pietro on 4 Jun 2017
Hi all,
I am writing some web-scraping code and am therefore using regular expressions. I need to isolate the words in a string, and HTML tags should of course be excluded. HTML tags are words enclosed in < > (e.g. <br>). Unfortunately, my code does not work and I am wondering why. Here is an example:
regexp('qu <qa>','(?!<)\w*(?!>)','match')
My expected result is 'qu', but instead I get 'qu' and 'q'. The code works with the string 'qu q'. What can I do to solve this issue?
The following code works:
regexp('qu qa','(?!<)\w*(?!>)','match')


Guillaume on 3 Jun 2017
The first part of your expression, (?!<), is a look-ahead: it tests the character that follows, not the one that precedes the match. You want a look-behind instead. Add a < before the !:
regexp('qu <qa>', '(?<!<)\w*(?!>)', 'match')
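To make the difference concrete, here is a minimal sketch that runs both patterns on the input from the question. The variable names are illustrative and not part of the original posts.

```matlab
str = 'qu <qa>';

% Original pattern: (?!<) is a look-ahead, so it only checks that the NEXT
% character is not '<'. It does not look at the character before the match,
% so a match can still start right after '<'.
withLookahead = regexp(str, '(?!<)\w*(?!>)', 'match');    % 'qu' and 'q', as reported in the question

% Corrected pattern: (?<!<) is a look-behind, so a match cannot start
% immediately after '<'.
withLookbehind = regexp(str, '(?<!<)\w*(?!>)', 'match');  % 'qu' only
```

Note that both lookarounds only inspect the characters directly adjacent to the match. Because \w* can backtrack to a shorter match, a longer tag name such as <abc> may still yield a partial match from its interior, so this pattern is best treated as a fix for the example at hand rather than a general tag filter.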
  3 Comments
pietro on 4 Jun 2017
Edited: pietro on 4 Jun 2017
Thanks for your reply. I hadn't thought of using regexprep.
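The comment above mentions regexprep, but the rest of the comment thread is not reproduced here. The following is therefore only a guess at the kind of approach meant: remove every <...> tag first, then collect the remaining words. The tag pattern '<[^>]*>', the sample string, and the variable names are assumptions for illustration.

```matlab
str = 'qu <qa> and <br>more';

% Replace each tag (a '<', any run of characters other than '>', then '>')
% with a space, so that words on either side of a tag are not fused together.
noTags = regexprep(str, '<[^>]*>', ' ');

% Extract the remaining words; \w+ (one or more) avoids zero-length matches.
words = regexp(noTags, '\w+', 'match');   % {'qu', 'and', 'more'}
```

Splitting the task into two steps keeps each pattern simple and avoids relying on lookarounds. It is still a rough tool: real pages contain comments, scripts, and attribute values that can include '>', which a pattern this simple does not handle.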

