Web scraping with regular expressions: getting rid of HTML tags

pietro on 3 Jun 2017
Edited: pietro on 4 Jun 2017
Hi all,
I am writing some web-scraping code and am therefore using regular expressions. I need to isolate the words in a string; HTML tags should of course not be included. HTML tags are words enclosed in < > (e.g. <br>). Unfortunately, my code does not work and I am wondering why. Here is an example:
regexp('qu <qa>','(?!<)\w*(?!>)','match')
My expected result is 'qu', but instead I get 'qu' and 'q'. The code works with the string 'qu q'. What can I do to solve this issue?
Thanks
Regards,
Pietro
The following code works: regexp('qu qa','(?!<)\w*(?!>)','match')

Accepted Answer

Guillaume on 3 Jun 2017
The first part of your expression is a look-ahead. You want a look-behind instead. Add a < before the !, so that (?!<) becomes (?<!<):
regexp('qu <qa>', '(?<!<)\w*(?!>)', 'match')
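To see what the change does, here is a small sketch that runs both patterns on the string from the question. `(?!<)` is a look-ahead, so it only stops a match from starting on the `<` itself; `(?<!<)` is a look-behind, so it stops a match from starting directly after `<`. The last input, `'qu <span>'`, is a made-up example that is not part of the original question. It is there to check how the pattern behaves with a longer tag name, because the trailing `(?!>)` can still be satisfied by backtracking and pieces of the tag name may slip through.

```matlab
str = 'qu <qa>';

% Original pattern: (?!<) is a look-ahead, so a match may still start
% right after '<'. As reported in the question this returns 'qu' and 'q'.
withLookahead = regexp(str, '(?!<)\w*(?!>)', 'match');

% Corrected pattern: (?<!<) is a look-behind, so a match cannot start
% right after '<'. For this string only 'qu' is returned.
withLookbehind = regexp(str, '(?<!<)\w*(?!>)', 'match');

disp(withLookahead)
disp(withLookbehind)

% Hypothetical input with a longer tag name (an assumption, not from the
% thread): inspect the result to see whether tag fragments leak through.
longerTag = regexp('qu <span>', '(?<!<)\w*(?!>)', 'match');
disp(longerTag)
```

If the last result contains anything besides 'qu', the single-pattern approach is too fragile for general HTML and removing the tags first is the safer route.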
  3 Comments
Guillaume on 3 Jun 2017
It's a lot more difficult to tell a regular expression not to match something than it is to tell it to match something. Therefore, I'd do it in two passes.
1. remove the tags:
notags = regexprep(yourstring, '<[^>]*>', '')
2. match whatever it is you want to match
matches = regexp(notags, '\w+', 'match')
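The two passes above can be put together as follows. This is only a minimal sketch: the input string `html` is an invented example, and the pattern assumes that no `>` character appears inside a tag (for instance in an attribute value).

```matlab
% Hypothetical input (not from the original thread): text mixed with tags.
html = 'qu <qa> some <b>bold</b> text<br>';

% Pass 1: delete every tag, i.e. everything from '<' up to the next '>'.
notags = regexprep(html, '<[^>]*>', '');

% Pass 2: collect the remaining runs of word characters.
words = regexp(notags, '\w+', 'match');

disp(words)   % 'qu'  'some'  'bold'  'text'
```

Note that replacing a tag with '' joins words that were separated only by a tag, e.g. 'one<br>two' becomes 'onetwo'. If that can happen in your pages, replacing each tag with a space (' ') instead keeps the words apart.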
pietro on 4 Jun 2017
Edited: pietro on 4 Jun 2017
Thanks for your reply. I hadn't thought of using regexprep.
