Regexp to extract all characters in a varied string up to match.

Question

Marshall on 12 Nov 2014

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match

Commented: Geoff Hayes on 13 Nov 2014

Hello userbase,

I'm new to regexes. I'm working with some transistor test data and trying to extract information from .csv file names for sorting prior to further probing.

They often have a format such as this:

target = Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv
target = Some Other Test [123456_LS (further info including dates and temperatures)].csv

I want to extract the entire string up to and including the HS/LS variant, with the optional number that follows it, as this represents the device and test. The further info relates to parameters.

The Some Test Performed section can be a single word or multiple words, and can contain special characters (&-_).

I'm looking for HS, LS, HS1, HS2, HS3, LS1, LS2, LS3.

I've tried lookbehind assertions, but it feels kludgy and I've guessed a bit:

pattern = '(?<=((HS)|(HS)\d|(LS)|(LS)\d))\s'

How can I improve this?

What does the ? normally do? (I see that it is a special case here, for the lookaround.)

My desired regexp(target, pattern, 'match') output would be:

match = Some Test Performed [12345678987_HS1
match = Some Other Test [123456_LS

Or at least the index of the final character, so I could use target(1:match) to extract my string. Is there some useful 'from start of target until match' metacharacter?

Best regards and thanks for reading, Marshall


Answer 1

Geoff Hayes on 12 Nov 2014

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/162373-regexp-to-extract-all-characters-in-a-varied-string-up-to-match#answer_158655

Marshall - if all of your target strings (the csv filenames) have an open parenthesis ( in them, and you want all the characters before that, then you could use a strfind call to get the index of the open parenthesis, and then copy all characters up to that index. Something like

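 target = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 idx    = strfind(target, '(');   % indices of every open parenthesis
 if ~isempty(idx)
     % use the first occurrence and trim the space that precedes it
     match = strtrim(target(1:idx(1)-1));
 end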

which would return

 match =
    Some Test Performed [12345678987_HS1

However, if the open parenthesis rule is not valid for all cases, then you could try simplifying your pattern to

pattern = '.+[HL]S[\d\s]';

where

.+ means match one or more of any character, including whitespace (the plus sign means one or more, and it is greedy, so it takes as many characters as it can);

[HL]S means a single character match on either H or L followed by an S; and

[\d\s] means match on either a single numeric character or any whitespace character.

So with your two target strings above, using this pattern we would see

 target1 = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
 target2 = 'Some Other Test [123456_LS (further info including dates and temperatures)].csv';
 pattern = '.+[HL]S[\d\s]';
 match1 = regexp(target1,pattern,'match');
 match2 = regexp(target2,pattern,'match');

with

 match1 = 
    'Some Test Performed [12345678987_HS1'
 match2 = 
    'Some Other Test [123456_LS '

A problem with the above pattern may occur when additional HS or LS substrings follow the first pattern match, because the greedy .+ extends the match to the last one. For example, if your target is

 target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';
 match3 = regexp(target3,pattern,'match')

then the match is found to be

 match3 = 
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS '
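The over-long match comes from .+ being greedy: it takes as much as it can and backs off only far enough to reach the last HS or LS that is followed by a digit or whitespace. As a small illustration (not a fix), the lazy form .+? stops at the first such occurrence instead, which here is too early. The 'once' option is an addition that is not in the answer above; it is used so that each result is a plain string rather than a cell array.

```matlab
target3 = 'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv';

% Greedy: .+ runs to the LAST "HS"/"LS" followed by a digit or whitespace.
greedyMatch = regexp(target3, '.+[HL]S[\d\s]', 'match', 'once');

% Lazy: .+? stops at the FIRST "HS"/"LS" followed by a digit or whitespace.
lazyMatch = regexp(target3, '.+?[HL]S[\d\s]', 'match', 'once');
```

Neither result is the one wanted for this target, so switching between greedy and lazy matching is not enough on its own.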

So you may want to narrow the pattern so that a numeric string followed by an underscore must precede the HS/LS part

 newPattern = '.+\d+_[HL]S[\d\s]'; 
 match3     = regexp(target3,newPattern,'match')

which returns the desired

 match3 = 
    'Some HS Test Performed [12345678987_HS1'

This new pattern will work for the other two targets as well.

Note that the second match has a trailing whitespace character. You may want to wrap your regexp call in strtrim to remove it.
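A minimal sketch of that strtrim wrapping, applied to all three targets. The 'once' option and the loop are additions for illustration. The second pattern, variantPattern, is a hypothetical alternative that is not part of the answer above: it replaces [\d\s] with \d?, where the ? makes the preceding digit optional, so no whitespace is captured in the first place.

```matlab
targets = { ...
    'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv', ...
    'Some Other Test [123456_LS (further info including dates and temperatures)].csv', ...
    'Some HS Test Performed [12345678987_HS1 (further info including dates and HS temperatures)].csv'};
newPattern = '.+\d+_[HL]S[\d\s]';

% 'once' returns each match as a string rather than a 1x1 cell;
% strtrim then removes the trailing space captured by [\d\s].
matches = cell(size(targets));
for k = 1:numel(targets)
    matches{k} = strtrim(regexp(targets{k}, newPattern, 'match', 'once'));
end

% Hypothetical variant: \d? matches zero or one digit after HS/LS,
% so the match never includes whitespace and strtrim is not needed.
variantPattern = '.+\d+_[HL]S\d?';
variantMatches = regexp(targets, variantPattern, 'match', 'once');
```

The variant is looser in one respect: it accepts HS or LS followed by any character, not only a digit or whitespace, so it may be a poor fit if names such as _HSX can occur.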

2 Comments

Marshall on 13 Nov 2014

Hi, that's a great and thorough answer. Thanks for taking the time to explain the metacharacters too, and for guessing that the bracket after HS/LS isn't the standard case (it isn't).

And if I exclude the 'match' option, the reason regexp returns [1] is that the match begins at the start of the string?

strtrim is a good suggestion too. Thanks again :)

Geoff Hayes on 13 Nov 2014

Glad to be able to help, Marshall. And yes, the [1] is returned when you remove the 'match' option because 1 is the start index of the match.
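The same call can also return the index of the final matched character, which was the fallback asked for in the question. A short sketch, assuming the narrowed pattern from the answer; the variable names startIdx, endIdx and prefix are illustrative only.

```matlab
target  = 'Some Test Performed [12345678987_HS1 (further info including dates and temperatures)].csv';
pattern = '.+\d+_[HL]S[\d\s]';

% With no output option, regexp returns the start index first and the
% end index second; 'once' makes both scalars for the first match.
[startIdx, endIdx] = regexp(target, pattern, 'once');

% Slice the original string with those indices, trimming any trailing space.
prefix = strtrim(target(startIdx:endIdx));
```

If the pattern does not match, both indices come back empty, so it may be worth checking isempty(endIdx) before slicing.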
