regexprep: Nested ordinal token not captured

Question

0 votes

I am trying to modify file paths that contain consecutively repeated folder names, e.g., "archive" is repeated in "Clients/archive/archive/20220428.1349.zip". I want to truncate each path after the 2nd occurrence of the repeated folder, keeping the trailing file path separator, e.g., "Clients/archive/archive/". I thought this would do it:

FolderInSelf = regexprep( FolderInSelf, ...
    "^(.*/(\w+)/\2/).*", "$1" );

"FolderInSelf" is a column vector of strings, each representing a file path that contains a consecutively repeated folder name.

The outer set of brackets captures the 1st token, which is the path up to and including the repeated folder and its trailing slash, excluding anything after that slash.

The inner set of brackets is the 2nd token, which is the first occurrence of the repeated folder name ("archive" in the example above).

The back reference "\2" expresses the fact that the token is repeated, with the two occurrences separated by a slash.

I am puzzled as to why the above "regexprep" does nothing to the strings in FolderInSelf. To troubleshoot, I tried a simpler command, which worked as expected:

>> regexprep( "Clients/archive/archive/20220428.1349.zip", ...
              "^(.*/(archive)/archive/).*", "$1" )
   ans = "Clients/archive/archive/"

If I replace "$1" with "$2", I expect to get "archive" (the 2nd token). Instead, I get:

ans = "$2"

This suggests that the 2nd token is not being captured. Can anyone point out what I am doing wrong?

1 Comment

FM on 5 Jan 2023

Edited: FM on 5 Jan 2023

Thanks, Rik. You're absolutely right, at least as of 2018: https://www.mathworks.com/matlabcentral/answers/436217-regular-expression-are-nesting-of-group-operators-supported

If you don't mind posting this as the answer, I'll mark it as answered.

This is quite a severe limitation in regular expressions. :(


Answer 1 (Accepted Answer)

Rik on 5 Jan 2023

Edited: Rik on 5 Jan 2023

1 vote

I'm not entirely sure tokens can be nested (at least in the implementation that Matlab uses).

You can also explore the output of your tokens first with regexp:

regexp( "Clients/archive/archive/20220428.1349.zip", ...
    "^(.*/(archive)/archive/).*", "tokens" )
ans = 1×1 cell array
    {["Clients/archive/archive/"]}

I suspect the inner parentheses are considered grouping, not token-capturing.

I just tested this on the oldest Matlab I can run (v6.5 from 2002, which requires a bit of trickery to extract the tokens), and there the result is the same as in the online editor. So the remarks from the thread you found hold for just about any release of Matlab you can still get to run.
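Given that only the outermost group seems to yield a token, one possible workaround is to write the pattern so that no capturing group sits inside another. The following is a minimal sketch under stated assumptions, not a definitive fix: the second sample path is hypothetical, and the pattern truncates at the first repeated folder it finds, whereas the greedy ".*" in the question would have targeted the last one.

```matlab
% Sample paths: the first is from the question, the second is hypothetical
paths = [ "Clients/archive/archive/20220428.1349.zip"
          "IT/sync/sync/notes.txt" ];

% Tokens 1 and 2 sit side by side instead of being nested:
%   (^|/)   start of the string or a separator, so a whole folder name is matched
%   (\w+)   the folder name, which the back reference \2 requires to repeat
% The match runs from the repeated pair to the end of the path and is
% replaced by the pair alone; the text before the match is left untouched.
truncated = regexprep( paths, "(^|/)(\w+)/\2/.*", "$1$2/$2/" )
```

As in the question, "\w+" only covers folder names made of letters, digits and underscores; names containing "." or "-" would need a wider character class such as "[\w.-]+".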

It might interest you to know that the output on GNU Octave (a mostly-compatible software suite) is not the same:

x = regexp( 'Clients/archive/archive/20220428.1349.zip', '^(.*/(archive)/archive/).*', 'tokens' )
x =
{
  [1,1] =
  {
    [1,1] = Clients/archive/archive/
    [1,2] = archive
  }

}

3 Comments

Rik on 5 Jan 2023

I understand it may not be a solution for you, but I just wanted to put it out there in case it solves the issue for someone else.

Reading your comment, I don't believe I have a suggestion you have not thought of.

FM on 5 Jan 2023

That's good. Hopefully it will help someone.


Answer 2

FM on 5 Jan 2023

Edited: FM on 7 Jan 2023

0 votes

If table "tFolderInSelf" contains a column "Path" consisting of a vertical vector of strings, then the following code truncates the paths after the second consecutive repetition of a folder name:

% Extract the repeated folder pair, e.g. "archive/archive/"
tFolderInSelf.Folder_x2 = regexp( tFolderInSelf.Path, ...
   "\<([\w.-]+)/\1/", "match", "once" );
% Match the path up to and including the repeated folder pair
tFolderInSelf.PathTrunc = regexp( tFolderInSelf.Path, ...
   ".*\<"+tFolderInSelf.Folder_x2, "match", "once" );
% Move the match "PathTrunc" next to "Path" for comparison
tFolderInSelf = movevars( ...
   tFolderInSelf, "PathTrunc", After="Path" );
% Cleaned-up viewing
categorical( unique( tFolderInSelf.PathTrunc ) )

Clients/archive/archive/
IT/sync/sync/
Knowledge/Bayes/Bayes/
Knowledge/MathPrg/MIPsolveSpd/MIPsolveSpd/
PD/SLT/20220411-0627/20220411/20220411/
PD/SLT/20220411-0627/20220425/20220425/
<...etc...>

Each row of the column tFolderInSelf.PathTrunc is a scalar string. The "regexp" option "once" ensures that each row has only one element. This allows "regexp" to return a column vector of strings rather than a column vector of cells, as it does not have to accommodate variable-length row vectors of strings for each table row.

This code may break if one of the paths in tFolderInSelf.Path does not contain a repeated folder. In my case, the data set was built using only paths that contain repeated folders.
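If the data set could contain paths without a repeated folder, one option is to filter those rows out before truncating. This is only a sketch: the sample table below is hypothetical, and the names "pat", "startIdx" and "hasRepeat" are introduced here for illustration. It relies on "regexp" returning, for a cell array input with "once", one cell per path holding either the start index of the match or an empty array.

```matlab
% Hypothetical sample data; the last path has no repeated folder
Path = [ "Clients/archive/archive/20220428.1349.zip"
         "IT/sync/sync/notes.txt"
         "Clients/archive/20220428.1349.zip" ];
tFolderInSelf = table( Path );

% Same repeated-folder pattern as in the code above
pat = "\<([\w.-]+)/\1/";

% One cell per path: the start index of the match, or [] if there is none
startIdx  = regexp( cellstr( tFolderInSelf.Path ), pat, "once" );
hasRepeat = ~cellfun( @isempty, startIdx );

% Keep only the rows that contain a consecutively repeated folder
tFolderInSelf = tFolderInSelf( hasRepeat, : );
```

The truncation code can then be run on the filtered table as before.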
