regexprep: Nested ordinal token not captured

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
I am trying to modify file paths that contain consecutively repeated folder names, e.g., "archive" is repeated in "Clients/archive/archive/20220428.1349.zip". I want to truncate each path after the 2nd occurrence of the repeated folder, keeping the trailing file path separator, e.g., "Clients/archive/archive/". I thought this would do it:
FolderInSelf = regexprep( FolderInSelf, ...
"^(.*/(\w+)/\2/).*", "$1" );
"FolderInSelf" is a column vector of strings, each representing a file path that contains a consecutively repeated folder name.
The outer set of brackets captures the 1st token, which is the path up to and including the repeated folder and its trailing slash, excluding anything after that slash.
The inner set of brackets is the 2nd token, which is the first occurrence of the repeated folder name ("archive" in the example above).
The back reference "\2" expresses that the token is repeated, with a slash separating the two occurrences.
I am puzzled as to why the above "regexprep" does nothing to the strings in FolderInSelf. To troubleshoot, I tried a simpler command, which worked as expected:
>> regexprep( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "$1" )
ans = "Clients/archive/archive/"
If I replace "$1" with "$2", I expect to get "archive" (the 2nd token). Instead, I get:
ans = "$2"
This suggests that the 2nd token is not being captured. Can anyone point out what I am doing wrong?
1 Comment
FM on 5 Jan 2023
Edited: FM on 5 Jan 2023
If you don't mind posting this as the answer, I'll mark it as answered.
This is quite a severe limitation in regular expressions. :(


Accepted Answer

Rik on 5 Jan 2023
Edited: Rik on 5 Jan 2023
I'm not entirely sure tokens can be nested (at least in the implementation that MATLAB uses).
You can also explore the output of your tokens first with regexp:
regexp( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "tokens" )
ans = 1×1 cell array
{["Clients/archive/archive/"]}
I suspect the inner parentheses are considered grouping, not token-capturing.
I just tested this on the oldest MATLAB I can run (v6.5 from 2002, which requires a bit of trickery to extract the tokens), and there the result is the same as in the online editor. So the remarks from the thread you found hold for just about any release of MATLAB you can still get to run.
It might interest you to know that the output on GNU Octave (a mostly-compatible software suite) is not the same:
x=regexp( 'Clients/archive/archive/20220428.1349.zip', '^(.*/(archive)/archive/).*', 'tokens' )
x =
{
[1,1] =
{
[1,1] = Clients/archive/archive/
[1,2] = archive
}
}
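Given that MATLAB appears to keep only the outermost token, one possible workaround is to un-nest the groups so that the path prefix and the folder name are separate top-level tokens, and then rebuild the repeated pair in the replacement. The following is a minimal sketch under stated assumptions rather than a tested drop-in: the variable names `p` and `out` are illustrative, and the `\<` word-start anchor and the `[\w.-]+` folder-name class are assumptions about what a folder name looks like.

```matlab
% Example path from the question
p = "Clients/archive/archive/20220428.1349.zip";

% Token 1: everything before the repeated folder (may be empty).
% Token 2: the folder name; \2 requires it to repeat right after a slash.
% Neither token is nested inside the other, so both $1 and $2 are available.
out = regexprep( p, "^(.*)\<([\w.-]+)/\2/.*", "$1$2/$2/" );

disp( out )
```

For the example path this should give "Clients/archive/archive/". Because the leading `.*` is greedy, a path with more than one repeated pair would be truncated after the last pair, and a path with no repeated folder is left unchanged.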
3 Comments
Rik on 5 Jan 2023
I understand it may not be a solution for you, but I just wanted to put it out there in case it solves the issue for someone else.
Reading your comment, I don't believe I have a suggestion you have not thought of.
FM on 5 Jan 2023
That's good. Hopefully it will help someone.


More Answers (1)

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
If table "tFolderInSelf" contains a column "Path" consisting of a vertical vector of strings, then the following code truncates the paths after the second consecutive repetition of a folder name:
% Extract the repeated folder names, e.g. "archive/archive/"
tFolderInSelf.Folder_x2 = regexp( tFolderInSelf.Path, ...
    "\<([\w.-]+)/\1/", "match", "once" );
% Match the path up to and including the repeated folder name
tFolderInSelf.PathTrunc = regexp( tFolderInSelf.Path, ...
    ".*\<"+tFolderInSelf.Folder_x2, "match", "once" );
% Move the match "PathTrunc" next to "Path" for comparison
tFolderInSelf = movevars( ...
    tFolderInSelf, "PathTrunc", After="Path" );
% Cleaned-up viewing
categorical( unique( tFolderInSelf.PathTrunc ) )
Clients/archive/archive/
IT/sync/sync/
Knowledge/Bayes/Bayes/
Knowledge/MathPrg/MIPsolveSpd/MIPsolveSpd/
PD/SLT/20220411-0627/20220411/20220411/
PD/SLT/20220411-0627/20220425/20220425/
<...etc...>
Each row of the column tFolderInSelf.PathTrunc is a scalar string. The "regexp" option "once" ensures that each row has only one element. This allows "regexp" to return a column vector of strings rather than a column vector of cells, as it does not have to accommodate variable-length row vectors of strings for each table row.
This code may break if one of the paths in tFolderInSelf.Path does not contain a repeated folder. In my case, the data set was built using only paths that contain repeated folders.
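If the input might include paths without a repeated folder, one option is to flag the rows that do contain one and truncate only those. The sketch below is illustrative only: the sample `paths` array and the names `hasRepeat` and `pathTrunc` are made up for this example, and the truncation uses a single "regexprep" call with two un-nested tokens (path prefix and folder name), not the two-step "regexp" approach above.

```matlab
% Illustrative input: the second path has no repeated folder
paths = [ "Clients/archive/archive/20220428.1349.zip"
          "IT/sync/notes.txt" ];

% Flag rows that contain a consecutively repeated folder name.
% cellstr() makes regexp return a cell array regardless of input size.
hasRepeat = ~cellfun( @isempty, ...
    regexp( cellstr( paths ), "\<([\w.-]+)/\1/", "start", "once" ) );

% Truncate only the flagged rows; the other rows stay as empty strings
pathTrunc = strings( size( paths ) );
pathTrunc( hasRepeat ) = regexprep( paths( hasRepeat ), ...
    "^(.*)\<([\w.-]+)/\2/.*", "$1$2/$2/" );
```

Leaving the unflagged rows empty makes them easy to spot afterwards. Copying the original path into those rows instead would be an equally reasonable choice.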

Release

R2022a