regexprep: Nested ordinal token not captured

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
I am trying to modify file paths with consecutive repeated folder names, e.g., "archive" is repeated in "Clients/archive/archive/20220428.1349.zip". The modification I seek is to truncate that path beyond the 2nd occurrence of a repeated folder, leaving the trailing file path separator, e.g., "Clients/archive/archive/". I thought this would do it:
FolderInSelf = regexprep( FolderInSelf, ...
"^(.*/(\w+)/\2/).*", "$1" );
"FolderInSelf" is a vertical vector of strings, each representing a file path that contains a consecutively repeated folder name.
The outer set of brackets captures the 1st token, which is the path up to and including the repeated folder, excluding anything after the trailing slash.
The inner set of brackets is the 2nd token, which is for the first occurrence of the repeated folder name ("archive" in the example above).
The back reference "\2" describes the fact that the token is repeated, and separated by a slash.
I am puzzled by why the above "regexprep" does nothing to the strings in FolderInSelf. To troubleshoot, I chose a simpler command that worked as expected:
>> regexprep( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "$1" )
ans = "Clients/archive/archive/"
If I replace "$1" with "$2", I expect to get "archive" (the 2nd token). Instead, I get:
ans = "$2"
This suggests that the 2nd token is not being captured. Can anyone point out what I am doing wrong?
  1 Comment
FM on 5 Jan 2023
Edited: FM on 5 Jan 2023
If you don't mind posting this as the answer, I'll mark it as answered.
This is quite a severe limitation in regular expressions. :(

Accepted Answer

Rik on 5 Jan 2023
Edited: Rik on 5 Jan 2023
I'm not entirely sure tokens can be nested (at least in the implementation that Matlab uses).
You can also explore the output of your tokens first with regexp:
regexp( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "tokens" )
ans = 1×1 cell array
{["Clients/archive/archive/"]}
I suspect the inner parentheses are considered grouping, not token-capturing.
I just tested this on the oldest Matlab I can run (v6.5 from 2002, which requires a bit of trickery to extract the tokens), and there the result is the same as in the online editor. So the remarks from the thread you found hold for just about any release of Matlab you can still get to run.
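If the inner parentheses really are treated as grouping only, one possible workaround is to avoid nesting altogether and place the two tokens side by side. The following is a minimal sketch under that assumption, not a tested replacement for your data: the sample paths are made up for illustration, and `\w+` will not match folder names that contain dots or hyphens. As far as I can tell, this agrees with the note in the regexp documentation that only the outermost set of nested parentheses produces a token.

```matlab
% Sketch only: sibling (non-nested) tokens instead of a nested pair.
% The two sample paths below are hypothetical.
paths = [ "Clients/archive/archive/20220428.1349.zip"
          "IT/sync/sync/notes.txt" ];

% Token 1: everything up to and including the slash before the repeated folder.
% Token 2: the folder name; the back reference \2 requires it to repeat.
% The replacement rebuilds the truncated path from the two tokens.
truncated = regexprep( paths, "^(.*/)(\w+)/\2/.*", "$1$2/$2/" );

disp( truncated )
```

Because token 1 requires a slash before the repeated folder, a path that starts with the repeated pair (e.g., "archive/archive/x") is left unchanged by this pattern.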
It might interest you to know that the output on GNU Octave (a mostly-compatible software suite) is not the same:
x=regexp( 'Clients/archive/archive/20220428.1349.zip', '^(.*/(archive)/archive/).*', 'tokens' )
x =
{
[1,1] =
{
[1,1] = Clients/archive/archive/
[1,2] = archive
}
}
  3 Comments
Rik on 5 Jan 2023
I understand it may not be a solution for you, but I just wanted to put it out there in case it solves the issue for someone else.
Reading your comment, I don't believe I have a suggestion you have not thought of.
FM on 5 Jan 2023
That's good. Hopefully it will help someone.

More Answers (1)

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
If table "tFolderInSelf" contains a column "Path" consisting of a vertical vector of strings, then the following code truncates the paths after the second consecutive repetition of a folder name:
% Extract the repeated folder pair, e.g. "archive/archive/"
tFolderInSelf.Folder_x2 = regexp( tFolderInSelf.Path, ...
    "\<([\w.-]+)/\1/" , "match", "once" );
% Match the path up to and including the repeated folder pair
tFolderInSelf.PathTrunc = regexp( tFolderInSelf.Path, ...
    ".*\<"+tFolderInSelf.Folder_x2, "match", "once" );
% Move the match "PathTrunc" next to "Path" for comparison
tFolderInSelf = movevars( ...
tFolderInSelf, "PathTrunc", After="Path" );
% Cleaned-up viewing
categorical( unique( tFolderInSelf.PathTrunc ) )
Clients/archive/archive/
IT/sync/sync/
Knowledge/Bayes/Bayes/
Knowledge/MathPrg/MIPsolveSpd/MIPsolveSpd/
PD/SLT/20220411-0627/20220411/20220411/
PD/SLT/20220411-0627/20220425/20220425/
<...etc...>
Each row of the column tFolderInSelf.PathTrunc is a scalar string. The "regexp" option "once" ensures that each row has only one element. This allows "regexp" to return a column vector of strings rather than a column vector of cells, as it does not have to accommodate variable-length row vectors of strings for each table row.
It is possible that this code can be broken if one of the paths in tFolderInSelf.Path does not contain a repeated folder. In my case, the data set was built using only paths that contain repeated folders.
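For a data set that might include paths without a repeated folder, one option is to filter the rows before truncating. The following is a rough sketch, assuming a small made-up table "tDemo" with the same "Path" column; the table name and both sample paths are hypothetical, and I have not run this against my real data.

```matlab
% Hypothetical sample data; the second path has no repeated folder.
Path = [ "Clients/archive/archive/20220428.1349.zip"
         "Clients/current/report.pdf" ];
tDemo = table( Path );

% Same pattern as above for a consecutively repeated folder name
expr = "\<([\w.-]+)/\1/";

% regexp on a cell array returns one cell per row; with "once", each cell
% holds the start index of the first match, or [] when nothing matches.
hasRepeat = ~cellfun( @isempty, ...
    regexp( cellstr( tDemo.Path ), expr, "once" ) );

% Keep only the rows that are safe to pass to the truncation code
tRepeat = tDemo( hasRepeat, : );
```

The truncation steps can then be applied to "tRepeat" instead of the full table, so rows without a repeated folder never reach the second "regexp" call.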


Release

R2022a