problems with a regex
Hi.
I'm trying to create a regular expression to match and extract some information. Here are two examples of the source string:
example one: 10/0/leaf.nr.0 is a Projection error - touches edge - 3D points.csv
example two: 10/2/leaf.nr.2 is a Projection error - 3D points.csv
I want to extract the string between "is a " and either " - touches edge" or " - 3D". In both example strings this would be "Projection error", but it can be something else.
Currently I have the pattern:
'.*is\sa\s(?<type>.*)(?:\s\-\stouches\sedge)?(?:\s\-\s3D).*.csv'
for example one this returns (not expected):
'Projection error - touches edge'
but for example two it returns (expected):
'Projection error'
If I change the pattern to:
'.*is\sa\s(?<type>.*)(?:\s\-\stouches\sedge)(?:\s\-\s3D).*.csv'
so that (?:\s\-\stouches\sedge) is required to match, it returns (correctly):
'Projection error'
for example one, but now example two (which doesn't have the "touches edge" part) will not match (of course).
I don't get why the result for example one also contains " - touches edge" with the first pattern, when I ask it to match that group 0 or 1 times.
Any help will be highly appreciated.
Best regards, Thomas
Answers (2)
Muthu Annamalai
on 9 Jul 2013
A simple solution to parse the string with rule
"is a " and ( " - touches edge" OR " - 3D" )
is to use sequential regexp().
That way you know the "is a" part of your source has been split off, and you can then search for which of the two alternatives is present in your case.
Also see the 'NOT' exclusion class operators in regexp, and 'split' mode of regexp.
http://www.mathworks.com/help/matlab/ref/regexp.html
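A minimal sketch of that two-step idea, assuming the type never contains the separator " - " and the file name ends in ".csv". The variable names (rest1, parts1, type1, and so on) are only illustrative and do not come from the thread:

```matlab
ex1 = '10/0/leaf.nr.0 is a Projection error - touches edge - 3D points.csv';
ex2 = '10/2/leaf.nr.2 is a Projection error - 3D points.csv';

% Step 1: keep only the text between "is a " and the ".csv" extension.
rest1 = regexp(ex1, '(?<=is a ).*(?=\.csv)', 'match', 'once');
rest2 = regexp(ex2, '(?<=is a ).*(?=\.csv)', 'match', 'once');

% Step 2: split on the " - " separator; the first piece is the type.
parts1 = regexp(rest1, ' - ', 'split');
parts2 = regexp(rest2, ' - ', 'split');
type1 = parts1{1};
type2 = parts2{1};

% Optional: check whether the "touches edge" piece was present.
touches1 = any(strcmp(parts1, 'touches edge'));
touches2 = any(strcmp(parts2, 'touches edge'));
```

With the two example strings, type1 and type2 should both be 'Projection error', and only touches1 should be true.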
per isakson
on 9 Jul 2013
Edited: per isakson
on 9 Jul 2013
To extract the string between "is a " and the first " - ". This formulation is close to pseudo-code for the expression we are looking for.
ex1 = '10/0/leaf.nr.0 is a Projection error - touches edge - 3D points.csv';
ex2 = '10/2/leaf.nr.2 is a Projection error - 3D points.csv';
regexp( ex1, '(?<=is a )[^\-]+(?= \- )', 'match' )
regexp( ex2, '(?<=is a )[^\-]+(?= \- )', 'match' )
returns
ans =
'Projection error'
ans =
'Projection error'
Search the doc for "Lookaround Assertions" or just "Lookaround"; see "Lookahead Assertions in Regular Expressions".
PS. '\-' or just '-': one backslash (escape) too many seldom hurts, and I have trouble remembering when it's needed.
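One caveat, sketched under the assumption that a type name could itself contain a hyphen: the class [^\-] stops at any hyphen, so such a name would not match at all. The string ex3 below is made up for illustration and is not from the OP's data:

```matlab
% Hypothetical input: the type "Re-projection error" contains a hyphen.
ex3 = '10/3/leaf.nr.3 is a Re-projection error - 3D points.csv';

% [^\-]+ cannot cross the hyphen inside "Re-projection", so nothing matches.
a = regexp(ex3, '(?<=is a )[^\-]+(?= \- )', 'match', 'once');

% A lazy .+? may cross hyphens and stops at the first " - ".
b = regexp(ex3, '(?<=is a ).+?(?= \- )', 'match', 'once');
```

Here a comes back empty, while b is 'Re-projection error'. If hyphens can never occur in the type, the character-class version is fine as it stands.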
Or, following the OP's requirement exactly:
regexp( ex1, '(?<=is a ).+?(?= ((\- touches edge)|(\- 3D)))', 'match' )
regexp( ex2, '(?<=is a ).+?(?= ((\- touches edge)|(\- 3D)))', 'match' )
The extra parentheses, (), make the expression more readable - imo.
The "?" in ".+?" makes the quantifier lazy: it matches as few characters as necessary.
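Greediness is also why the OP's first pattern returned too much: the greedy (?<type>.*) takes as much as it can, and because the "touches edge" group is optional, the engine can leave that group unmatched and let the token keep the text. A sketch of the same named-token pattern with a lazy token follows. It is a variation that was not posted in the thread; it also escapes the dot before csv and drops the backslashes before '-', which are not needed outside a character class:

```matlab
ex1 = '10/0/leaf.nr.0 is a Projection error - touches edge - 3D points.csv';
ex2 = '10/2/leaf.nr.2 is a Projection error - 3D points.csv';

% (?<type>.*?) is lazy: it stops as soon as the rest of the pattern can match,
% so the optional " - touches edge" group gets the chance to consume that text.
pat = '.*is\sa\s(?<type>.*?)(?:\s-\stouches\sedge)?\s-\s3D.*\.csv';

n1 = regexp(ex1, pat, 'names');
n2 = regexp(ex2, pat, 'names');
type1 = n1.type;
type2 = n2.type;
```

Both type1 and type2 should be 'Projection error' for the two example strings.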