Lookahead Assertions in Regular Expressions
Lookahead Assertions
There are two types of lookaround assertions for regular expressions: lookahead and lookbehind. In both cases, the assertion is a condition that must be satisfied to return a match to the expression.
A lookahead assertion has the form (?=test) and can appear anywhere in a regular expression. MATLAB® looks ahead of the current location in the text for the test condition. If MATLAB matches the test condition, it continues processing the rest of the expression to find a match.
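As a minimal sketch of the (?=test) form, the following finds numbers only when they are immediately followed by the unit px. The input text and the variable names are hypothetical and are not part of the original example set.

```matlab
% Hypothetical input text, chosen only for illustration.
txt = 'width 120px, height 80px, depth 5em';

% \d+ is the match expression; (?=px) is the lookahead test.
% The unit is checked but is not included in the returned match.
pxValues = regexp(txt,'\d+(?=px)','match');

disp(pxValues)   % cell array containing '120' and '80'
```

The value 5 is skipped because the text that follows it (em) fails the lookahead test.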
For example, look ahead in a character vector specifying a path to find the name of the folder that contains a program file (in this case, fileread.m).
chr = which('fileread')
chr = 'matlabroot\toolbox\matlab\iofun\fileread.m'
regexp(chr,'\w+(?=\\\w+\.[mp])','match')
ans = 1×1 cell array
    {'iofun'}
The match expression, \w+, searches for one or more alphanumeric or underscore characters. Each time regexp finds a term that matches this condition, it looks ahead for a backslash (specified with two backslashes, \\), followed by a file name (\w+) with an .m or .p extension (\.[mp]). The regexp function returns the match that satisfies the lookahead condition, which is the folder name iofun.
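To see what the lookahead contributes, the sketch below runs the same pattern with and without the (?=...) wrapper. It uses a hard-coded, hypothetical Windows-style path instead of the output of which, so the result does not depend on a local installation.

```matlab
% Hypothetical path, hard-coded so the example is reproducible.
p = 'C:\work\toolbox\iofun\fileread.m';

% With lookahead: the file name is tested but not returned.
folderOnly = regexp(p,'\w+(?=\\\w+\.[mp])','match');

% Without lookahead: the file name becomes part of the match.
folderAndFile = regexp(p,'\w+\\\w+\.[mp]','match');

disp(folderOnly)      % 'iofun'
disp(folderAndFile)   % 'iofun\fileread.m'
```

Both patterns locate the same position in the text. The only difference is whether the characters that satisfy the test are included in the result.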
Overlapping Matches
Lookahead assertions do not consume any characters in the text. As a result, you can use them to find overlapping character sequences.
For example, use lookahead to find every sequence of six nonwhitespace characters in a character vector by matching initial characters that precede five additional characters:
chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S(?=\S{5})')
startIndex = 1 8 9 16 17 24 25
The starting indices correspond to these phrases:
Locate severa everal 6-char -char. phrase hrases
Without the lookahead operator, MATLAB parses a character vector from left to right, consuming the vector as it goes. If matching characters are found, the function records the location and resumes parsing the character vector from the end of the most recent match. There is no overlapping of characters in this process.
chr = 'Locate several 6-char. phrases';
startIndex = regexpi(chr,'\S{6}')
startIndex = 1 8 16 24
The starting indices correspond to these phrases:
Locate severa 6-char phrase
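One possible way to recover the overlapping phrases themselves is to index back into the text from each starting index. This is only a sketch: the use of arrayfun and the variable name phrases are choices made for illustration.

```matlab
chr = 'Locate several 6-char. phrases';

% Start of every run of six nonwhitespace characters (overlaps allowed).
startIndex = regexp(chr,'\S(?=\S{5})');

% Each phrase spans the matched character plus the five that follow it.
phrases = arrayfun(@(k) chr(k:k+5), startIndex, 'UniformOutput', false);

disp(phrases)
% 'Locate' 'severa' 'everal' '6-char' '-char.' 'phrase' 'hrases'
```

Because the lookahead guarantees that five more nonwhitespace characters follow each start index, the range k:k+5 never runs past the end of the text.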
Logical AND Conditions
Another way to use a lookahead operation is to perform a logical AND between two conditions. This example initially attempts to locate all lowercase consonants in a character array consisting of the first 50 characters of the help for the normest function:
helptext = help('normest');
chr = helptext(1:50)
chr = ' NORMEST Estimate the matrix 2-norm. NORMEST(S'
Merely searching for non-vowels ([^aeiou]) does not return the expected answer, as the output includes capital letters, space characters, and punctuation:
c = regexp(chr,'[^aeiou]','match')
c = 1×43 cell array
  Columns 1 through 14
    {' '} {'N'} {'O'} {'R'} {'M'} {'E'} {'S'} {'T'} {' '} {'E'} {'s'} {'t'} {'m'} {'t'}
  Columns 15 through 28
    {' '} {'t'} {'h'} {' '} {'m'} {'t'} {'r'} {'x'} {' '} {'2'} {'-'} {'n'} {'r'} {'m'}
  Columns 29 through 42
    {'.'} {'↵'} {' '} {' '} {' '} {' '} {'N'} {'O'} {'R'} {'M'} {'E'} {'S'} {'T'} {'('}
  Column 43
    {'S'}
Try this again, using a lookahead operator to create the following AND condition:
(lowercase letter) AND (not a vowel)
This time, the result is correct:
c = regexp(chr,'(?=[a-z])[^aeiou]','match')
c = 1×13 cell array
    {'s'} {'t'} {'m'} {'t'} {'t'} {'h'} {'m'} {'t'} {'r'} {'x'} {'n'} {'r'} {'m'}
Note that when using a lookahead operator to perform an AND, you need to place the match expression expr after the test expression test:
(?=test)expr or (?!test)expr
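The (?!test)expr form succeeds where the test does not match. As a sketch, the same consonant search can be written either way. The shortened input text and the variable names below are assumptions made to keep the example self-contained.

```matlab
% Shortened, hard-coded text so the example does not depend on help output.
chr = 'Estimate the matrix 2-norm.';

% (lowercase letter) AND (not a vowel), using a positive lookahead.
viaPositive = regexp(chr,'(?=[a-z])[^aeiou]','match');

% (not a vowel) AND (lowercase letter), using a negative lookahead.
viaNegative = regexp(chr,'(?![aeiou])[a-z]','match');

disp(isequal(viaPositive,viaNegative))   % 1 (true)
```

In both patterns the test comes first and consumes nothing, so the match expression that follows is evaluated at the same character.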