Applying regexpi to csv file

4 views (last 30 days)
I'm learning about regexpi. I have a csv file:
State Cloudy Sunshine Rainy
Indiana 52 89 0
California 66 99 1
New York 78 95 1
Indiana 25 52 5
California 45 42 2
Indiana 22 26 5
Indiana 22 89 2
Using regexpi or if there is a better way: I need a data output for only Indiana and Rainy so:
Indiana_Rainy = 0,5,2
So I will be skipping over 2 data columns and 3 commas since it is a csv file.

Accepted Answer

Cedric Wannaz
Cedric Wannaz on 2 May 2013
Edited: Cedric Wannaz on 2 May 2013
So it is true that your CSV has not comma as delimiter?
The general approach is to use TEXTREAD, TEXTSCAN, FSCAN, etc when you have formatted data, and REGEXP/I when there is a lack of structure/format and you have to go for pattern matching.
If there is really no comma to separate columns in your CSV file, you are a bit in an intermediary situation, as the content looks really well structured, yet we cannot use the white-space as a separator/delimiter as some states names contain white-spaces.
For me, you are left with either pattern matching, or, if the structure of columns is such that there is always 12 characaters in the first column, 8 (7?) in the second, and 13 (?) in the third, you could use FGETL and split the content according to these boundaries.
I can develop both methods for you if you want an example.
Note that if you are interested in pattern matching and to spend ~ a day full time learning regular expressions (I'd say that within 3 hours full time, you would already have a good idea about regexp fundamentals), Mathworks wrote a very good section about them in "Programming Fundamentals" (p.2-26 to 2-86) available here:
Cedric Wannaz
Cedric Wannaz on 3 May 2013
Using REGEXPI, the simplest solution is probably the following:
>> content = fileread('myData.csv') ;
>> state = 'Indiana' ;
>> pattern = sprintf('(?<=%s(\\s+\\d+){2}\\s+)\\d+', state) ;
>> match = regexpi(content, pattern, 'match') ;
>> rain = str2double(match)
rain =
0 5 5 2
Here the pattern matches from 1 to as many numeric characters as possible \d+ preceded by (positive look behind: (?<=expression)) an expression made of the state name (literal) followed by twice (expression){2} the expression/pattern defined by \s+\d+, meaning from 1 to as many white space as possible and from 1 to as many numeric characters as possible, and again from 1 to as many white spaces as possible \s+. This is a way to match values in each column and the variable amount of white spaces that follow.
The \ character being a special character in the format spec of SPRINTF, it must be escaped with an additional \, which explains the \\. Patterns outputted by SPRINTF have, however, the following structure:
or equivalently (avoiding the repetition with {2}):
EDIT: as mentioned earlier, structured file content is usually read using TEXTREAD/SCAN, FSCANF, etc, so you should focus on these functions first. If there is some default in the structure (e.g. no comma and unable to use space as delimiter because a column contains spaces in the values), you can generally recover the situation using REGEXP/I. Regular expressions are The most powerful tool, available in most serious languages, for performing pattern matching/replacement in text. If you think that you will have a significant amount of pattern matching to perform in your field/career, it is really worth investing a day full time to get enough mastery so you at least know what resources they can provide to you. You won't become a regexp guru within a day, but I can guarantee that you would be able to understand a practice enough with most of the material that is explained in the document mentioned earlier.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!