Applying regexpi to csv file

Question

Amanda on 2 May 2013


I'm learning about regexpi. I have a csv file:

State      Cloudy Sunshine  Rainy
Indiana      52      89          0
California  66      99          1
New York    78      95          1
Indiana      25      52          5
California  45      42          2
Indiana      22      26          5
Indiana      22      89          2

Using regexpi or if there is a better way: I need a data output for only Indiana and Rainy so:

Indiana_Rainy = 0,5,5,2

So I will be skipping over 2 data columns and 3 commas since it is a csv file.

Thanks,

Amanda


Answer 1

Cedric on 2 May 2013


Edited: Cedric on 2 May 2013

So is it true that your CSV file does not use commas as delimiters?

The general approach is to use TEXTREAD, TEXTSCAN, FSCANF, etc. when you have formatted data, and REGEXP/REGEXPI when the data lacks structure and you have to resort to pattern matching.

If there really is no comma separating the columns in your CSV file, you are in a somewhat intermediate situation: the content looks well structured, yet we cannot use white space as a delimiter because some state names contain spaces.

As I see it, you are left with either pattern matching or, if the columns have fixed widths (e.g. always 12 characters in the first column, 8 (7?) in the second, and 13 (?) in the third), reading each line with FGETL and splitting it at these boundaries.

I can develop both methods for you if you want an example.
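For the fixed-width option, a minimal sketch could look like the following. It assumes that the state name is padded to exactly 12 characters (the widths are a guess, as noted above). The temporary file and all variable names are made up for the illustration.

```matlab
% Write a small fixed-width sample file (hypothetical layout: the state
% name is padded to 12 characters, followed by three integer columns).
nameWidth = 12 ;
fname = [tempname '.txt'] ;
data = {'Indiana',52,89,0 ; 'California',66,99,1 ; 'New York',78,95,1 ; ...
        'Indiana',25,52,5 ; 'California',45,42,2 ; 'Indiana',22,26,5 ; ...
        'Indiana',22,89,2} ;
fid = fopen(fname, 'w') ;
fprintf(fid, '%-12s%-8s%-13s%s\n', 'State', 'Cloudy', 'Sunshine', 'Rainy') ;
for k = 1 : size(data, 1)
    fprintf(fid, '%-12s%-8d%-13d%d\n', data{k,:}) ;
end
fclose(fid) ;

% Read the file back line by line and split each line at the assumed boundary.
fid = fopen(fname, 'r') ;
fgetl(fid) ;                                    % skip the header line
state = {} ;
values = [] ;
tline = fgetl(fid) ;
while ischar(tline)                             % FGETL returns -1 at end of file
    state{end+1,1} = strtrim(tline(1:nameWidth)) ;            % state name, padding removed
    values(end+1,:) = sscanf(tline(nameWidth+1:end), '%d').' ; % three numeric columns
    tline = fgetl(fid) ;
end
fclose(fid) ;
delete(fname) ;

rain = values(strcmpi(state, 'Indiana'), 3) ;   % third numeric column = Rainy
```

Only the first boundary is needed here, because everything after the state name is numeric and can be parsed by SSCANF regardless of spacing.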

Note that if you are interested in pattern matching and willing to spend about a full day learning regular expressions (I'd say that after 3 hours you would already have a good grasp of the fundamentals), MathWorks wrote a very good section about them in "Programming Fundamentals" (p. 2-26 to 2-86), available here: http://www.mathworks.com/help/pdf_doc/allpdf.html.

4 Comments

Cedric on 2 May 2013

Edited: Cedric on 3 May 2013


I missed the part where you mention the commas in your question. So you have commas in the file? If your file looks like

 State,Cloudy,Sunshine,Rainy
 Indiana,52,89,0
 California,66,99,1
 New York,78,95,1
 Indiana,25,52,5
 California,45,42,2
 Indiana,22,26,5
 Indiana,22,89,2

and not like the listing in your question, you can simply use a formatted read of some sort, e.g.:

 >> [state,cloud,sunshine,rain] = textread('myData.csv', ...
            '%s %d %d %d', 'delimiter', ',', 'headerlines', 1)
 state = 
    'Indiana'
    'California'
    'New York'
    'Indiana'
    'California'
    'Indiana'
    'Indiana'
 cloud =
    52
    66
    78
    25
    45
    22
    22
 sunshine =
    89
    99
    95
    52
    42
    26
    89
 rain =
     0
     1
     1
     5
     2
     5
     2

Then use STRCMPI to find relevant indices in state:

>> id = strcmpi(state, 'Indiana') ;

which you can use to get the relevant rain data:

 >> rain(id)
 ans =
     0
     5
     5
     2

If you don't have commas, maybe you have tab characters instead and you can achieve the same thing using

 >> [state,cloud,sunshine,rain] = textread('myData.csv', ...
            '%s %d %d %d', 'delimiter', '\t', 'headerlines', 1)

If neither works because you have neither commas nor tabs, we can discuss my first proposals.
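As a side note, recent MATLAB documentation flags TEXTREAD as not recommended and points to TEXTSCAN instead. Here is a minimal sketch of the same read with TEXTSCAN, assuming a comma-delimited file. The temporary file is created only to keep the example self-contained, and the variable names are illustrative.

```matlab
% Create a small comma-delimited sample file (same data as above).
fname = [tempname '.csv'] ;
fid = fopen(fname, 'w') ;
fprintf(fid, '%s\n', 'State,Cloudy,Sunshine,Rainy', 'Indiana,52,89,0', ...
    'California,66,99,1', 'New York,78,95,1', 'Indiana,25,52,5', ...
    'California,45,42,2', 'Indiana,22,26,5', 'Indiana,22,89,2') ;
fclose(fid) ;

% TEXTSCAN works on a file identifier and returns one cell per column.
fid = fopen(fname, 'r') ;
C = textscan(fid, '%s %d %d %d', 'Delimiter', ',', 'HeaderLines', 1) ;
fclose(fid) ;
delete(fname) ;

state = C{1} ;                                   % cell array of state names
rain = C{4} ;                                    % int32 column of Rainy values
indianaRain = rain(strcmpi(state, 'Indiana')) ;  % keep only the Indiana rows
```

The filtering step with STRCMPI is unchanged. Only the way the columns come back (one cell array instead of separate outputs) differs.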

Amanda on 3 May 2013

Thank you so much. I am learning a lot, and I will build upon this knowledge.

Cedric on 3 May 2013

Edited: Cedric on 3 May 2013


Using REGEXPI, the simplest solution is probably the following:

 >> content = fileread('myData.csv') ;
 >> state = 'Indiana' ; 
 >> pattern = sprintf('(?<=%s(\\s+\\d+){2}\\s+)\\d+', state) ; 
 >> match = regexpi(content, pattern, 'match') ; 
 >> rain = str2double(match)
 rain =
     0     5     5     2

Here the pattern matches one or more digits (\d+) that are preceded by (positive lookbehind: (?<=expression)) the literal state name, followed by two repetitions ((expression){2}) of \s+\d+, i.e. one or more white-space characters and then one or more digits, followed again by one or more white-space characters (\s+). This skips the values in the first two numeric columns, along with the variable amount of white space around them, so that only the value in the Rainy column is returned.

Because \ is a special character in the format specification of SPRINTF, it must be escaped with an additional \, which explains the \\. The pattern that SPRINTF outputs is therefore:

'(?<=Indiana(\s+\d+){2}\s+)\d+'

or equivalently (avoiding the repetition with {2}):

'(?<=Indiana\s+\d+\s+\d+\s+)\d+'
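A variant of the same idea, as a sketch: instead of a lookbehind, the pattern can match the whole record and capture only the last number as a token. The sample text below is typed in by hand to mimic the white-space-separated listing from the question, and the variable names are just for illustration.

```matlab
% Sample content, standing in for fileread('myData.csv').
content = sprintf('%s\n', ...
    'State      Cloudy Sunshine  Rainy', ...
    'Indiana      52      89          0', ...
    'California  66      99          1', ...
    'New York    78      95          1', ...
    'Indiana      25      52          5', ...
    'California  45      42          2', ...
    'Indiana      22      26          5', ...
    'Indiana      22      89          2') ;

state = 'Indiana' ;
% (?:...) groups without capturing; the final (\d+) is the only token.
pattern = sprintf('%s(?:\\s+\\d+){2}\\s+(\\d+)', state) ;
tokens = regexpi(content, pattern, 'tokens') ;

% Each match yields a 1x1 cell holding the captured text; convert to numbers.
rain = cellfun(@(t) str2double(t{1}), tokens) ;
```

With the 'tokens' option each match is returned as a cell of captured substrings, so the state name and the two skipped columns are matched but never end up in the output.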

EDIT: as mentioned earlier, structured file content is usually read using TEXTREAD/TEXTSCAN, FSCANF, etc., so you should focus on these functions first. If there is some defect in the structure (e.g. no comma, and spaces cannot serve as delimiters because a column contains spaces in its values), you can generally recover the situation using REGEXP/REGEXPI. Regular expressions are the most powerful tool, available in most serious languages, for performing pattern matching/replacement in text. If you think that you will have a significant amount of pattern matching to do in your field/career, it is really worth investing a full day to gain enough mastery that you at least know what they can do for you. You won't become a regexp guru within a day, but I can guarantee that you would be able to understand and practice with most of the material explained in the document mentioned earlier.
