How can I specify position and exclude repeated characters using regexp?
In searching gene sequence data, I want to find sequences that have the form NGGGNGGGN
where N is any combination of A, C, T, or G, in any order, of length 1-7. However, I do not want to find N with repeated G; for example, I don't want N = GG, AGGA, or AGGGA. N may include G, but it must not contain consecutive Gs like GG, and G must not be the first or last character of N, since that would extend the neighboring GGG.
I want to use something like expr = 'G{3}[ACTG]{1-7}(?!GG)G{3}' but MATLAB does not like this. I'm not very good with conditions in regexp, or regexp in general. Any help is appreciated.
Answers (1)
Nitin Khola
on 5 Nov 2015
Edited: Walter Roberson
on 6 Nov 2015
Thanks for providing a detailed question.
From what I understand, you are looking for sequences in which the only repeated pattern of G's is "GGG" at its expected positions. Any other occurrence of this pattern is unwanted. So I thought we could just use "strfind" http://www.mathworks.com/help/matlab/ref/strfind.html to look for the pattern "GGG". If you go through the documentation link I provided, you will notice that "strfind" returns the starting index of every occurrence of the pattern it is searching for. These indices help eliminate sequences that have, for example, "AGGGA" inside N.
So the idea is simple. First, run "strfind" on each string to locate the indices of the "GGG" pattern. Second, eliminate the sequences in which "GGG" starts at an index that is not allowed. For example, if the length of N is 6, only the indices 7 and 16 are valid.
You can even come up with a formula for the valid indices. Assuming all three N segments have the same length, the total sequence length is 3*(length of N) + 6, so length of N = (total sequence length - 6)/3. The valid starting indices for the "GGG" pattern are then (length of N + 1) and (2*length of N + 3 + 1). I apologize in advance if I have made any arithmetic errors in the example formulas above.
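As a rough illustration of the index check described above, here is a minimal sketch. The example sequence and the variable names (seq, nLen, expected, isValid) are made up for illustration, and the sketch assumes all three N segments have the same length.

```matlab
% Hypothetical example: N GGG N GGG N, with every N of length 6.
seq = 'ACTACTGGGTCATCAGGGCATCAT';

% strfind returns the starting index of every occurrence of 'GGG',
% including overlapping ones (e.g. 'GGGG' yields two indices).
idx = strfind(seq, 'GGG');

% Assumption: the three N segments are equally long, so the total
% length is 3*nLen + 6.
nLen = (numel(seq) - 6) / 3;

% Expected starting positions of the two GGG blocks.
expected = [nLen + 1, 2*nLen + 3 + 1];

% Accept the sequence only if GGG occurs at exactly those positions.
isValid = isequal(idx, expected);
```

Because "strfind" also reports overlapping matches, an N that begins or ends with G next to a "GGG" block produces an extra index and is rejected. Note that this check only catches runs of three or more G's; an isolated "GG" inside N would need an additional test.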
Also, you may need to loop through your data for this. If all of your data is stored in a cell array, you can take the shorter route of using "cellfun" http://www.mathworks.com/help/matlab/ref/cellfun.html.
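A sketch of the "cellfun" route might look like the following. The cell array seqs and the anonymous function isValidSeq are hypothetical names, the two sequences are invented test data, and the check again assumes equally long N segments.

```matlab
% Hypothetical data: the first sequence is well formed, the second
% hides an extra GGG inside the middle N segment.
seqs = {'ACTACTGGGTCATCAGGGCATCAT', 'ACTACTGGGTAGGGAGGGCATCAT'};

% Compare the GGG start indices against the expected positions
% (nLen + 1) and (2*nLen + 4), where nLen = (numel(s) - 6)/3.
isValidSeq = @(s) isequal(strfind(s, 'GGG'), ...
    [(numel(s) - 6)/3 + 1, 2*(numel(s) - 6)/3 + 4]);

% cellfun applies the check to every cell and returns a logical array.
valid = cellfun(isValidSeq, seqs);

% Keep only the sequences that passed.
goodSeqs = seqs(valid);
```

Here valid is a logical mask, so indexing seqs with it keeps the accepted sequences without an explicit loop.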
Have fun!