How do I find the original indices of a text array after adding new elements?

I'm working on a project that reads a text file and searches it for user-defined phrases. At first glance this sounds rather easy, but because phrases can wrap across lines (with newlines between the search words) or contain multiple spaces between words, I found I have to search in several different ways.
I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search. I keep a record of the indices of added spaces and removed newlines, all relative to the updated text array. I then map each index of the updated array to the index of the original text array using two loops (one for the added spaces, one for the removed newlines). Note that I first remove any newlines, then add spaces to create the updated search text. I believe the order is important.
I use regexp to search the updated text and return findings and starting indices. These starting indices of findings can then be easily mapped to the original text location, which is what I need to output.
My code works, but because of the loops, the initial index mapping takes a very long time for large text files (more than 20 minutes for a 1 MB file).
I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else.
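Here are the relevant mapping loops. Note that SearchText, OriginalText, AddedSpaces and DeletedNewLines are inputs from the calling function.
% Start from the identity map: position k in SearchText maps to k.
SearchTextMap = 1:length(SearchText);
% Each added space shifts every position from that space onward back by one.
for spaceInd = 1:length(AddedSpaces)
    AddedSpaceInd = AddedSpaces(spaceInd);
    SearchTextMap(AddedSpaceInd:end) = SearchTextMap(AddedSpaceInd:end) - 1;
end
% Each deleted newline shifts every later position forward by one.
for newlineInd = 1:length(DeletedNewLines)
    DeletedNewLineInd = DeletedNewLines(newlineInd);
    % Only the first match is needed; the colon below uses just that element.
    SearchTextIndex = find(SearchTextMap == DeletedNewLineInd, 1);
    SearchTextMap(SearchTextIndex:end) = SearchTextMap(SearchTextIndex:end) + 1;
end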
Any help would be greatly appreciated.
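For reference, the two loops above can probably be replaced by a cumulative sum and one indexing step. This is only a sketch under assumptions that are not confirmed in the question: AddedSpaces holds sorted, unique positions in SearchText and never contains 1, and DeletedNewLines holds sorted, unique positions in OriginalText (which is how the posted loop effectively uses those values when it compares them against the map). The toy inputs and the names spaceCount, withoutSpaces and keptInOriginal are made up for illustration.

```matlab
% Toy inputs (hypothetical): 'ab<newline>cd' becomes 'ab cd' once the
% newline is removed and a space is added at position 3.
OriginalText    = sprintf('ab\ncd');
SearchText      = 'ab cd';
AddedSpaces     = 3;   % position in SearchText (assumed sorted, unique, not 1)
DeletedNewLines = 3;   % position in OriginalText (assumed sorted, unique)

n = numel(SearchText);

% Step 1: undo the added spaces. accumarray marks each added position and
% cumsum turns the marks into "number of spaces added at or before k".
spaceCount    = cumsum(accumarray(AddedSpaces(:), 1, [n 1])).';
withoutSpaces = (1:n) - spaceCount;

% Step 2: undo the deleted newlines. keptInOriginal(k) is the original
% position of the k-th character that survived the newline removal.
keptInOriginal = setdiff(1:numel(OriginalText), DeletedNewLines);
SearchTextMap  = keptInOriginal(withoutSpaces);

disp(SearchTextMap)   % 1 2 2 4 5
```

On this toy input the result matches what the loops produce. Both steps are single passes over the data, so the cost no longer grows with the number of added spaces or deleted newlines multiplied by the text length.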
  14 Comments
Stephen23 about 6 hours ago
Edited: Stephen23 about 3 hours ago
What you are attempting to write here is a Do-What-I-Want-And-Not-What-I-Tell-You engine: one that ignores what the user explicitly tells REGEXP to do and overrides it with something else. But this gives you two essentially unsolvable problems: 1) unravel what the user explicitly requested, categorising parts of it as wanted and other parts as unwanted, and then 2) perform the text matching based on what you believe they wanted...
In your example you essentially want to ignore the user's explicit request to match and consume one non-letter at both ends of some literal text. But what happens if the user provides this pattern, which explicitly matches two non-letters at both ends:
pat = '[^a-zA-Z]{2}test[^a-zA-Z]{2}'
Would those get ignored too? What about three non-letters? Or ten? It seems to me that this task is ill-defined.
I suspect that restricting what patterns the user can input (either input checking or via some regex-creating tool as dpb suggested) might be a more robust approach.
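As a rough illustration of the input-checking idea, the sketch below accepts only plain phrases made of letters, digits and spaces, and rejects anything containing regular-expression syntax. The whitelist and the name isPlainPhrase are hypothetical; a real tool would need whatever character set its users actually require.

```matlab
% Hypothetical whitelist: words of letters/digits separated by spaces.
isPlainPhrase = @(p) ~isempty(regexp(p, '^[A-Za-z0-9]+( +[A-Za-z0-9]+)*$', 'once'));

okPlain = isPlainPhrase('quick brown fox');               % accepted
okRegex = isPlainPhrase('[^a-zA-Z]{2}test[^a-zA-Z]{2}');  % rejected

fprintf('plain phrase accepted: %d, regex pattern accepted: %d\n', okPlain, okRegex);
```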
dpb about 16 hours ago
Edited: dpb about 12 hours ago
"...doing the initial index mapping takes a very long time for large text files"
As Knuth pointed out during a colloquium at ORNL many years ago, when asked a similar question about a particular problem a lab employee had, the most time-saving code construct is to not do what isn't needed.
@Stephen23 has a very good characterization of the problem. In concert with that, if one is adamant about doing a more general search that tries to account for possible irregularities in the text being searched, it would be far more efficient to change the search pattern rather than modify the large text.
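A minimal sketch of that idea, assuming the user supplies a plain phrase rather than a regular expression: split the phrase into words, escape each word, and join the words with \s+ so that any run of spaces, tabs or newlines between them is tolerated. The search then runs on the unmodified text, so the start indices already refer to the original and no index map is needed. The sample text and variable names here are invented for illustration.

```matlab
% Hypothetical sample: the phrase wraps across a line and has extra spaces.
OriginalText = sprintf('see the quick\nbrown   fox run');
phrase       = 'quick brown fox';

% Split on whitespace, then escape any regexp metacharacters in each word.
words = strsplit(strtrim(phrase));
words = cellfun(@(w) regexptranslate('escape', w), words, 'UniformOutput', false);

% strjoin interprets escape sequences in the delimiter, so '\\s+' is
% inserted as the two-character regexp token \s followed by +.
pat = strjoin(words, '\\s+');

% Search the original text directly; startIdx needs no remapping.
[startIdx, found] = regexp(OriginalText, pat, 'start', 'match');

disp(pat)        % quick\s+brown\s+fox
disp(startIdx)   % 9
```

Here the match starts at position 9 of the original text even though the phrase spans a newline and three consecutive spaces.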

Answers (0)

Release

R2021b
