How do I find the original indices of a text array after adding new elements?

Art on 30 Sep 2025

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/2180281-how-do-i-find-the-original-indices-of-a-text-array-after-adding-new-elements

Commented: dpb on 3 Oct 2025

I'm working on a project that reads a text file and searches for user-defined phrases. At first glance this sounds rather easy, but due to complexities such as phrases wrapping across lines (with newlines between search words) or multiple spaces between words, I found I have to search in several different ways.

I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search. I keep a record of the indices of added spaces and removed newlines, all relative to the updated text array. I then map each index of the updated array to the index of the original text array using two loops (one for the added spaces, one for the removed newlines). Note that I first remove any newlines, then add spaces to create the updated search text. I believe the order is important.

I use regexp to search the updated text and return findings and starting indices. These starting indices of findings can then be easily mapped to the original text location, which is what I need to output.

My code works, but because of the loops, the initial index mapping takes a very long time for large text files (>20 min for a 1 MB file).

I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else.

Here are the relevant mapping loops. Note that SearchText, OriginalText, AddedSpaces and DeletedNewLines are inputs from the calling function.

SearchTextMap = 1:length(SearchText);
for spaceInd = 1:length(AddedSpaces)
    AddedSpaceInd = AddedSpaces(spaceInd);
    SearchTextMap(AddedSpaceInd:end) = SearchTextMap(AddedSpaceInd:end) - 1;
end
for newlineInd = 1:length(DeletedNewLines)
    DeletedNewLineInd = DeletedNewLines(newlineInd);
    SearchTextIndex = find(SearchTextMap == DeletedNewLineInd);
    SearchTextMap(SearchTextIndex:end) = SearchTextMap(SearchTextIndex:end)+1;
end

Any help would be greatly appreciated.
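
For reference, the first of the two loops (the one over AddedSpaces) subtracts 1 from every map entry at or after each added space, so entry i ends up as i minus the number of added spaces at or before i. That count is a cumulative sum, which suggests the following sketch. The sample SearchText and AddedSpaces values are made up for illustration, and the second loop (DeletedNewLines) is not covered here.

```matlab
% Hypothetical inputs: an 8-character search text with blanks added at 3 and 7.
SearchText  = 'ab  cd e';
AddedSpaces = [3 7];
n = numel(SearchText);

% Loop version, as in the question (first loop only).
LoopMap = 1:n;
for k = 1:numel(AddedSpaces)
    LoopMap(AddedSpaces(k):end) = LoopMap(AddedSpaces(k):end) - 1;
end

% Vectorized version: count the added spaces at each position, then take a
% running total and subtract it from the position index.
addedCount = accumarray(AddedSpaces(:), 1, [n 1]).';
VecMap     = (1:n) - cumsum(addedCount);
```

The loop rewrites the tail of the array once per added space, while the cumsum version touches each element a fixed number of times, so it should scale much better for large files.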

20 Comments

Art on 30 Sep 2025

My apologies, I'm developing this on a standalone computer so it's time consuming to copy code/examples here. But I think my question boils down to one specific item:

If I have an array: StartArray = [1 2 3 4 5 6 7 8 9];

and I have a set of indices that need to be added to the array: AddedSpaceInds = [1 6 12 13];

how do I insert the indices specified by AddedSpaceInds while simultaneously shifting the existing data by one index, resulting in a new array: [0 1 2 3 4 4 5 6 7 8 9 9 9].

Note that the value of the DATA at each added index is (the value of the next not-yet-used element of StartArray) - 1, or, if all elements of StartArray have already been used at that point, the value of StartArray(end).

I can accomplish this with the following loop, which builds a new array TextMapTemp from scratch, but it's time consuming.

Any ideas to make this loop more efficient or do this without a loop at all?

StartArray      = [1 2 3 4 5 6 7 8 9];
AddedSpaceInds  = [1 6 12 13];
% preallocate a temp array the full array size:
TextMapTemp     = zeros(1,length(StartArray)+length(AddedSpaceInds));
% loop through each temp index, determine its value based on the 
% value of StartArray and whether a space has been added at this index:
StartArrayInd   = 1;
for ind = 1:length(TextMapTemp)
    if ismember(ind, AddedSpaceInds)
        if StartArrayInd > length(StartArray)
            TextMapTemp(ind) = StartArray(end);
        else
            TextMapTemp(ind) = StartArray(StartArrayInd) - 1;
        end
    else
        TextMapTemp(ind) = StartArray(StartArrayInd);
        StartArrayInd    = StartArrayInd + 1;
    end
end 
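
One possible loop-free version of the loop above is sketched here. It assumes AddedSpaceInds holds unique indices that all lie within the length of the final array, as in the example; the helper names isAdded, usedSoFar, nextInd, pastEnd and vals are introduced only for this sketch.

```matlab
StartArray     = [1 2 3 4 5 6 7 8 9];
AddedSpaceInds = [1 6 12 13];
n = numel(StartArray) + numel(AddedSpaceInds);

% Logical mask of the positions that are added spaces.
isAdded = false(1, n);
isAdded(AddedSpaceInds) = true;

% Number of StartArray elements already placed at or before each position.
usedSoFar = cumsum(~isAdded);

TextMapTemp = zeros(1, n);
% Non-added positions take the StartArray values in order.
TextMapTemp(~isAdded) = StartArray;

% Added positions look at the next unused StartArray element.
nextInd = usedSoFar(isAdded) + 1;
pastEnd = nextInd > numel(StartArray);
vals = zeros(size(nextInd));
vals(~pastEnd) = StartArray(nextInd(~pastEnd)) - 1;   % next value minus 1
vals(pastEnd)  = StartArray(end);                     % StartArray exhausted
TextMapTemp(isAdded) = vals;
```

For the inputs shown this gives [0 1 2 3 4 4 5 6 7 8 9 9 9], the same result as the loop.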

Stephen23 on 30 Sep 2025

Edited: Stephen23 on 30 Sep 2025

"e.g. regexp only finds one instance of the word "test" in this text (due to the wildcards)"

I guess you mean the metacharacter square brackets.

In any case, if the user is capable of defining a regular expression like that then they must be aware that their pattern matches not just the literal "test" but also one non-letter character on each side of that literal text. So the match does not miss anything: once "_test_" is matched (where the underscore stands in for any non-letter) then the following "test" in your example does not have any leading non-letter character to match, because it has already been consumed by the first match. In short, it does not match because that is exactly what the user requested.

So it appears that you want to override what the user specifies: what are your specific requirements (given that they are not those of the user nor of regular expressions): how do you decide which parts of the user's regular expression to ignore? Which characters or types of characters? In what combinations?

Do the users really know how regular expressions work? Are they aware that their pattern will miss matching the first word? How badly-formed do you want to allow the patterns to be?

Why can the user not just use \< and \> ? Or lookarounds ? Then you could avoid all of this bother.
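
A small illustration of the difference, using a made-up text and the kind of pattern discussed above (the original example text is not shown in this thread, so txt and userPat here are assumptions):

```matlab
txt     = 'a test test b';              % hypothetical sample text
userPat = '[^a-zA-Z]test[^a-zA-Z]';     % consumes one non-letter on each side

% The blank between the two words is consumed by the first match,
% so the second "test" has no leading non-letter left to match.
consumedStart = regexp(txt, userPat);

% Word-boundary anchors match a position, not a character.
boundaryStart = regexp(txt, '\<test\>');

% Lookarounds test the neighbouring characters without consuming them.
lookStart = regexp(txt, '(?<![a-zA-Z])test(?![a-zA-Z])');
```

Here consumedStart is 2 (a single match that starts at the leading blank), while boundaryStart and lookStart are both [3 8]. Note that \< and \> use the regular-expression notion of a word character (letters, digits and underscore), which is not identical to [a-zA-Z], so the lookaround form is the closer replacement for the user's pattern.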

Would some user pattern hints and input restrictions be a viable route ? The task still seems rather ill-defined.

"Any ideas to make this loop more efficient or do this without a loop at all?"

I don't see any obvious changes that would significantly improve the efficiency.

Stephen23 on 1 Oct 2025

Edited: Stephen23 on 1 Oct 2025

What you are attempting to write here is a Do-What-I-Want-And-Not-What-I-Tell-You engine: to ignore what the user explicitly tells REGEXP to do and override it with something else. But this gives you two essentially unsolvable problems to solve: 1) unravel what the user explicitly requested, categorising parts of it as wanted and other parts as unwanted, followed by 2) perform the text matching based on what you believe they wanted...

In your example you essentially want to ignore the explicit request by the user to match and consume one non-letter at both ends of some literal text. But what happens if the user provides this pattern, where they explicitly match two non-letters at both ends:

pat = '[^a-zA-Z]{2}test[^a-zA-Z]{2}'

Would those get ignored too? What about three non-letters? Or ten? It seems to me that this task is ill-defined.

I suspect that restricting what patterns the user can input (either input checking or via some regex-creating tool as dpb suggested) might be a more robust approach.

dpb on 2 Oct 2025

Edited: dpb on 2 Oct 2025

"I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search...."

It should be possible to directly calculate the change in position from the original text location without any looping, or at least without physically rearranging the text in the loop.

For the first substitution/replacement of the newline with a blank, the position is the same if the initial text uses only the newline character so that is known.

If you remove redundant blanks, the length of the text is shortened by the number removed so the location of the newline within the original text is that number past the present.

The location of each substring past each removal is just one more than the prior up to the number in the original line; so the offset to the word after the removal location is the cumsum up to that point of the number removed.
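
A minimal sketch of that cumsum idea, under the assumption that the cleanup consists of replacing each newline with a blank and then dropping every blank that directly follows another blank. The sample text and the names txt, redundant, removedBefore and TextMap are made up for illustration.

```matlab
OriginalText = sprintf('find  this\nphrase   here');   % hypothetical input

% Step 1: newline -> blank. The length is unchanged, so positions are too.
txt = OriginalText;
txt(txt == char(10)) = ' ';

% Step 2: mark every blank that directly follows another blank.
isBlank   = (txt == ' ');
redundant = [false, isBlank(2:end) & isBlank(1:end-1)];
SearchText = txt(~redundant);

% Offset of each kept character = number of characters removed before it.
removedBefore = cumsum(redundant);
TextMap = (1:numel(SearchText)) + removedBefore(~redundant);

% Map a match in the cleaned text back to the original text.
startInd  = regexp(SearchText, 'this phrase');
origStart = TextMap(startInd);
```

In this example the match starts at position 6 of SearchText and maps back to position 7 of OriginalText. The same map can also be obtained directly as find(~redundant).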

While I'm sure the above would work, the better way would still be to make the search pattern ignore the amount of white space between words in the phrase unless you build in a way for the user to explicitly say to only match identical character strings. But if the idea is context despite formatting, the above should work reasonably simply.
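
As a rough sketch of the whitespace-tolerant alternative, assuming the user supplies a plain phrase rather than a regular expression: each word is escaped and the words are joined with \s+, which matches any run of blanks, tabs and newlines. The search then runs on the original text, so no index mapping is needed. The sample text and phrase are made up.

```matlab
OriginalText = sprintf('find  this\nphrase   here');   % hypothetical input
phrase       = 'this phrase';                          % hypothetical user phrase

% Split the phrase into words and escape any regexp metacharacters.
words = regexp(strtrim(phrase), '\s+', 'split');
pat   = regexptranslate('escape', words{1});
for k = 2:numel(words)
    % Allow any amount of whitespace, including newlines, between words.
    pat = [pat, '\s+', regexptranslate('escape', words{k})]; %#ok<AGROW>
end

% Start indices are already relative to OriginalText.
startInd = regexp(OriginalText, pat);
```

Here pat becomes 'this\s+phrase' and startInd is 7. The loop runs over the words of the phrase only, not over the text, so its cost is negligible.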

BTW, you may find <Run_Length> by @Jan from FEX useful in the rearrangement parsing. It would start with the input text as a char() vector.

Art on 3 Oct 2025

Ok, good points above. I looked into \< \> but I'm not sure they'd work for all user input cases, I'll check them out further.

After trying to understand how to compute the location offsets as stated above, I came up with this (after plumbing through some additional variables like OriginalText):

% Define an initial array the size of the original text array with each original index listed in order: 
OrigTextMap = 1:length(OriginalText);
% Remove any deleted newline indices:
OrigTextMap(DeletedNewLines) = [];
% Initialize the map array the length of the updated text array (after any newlines are removed and spaces added):
TextMap = zeros(size(SearchText));
% Set all TextMap indices that = added space indices to NaNs:
TextMap(AddedSpaces) = NaN;
% Set remaining indices in TextMap to the indices in OrigTextMap. 
NotNanInds  = ~isnan(TextMap); 
TextMap(NotNanInds) = OrigTextMap;

I believe this does what I need without the loops, I just have to account for the NaNs. Thanks for helping me walk through the logic!
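
A tiny worked example of this approach, with made-up inputs. It assumes DeletedNewLines holds positions in OriginalText and AddedSpaces holds positions in SearchText, which is how the code above uses them.

```matlab
OriginalText    = sprintf('ab\ncd');   % hypothetical 5-character text
DeletedNewLines = 3;                   % newline position in OriginalText
SearchText      = 'ab cd';             % newline removed, blank added
AddedSpaces     = 3;                   % blank position in SearchText

% Original indices that survive the newline removal.
OrigTextMap = 1:length(OriginalText);
OrigTextMap(DeletedNewLines) = [];

% Added blanks have no original index, so they are marked with NaN.
TextMap = zeros(size(SearchText));
TextMap(AddedSpaces) = NaN;
TextMap(~isnan(TextMap)) = OrigTextMap;

% Map a match start back to the original text.
startInd  = regexp(SearchText, 'cd');
origStart = TextMap(startInd);
```

TextMap comes out as [1 2 NaN 4 5], and the match at position 4 of SearchText maps to position 4 of OriginalText. A match that starts on an added blank would map to NaN and needs separate handling.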

dpb on 3 Oct 2025

So how much of a time reduction have you accomplished so far...inquiring minds, and all that! <g>

Answers (0)

Products

MATLAB

Release

R2021b
