How can I remove websites' links from a text?
I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.
1 Comment
Jan
on 1 Feb 2017
Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?
Answers (2)
Iddo Weiner
on 1 Feb 2017
Edited: Iddo Weiner
on 1 Feb 2017
Dario, this really depends on what your data looks like, but I made an assumption about what your text might look like. Please check out the following method:
text = 'some words https:link some other words https:otherlink final words';
disp(text)
some words https:link some other words https:otherlink final words
text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); %this is where the link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(last_del_idx):-1:1 % the loop works "backwards"
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    while true
        % stop at a space, at a backslash (guard against a link at the end of a line),
        % or when the end of the text is reached
        if next_idx > length(text_copy) || text_copy(next_idx) == ' ' || text_copy(next_idx) == '\'
            last_del_idx(i) = min(next_idx, length(text_copy));
            text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
            break % out of the while loop
        end
        next_idx = next_idx + 1;
    end
end
% let's see what we're left with
disp(text_copy)
some words some other words final words
Explanation: you might need to adjust a few things in the code, so here is the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link. So if you start scanning at "https:" and stop when you reach a space (' '), you have found the full extent of the substring to delete. If that is not the situation, you will need a different identifier for the end of a link, maybe '.com' or '/'; I can't know this for sure without seeing your data. There is at least one edge case that could cause bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\', the first character of a \n that marks the beginning of a new line (this applies when the text contains \n as two literal characters). So I added a condition to protect against this, but then again, your data may not have \n at the end of lines, and then we'd have to think of a different identifier for these cases.
There are some principles I highlighted here that might be a little confusing. Working on a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards, so that deleting doesn't shift the indices you still need, which can cause all kinds of unwanted bugs.
I hope this helps
P.S. I worked with strfind() here, but you could substitute regular-expression-based functions, such as regexp(), if you prefer. It's essentially the same in this case.
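For illustration, here is a minimal sketch of that substitution. It assumes, as above, that a link is 'https:' followed by a run of non-whitespace characters; the pattern 'https:\S*' and the variable names start_idx and end_idx are my own choices, not something taken from the question's data.

```matlab
% Sample text borrowed from the example above
text = 'some words https:link some other words https:otherlink final words';

% Assumed link pattern: 'https:' followed by any run of non-whitespace characters.
% regexp returns the first and last index of every match.
[start_idx, end_idx] = regexp(text, 'https:\S*', 'start', 'end');

text_copy = text; % work on a copy, as before
for i = numel(start_idx):-1:1 % delete backwards so earlier indices stay valid
    text_copy(start_idx(i):end_idx(i)) = [];
end

disp(text_copy)
```

Unlike the loop above, this version deletes only the link itself and not the space that follows it, so the spaces on both sides of each removed link remain in the result.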
Christopher Creutzig
on 2 Nov 2017
Based on your description, the following should work. It uses \S*, the regex notation for “arbitrarily many non-whitespace characters”:
regexprep(str,'https:\S*','')
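As a small usage sketch, the sample string below is borrowed from the other answer and is only an assumption about what the real text looks like. The second and third calls are variations of my own: one replaces each link with a space, as the question mentions, and one also removes the whitespace that follows a link.

```matlab
% Assumed sample text; substitute your own string here
str = 'some words https:link some other words https:otherlink final words';

% Delete each link; the spaces on either side of it are kept
removed = regexprep(str, 'https:\S*', '');

% Replace each link with a single space instead
spaced = regexprep(str, 'https:\S*', ' ');

% Delete each link together with any whitespace right after it
cleaned = regexprep(str, 'https:\S*\s*', '');

disp(cleaned)
```

Note that \s also matches newline characters, so the last variant would join two lines if a link sits at the very end of one of them.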