How can I remove websites' links from a text?
I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.
Answers (2)
Iddo Weiner on 1 Feb 2017
Edited: Iddo Weiner on 1 Feb 2017
Dario, this really depends on what your data looks like, but I made an assumption about what your text might look like. Please check out the following method:
text = 'some words https:link some other words https:otherlink final words';
some words https:link some other words https:otherlink final words
text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); % this is where each link starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(last_del_idx):-1:1 % the loop works "backwards"
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    while true
        % a link ends at a space, at a backslash, or at the end of the text (guards against a link at the end of a line)
        at_end = next_idx > length(text_copy);
        if at_end || text_copy(next_idx) == ' ' || text_copy(next_idx) == '\'
            % delete a trailing space together with the link, but keep a backslash
            last_del_idx(i) = next_idx - (at_end || text_copy(next_idx) == '\');
            text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
            break % out of the while loop
        end
        next_idx = next_idx + 1;
    end
end
% let's see what we're left with
disp(text_copy)
some words some other words final words
Explanation: You might need to adjust a few things in the code, so here's the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link, so if you start at "https:" and stop when you reach a space (' '), you have found the full substring to delete. If that is not the case, you will need a different identifier for the end of a link, maybe '.com' or '/'; I can't know this for sure without seeing your data.
There is at least one edge case I could think of that could cause bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\', which is the first character of the \n that marks the beginning of a new line. So I added a condition to protect against this. Then again, your data may not have \n at the end of lines (for example, if line breaks are stored as an actual newline character, char(10), rather than the two characters '\' and 'n'), and then we'd have to think of a different identifier for these cases.
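If the end of a link is better described as "any whitespace" than as a space or a backslash, the same start-and-stop logic can be sketched with isspace, which also treats a real newline character as a stop. This is only a sketch under that assumption; the sample text and the names starts, keep, stop and cleaned are made up for illustration.

```matlab
% Sample text with a real newline character (char(10)) after the first link.
text = ['see https:link' char(10) 'next line https:otherlink'];

starts = strfind(text, 'https:');   % start index of every link
keep = true(size(text));            % logical mask: true = keep this character

for k = 1:numel(starts)
    stop = starts(k);
    % walk forward until whitespace or the end of the text
    while stop <= numel(text) && ~isspace(text(stop))
        stop = stop + 1;
    end
    keep(starts(k):stop-1) = false; % mark the link only; the whitespace stays
end

cleaned = text(keep);               % 'see ', a newline, then 'next line '
```

Because the characters are only marked and then removed in one step, no index shifts while the loop runs, so the order of traversal does not matter here.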
There are some principles I used here that might be a little confusing. Working with a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards so that deleting doesn't shift the indices you still need, which can cause all kinds of unwanted bugs.
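To see why the backward order matters, here is a small sketch on a made-up string (s, fwd and bwd are illustrative names only). The indices are computed once on the original string, so deleting from the front makes the later ones stale.

```matlab
s = 'abXXcdXXef';
idx = strfind(s, 'XX');          % [3 7], computed on the original string

% Forward deletion: after the first deletion the string is shorter,
% so index 7 no longer points at the second 'XX'.
fwd = s;
for k = 1:numel(idx)
    fwd(idx(k):idx(k)+1) = [];
end
% fwd is 'abcdXX' (wrong characters removed)

% Backward deletion: removing later matches first leaves earlier indices valid.
bwd = s;
for k = numel(idx):-1:1
    bwd(idx(k):idx(k)+1) = [];
end
% bwd is 'abcdef'
```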
I hope this helps
p.s. I worked here with strfind(), but you could substitute it with regular expression based functions, such as regexp() if you prefer. It's essentially the same in this case.
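As a rough sketch of the regular-expression route, regexprep can remove every link in one call if the pattern describes a link in general instead of one specific link. The pattern below is an assumption that matches the sample text: a link starts with 'https:' and runs until the next whitespace. The variable name cleaned is made up for illustration.

```matlab
text = 'some words https:link some other words https:otherlink final words';

% 'https:' followed by any run of non-whitespace characters (\S*),
% plus one optional trailing whitespace character (\s?) so no double space is left.
cleaned = regexprep(text, 'https:\S*\s?', '');

disp(cleaned)   % some words some other words final words
```

To replace each link with a space instead of deleting it, the pattern 'https:\S*' with the replacement ' ' should work in the same way, although the spaces that surrounded the link are then kept as well.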