How to only follow certain links with a MATLAB spider
Hi, I am struggling to allow only certain URLs to be followed when using a spider to build a web graph. Basically, I only want the spider to follow links that point to the uni server (shef.ac.uk); any other URLs need to be discarded, as opposed to the current state, where all links are followed. It is probably quite a simple fix.
U = cell(n,1);
hash = zeros(n,1);
L = logical(sparse(n,n));
m = 1;
U{m} = root;
hash(m) = hashfun(root);
for j = 1:n
    try
        disp(['open ' num2str(j) ' ' U{j}])
        page = urlread(U{j});
    catch
        disp(['fail ' num2str(j) ' ' U{j}])
        continue
    end
    for f = findstr('http:',page);
        e = min(findstr('"',page(f:end)));
        if isempty(e), continue, end
        url = deblank(page(f:f+e-2));
        url(url<' ') = '!';
        if url(end) == '/', url(end) = []; end
        skips = {'.gif','.jpg','.ico'};
        skip = any(url=='!') | any(url=='?');
        k=0;
        while ~skip && (k < length(skips))
            k = k+1;
            skip = ~isempty(findstr(url,skips{k}));
        end
        if skip
            if isempty(findstr(url,'.gif')) & isempty(findstr(url,'.jpg'))
                disp([' skip' url])
            end
            continue
        end
        i=0;
        for k = find(hash(1:m) == hashfun(url))';
            if isequal(U{k},url)
                i = k;
                break
            end
        end
        if (i == 0) & (m < n)
            m = m+1;
            U{m} = url;
            hash(m) = hashfun(url);
            i=m;
        end
        if i > 0
            disp([' link ' int2str(i) ' ' url])
            L(i,j) = 1;
        end
    end
end
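One possible way to restrict the spider to a single domain is to test the host part of each candidate URL before it is added to U. The following is only a sketch: isAllowedHost is a hypothetical helper (it is not part of the code above), and it assumes that links on subdomains such as www.shef.ac.uk should be followed as well.

```matlab
function tf = isAllowedHost(url, domain)
% isAllowedHost  True if the host part of url is domain or a subdomain of it.
% Hypothetical helper, not part of the original spider code.
    tf = false;
    % Capture the host: the text between '://' and the next '/', ':', '?' or '#'.
    tok = regexp(url, '^https?://([^/:?#]+)', 'tokens', 'once', 'ignorecase');
    if isempty(tok), return, end
    host = lower(tok{1});
    % Accept the domain itself or anything ending in '.domain'.
    pat = ['(^|\.)' regexptranslate('escape', lower(domain)) '$'];
    tf = ~isempty(regexp(host, pat, 'once'));
end
```

Saved as isAllowedHost.m on the path, it could be called in the inner loop right after the while loop that checks skips, for example with skip = skip || ~isAllowedHost(url, 'shef.ac.uk'); so that off-site URLs take the existing skip branch and are never added to U. Checking the host rather than searching the whole URL for 'shef.ac.uk' avoids accepting external pages that merely mention the domain in their path.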