Perform Google Search in Matlab

43 views (last 30 days)
dsmalenb on 4 Jun 2019
Answered: DGM on 18 Sep 2024
Hi!
I am trying to figure out how to perform a Google search automatically in MATLAB and save the results in an array.
Say I wanted to save the paths to the PDF files returned by the query "site:www.cnn.com filetype:pdf".
Some answers in the list should then be:
...
I have seen some scripts (links below) but unfortunately they are outdated or simply do not work. I am guessing it may be possible to do this but I cannot seem to figure it out. Any assistance would be very welcome!
Links:
  3 Comments
dsmalenb on 4 Jun 2019
Joel,
Thank you for your response. Perhaps I am missing something significant, but after parsing through the HTML I tried to compare the parts so I could make the necessary changes. However, it does not seem that all the necessary parts of the link are available. I have included an example below, for the first article that the search displays.
We have:
  1. The file type is in GREEN
  2. The article's title is in YELLOW
  3. The parts of the link are in MAGENTA
I am missing "2004" and "01/23/" to complete the link. These parts do not seem to be listed in the HTML code.
Any idea how to get these pieces?
snippet.jpg
Joel Handy on 10 Jun 2019
After doing some more research, it looks like scraping (that's what we are doing, scraping Google's search results) is against their terms of service, and they actively attempt to thwart it. That would explain why some older tools are no longer maintained. I'm not a web expert. There appear to be ways of doing what you want, but I don't think any of them are simple.
Sorry I couldn't be more help.

Answers (3)

Monika Phadnis on 27 Jun 2019
I followed the example given at this link to extract data from the URL.
For the URL, I passed "http://www.google.com/search?q=cnn.com+filetype%3Apdf" to webread, based on the example you gave. This returns a string array of the href links, which you can then parse for the required links.
In my output, the strings starting with "/url" held the search result links.
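The filtering step described above could look roughly like the sketch below. The HTML string is a made-up sample rather than real Google output, and it assumes each result link has the form /url?q=<target>&...; the actual page markup may differ or change. In practice, html would come from webread.

```matlab
% Hypothetical sample of result-page HTML (NOT real Google output).
% In practice: html = webread('http://www.google.com/search?q=...');
html = ['<a href="/url?q=https://www.example.com/a.pdf&amp;sa=U">A</a>' ...
        '<a href="/search?q=other">B</a>' ...
        '<a href="/url?q=https://www.example.com/b.pdf&amp;sa=U">C</a>'];

% Pull every href value out of the page
hrefs = regexp(html, 'href="([^"]*)"', 'tokens');
hrefs = cellfun(@(c) c{1}, hrefs, 'UniformOutput', false);

% Keep only the entries that start with /url
isResult = strncmp(hrefs, '/url', 4);
hrefs = hrefs(isResult);

% Extract the target address from the q= parameter
% (assumes every remaining href actually contains q=)
targets = regexp(hrefs, '[?&]q=([^&]*)', 'tokens', 'once');
targets = cellfun(@(c) c{1}, targets, 'UniformOutput', false);

disp(targets(:))
```

The extracted addresses may still be percent-encoded, so a decoding step might be needed before using them.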

KARTIK GURNANI on 21 May 2020
This does seem true.
P.S.:
Microsoft introduced this feature on Bing, well before Google did, to prevent other search engines from copying its data (search results).
It seems we would be violating the TOS of both Google and Bing.
I tried it.
I got partial results.
The best possible way would be to use MATLAB to build a neural network that runs search queries from a system with a dynamic IP.
@AndrewNg might shed some better light on this.
There is a possible solution to this.
But the biggest issue of all:
Google and Bing (Microsoft) might label your IP address as spam or a bot.
That means no Netflix, no Hulu, no other streaming services.
You might even get locked out of reading news on certain websites.
Hell, even for simple web searches you might end up solving reCAPTCHA or the newer version, ImageCaptcha.
A dynamic IP will help in this case, but check with your ISP before attempting this.
You might lose the security, or your plan may get suspended.
>> It will take the ISP a lot of man-hours to get that single IP cleaned up, i.e. removed from the blacklists across most filters.
>> You would mostly increase their headache.
##
Note:
I have created a MATLAB script that can run your search query.
I am not sure about posting it here.
The limitation is that you can only run it for:
a single search query
It works, but crawling takes a while, followed by the use of PostScript to convert to PDF.
It works better when saving to an HTML file with images.
If anyone would like the script, please let me know.
The script is for educational purposes only.
Do not use it to violate the TOS of any organization.
Good luck & stay safe,
Kartik
  2 Comments
David Chen on 27 May 2020
Edited: David Chen on 27 May 2020
"If anyone would like the script, please let me know."
I want it.
Dwan Andrés Mahecha Vallejo on 17 Sep 2024
Please.

DGM on 18 Sep 2024
Here's a basic example. I'm pretty sure there are other ways of doing this, but the docs are a confusing maze. Last I checked, DDG's API wasn't even complete enough to be useful for anything.
% your query string
query = '+site:www.cnn.com banana';

% your google custom search key, etc
% https://developers.google.com/custom-search/v1/overview
% https://developers.google.com/custom-search/v1/introduction
% https://developers.google.com/custom-search/docs/tutorial/creatingcse
% free accounts are limited to 10 results per query, 100 queries per day
% there are also rate limits
apikey = 'your_key_goes_here'; % API key
cx = 'your_cx_goes_here'; % CSE identifier

% search setup
wopt = weboptions('contenttype','json');
url = ['https://customsearch.googleapis.com/customsearch/v1?cx=' cx '&key=' apikey '&q=' query '&num=10'];

% try to perform the search
try
    S = webread(url,wopt);
catch
    % this might also happen if API call is broken somehow
    fprintf('Connection error. Web search failed.\n')
    return;
end

% extract the urls
if isfield(S,'items')
    items = S.items;
    % depending on the results, items is either a struct array
    % or a cell array of dissimilar structs
    if isstruct(items)
        urllist = {items.link}.';
    else
        urllist = cellfun(@(x) x.link,items,'uniform',false);
    end
else
    fprintf('No results.\n')
    return;
end

urllist
urllist = 10x1 cell array
    {'https://www.cnn.com/2020/05/02/health/banana-bread-pandemic-baking-wellness-trnd/index.html'}
    {'https://www.cnn.com/2020/02/22/us/banana-label-collector-becky-martz-trnd/index.html'}
    {'https://www.cnn.com/2024/03/25/business/trader-joes-banana-price-increase/index.html'}
    {'https://www.cnn.com/style/article/student-eats-maurizio-cattelan-banana-art-south-korea-intl-hnk/index.html'}
    {'https://www.cnn.com/travel/article/banana-island-qatar/index.html'}
    {'https://www.cnn.com/style/article/david-datuna-banana-art-basel-trnd/index.html'}
    {'https://www.cnn.com/2016/10/25/health/banana-extinction/index.html'}
    {'https://www.cnn.com/2015/07/22/africa/banana-panama-disease/index.html'}
    {'https://www.cnn.com/style/article/banana-artwork-eaten-scli-intl/index.html'}
    {'https://www.cnn.com/2021/12/09/entertainment/the-masked-singer-reveal/index.html'}
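Since each call returns at most 10 results, one way to collect more is to page through them with the API's start parameter (the 1-based index of the first result to return). The sketch below only builds the request URLs. The query string, npages and perpage are illustrative values, not part of the example above, and urlencode is used so that spaces and colons in the query are escaped.

```matlab
% Placeholders, same meaning as in the example above
apikey = 'your_key_goes_here'; % API key
cx = 'your_cx_goes_here';      % CSE identifier

% Illustrative query taken from the original question
query = 'site:www.cnn.com filetype:pdf';

baseurl = 'https://customsearch.googleapis.com/customsearch/v1';
npages = 3;    % hypothetical number of pages to request
perpage = 10;  % results per request

% Build one request URL per page; start = 1, 11, 21, ...
urls = cell(npages,1);
for k = 1:npages
    startidx = (k-1)*perpage + 1;
    urls{k} = sprintf('%s?cx=%s&key=%s&q=%s&num=%d&start=%d', ...
        baseurl, cx, apikey, urlencode(query), perpage, startidx);
end

disp(urls)
```

Each URL would then be passed to webread(urls{k},wopt) as before, and the resulting link lists concatenated. Every page is a separate call, so it counts against the daily query limit.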

Release: R2019a
