フィルターのクリア

How to read a website and download pdf files

25 ビュー (過去 30 日間)
Joy Shen
Joy Shen 2023 年 4 月 3 日
回答済み: Voss 2023 年 8 月 23 日
Is there a way in matlab to essentially batch read a website and download the pdfs without having the specific pdf url?
For example, on the websites I'm looking through, each page has the following type of link where you download the pdf: https://restservice.epri.com/publicdownload/000000000001013457/0/Product
I just have an excel spreadsheet of all the URLs I want to look through, not the PDF file URLs.

回答 (2 件)

Saffan
Saffan 2023 年 4 月 5 日
It can be done using “webread” and “websave” functions.
Here is an example code snippet to extract all the URLs present in a particular webpage:
%extract entire source code of the page
html_text = webread(url);
%extracts URLs present in the source code
all_urls = regexp(html_text,'https?://[^"]+','match');
Once you have obtained the URLs of the downloadable PDFs, you can use the "websave" function to download them. Here is an example code snippet to demonstrate this:
websave('filename.pdf',pdf_url);
  1 件のコメント
Joy Shen
Joy Shen 2023 年 5 月 1 日
編集済み: Joy Shen 2023 年 5 月 1 日
How do I do a batch download though? For example, I have an excel spreadsheet of all the links and the names of the pdf title. I sorted through and stored all the links that match my string. How do I get webread and websave to open each of the links in my excel spreadsheet (which is a 64x1 table) and download the pdf? Especially when the link to the pdf doesn't seem to lead to a .pdf, it leads to a link like this: https://restservice.epri.com/publicdownload/000000000001013457/0/Product
URL = readtable('EPRI NMAC Repository.xlsx','Range','I2:K638'); % Load excel from specified sheet
substr = 'Nuclear Maintenance Applications Center';
name = table2array(URL(:,1));
selectedcol = contains(name,substr);
links = URL(:,3);
selectedlinks= links(selectedcol,:)
%extract entire source code of the page
html_text = webread(selectedlinks);

サインインしてコメントする。


Voss
Voss 2023 年 8 月 23 日
URL = readtable('EPRI NMAC Repository.xlsx','Range','I2:K638'); % Load excel from specified sheet
substr = 'Nuclear Maintenance Applications Center';
selectedrow = contains(URL{:,1},substr);
pdf_url = URL{selectedrow,3};
% make valid file names from the urls:
pdf_fn = regexprep(pdf_url,'[/\\.*<>|?"]','_');
pdf_fn = strcat(pdf_fn,'.pdf');
% download the pdf files:
for ii = 1:numel(pdf_url)
websave(pdf_fn{ii},pdf_url{ii});
end

カテゴリ

Help Center および File ExchangeDownloads についてさらに検索

製品


リリース

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by