How to download multiple files from a website

Chad Greene on 21 Nov 2023
Commented: Dyuman Joshi on 22 Nov 2023
This question has been asked many times in various ways on this forum, but I've never found a simple answer to this very simple question: how do you download multiple files from a website?
It seems like there should be a two-line solution along the lines of:
url_list = get_urls('https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html','extension','.nc');
websave(url_list)
That assumes get_urls were a real function, and that websave accepted a list of file URLs and saved each file in the current directory.
3 Comments
Chad Greene on 21 Nov 2023
Wow, thank you @Dyuman Joshi!
Dyuman Joshi on 22 Nov 2023
You are welcome!

Accepted Answer

Voss on 21 Nov 2023
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';
% webread() the main page and parse out the links to .nc files
% (the dot in "\.nc" is escaped so that it matches a literal period):
data = webread(url);
C = regexp(data,'<a href=".*?(\?[^"]*\.nc)">','tokens');
temp_urls = strcat(url,vertcat(C{:}));
% webread() each linked url:
data = cell(size(temp_urls));
for ii = 1:numel(temp_urls)
    data{ii} = webread(temp_urls{ii});
end
% get the download link in each of those pages:
C = regexp(data,'<a href="([^"]*)">\s*<b>HTTPServer','tokens','once');
% append them to the (sub-)domain of the main URL to get the actual URLs
% for downloading the .nc files:
idx = find(url == '/',3);
nc_urls = strcat(url(1:idx(end)-1),vertcat(C{:}));
% construct file names to save to locally:
[~,filenames,ext] = fileparts(nc_urls);
filenames = strcat(filenames,ext);
% download all the files:
for ii = 1:numel(nc_urls)
    websave(filenames{ii},nc_urls{ii});
end
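If some downloads are slow or fail partway through, the final loop could be made a little more forgiving. The following is only a sketch under stated assumptions: the two example.com URLs, the output folder name, the 60-second timeout, and the dry_run flag are all made up for illustration, and in practice nc_urls would come from the code above.

```matlab
% Hypothetical inputs for illustration; in practice use the nc_urls built above.
nc_urls = {'https://www.example.com/thredds/fileServer/tiles/sample_A.nc', ...
           'https://www.example.com/thredds/fileServer/tiles/sample_B.nc'};
outdir  = fullfile(tempdir, 'nc_downloads');   % assumed output folder
dry_run = true;                                % set to false to actually download

% Build one local target path per URL from the last part of each URL.
[~, names, exts] = fileparts(nc_urls);
targets = fullfile(outdir, strcat(names, exts));

if ~dry_run && ~exist(outdir, 'dir')
    mkdir(outdir);
end

opts   = weboptions('Timeout', 60);            % seconds; 60 is an arbitrary choice
failed = {};                                   % URLs that could not be downloaded
for ii = 1:numel(nc_urls)
    if exist(targets{ii}, 'file')
        continue                               % skip files already on disk
    end
    if dry_run
        fprintf('Would save %s -> %s\n', nc_urls{ii}, targets{ii});
        continue
    end
    try
        websave(targets{ii}, nc_urls{ii}, opts);
    catch err
        % Record the failure and carry on with the remaining files.
        warning('Download failed for %s: %s', nc_urls{ii}, err.message);
        failed{end+1} = nc_urls{ii}; %#ok<AGROW>
    end
end
```

With dry_run set to true the loop only prints what it would do, which is a cheap way to check the file names before starting a long download. Any URLs left in failed can be retried by running the loop again, since files already saved are skipped.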
3 Comments
Voss on 21 Nov 2023
You're welcome!
Each link on the main page goes to a distinct intermediate page which contains the link to download the actual .nc file.
The first webread/regexp pair gets the URLs of those intermediate pages. The loop then calls webread on each intermediate page, and a second regexp searches the contents of all of them for the download URLs. The download URL is the one immediately preceding 'HTTPServer' on each intermediate page. There are several other URLs on those pages, and anchoring on 'HTTPServer' was the only way I could think of to be sure to get the right one.
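To see the two stages without touching the network, the same regular expressions can be run on small HTML fragments. The fragments below are made up to match the patterns and are not copied from the real pages, so treat this as a sketch of the idea rather than a description of the site's actual markup. The dot in \.nc is escaped here so that it matches a literal period.

```matlab
url = 'https://www.ngdc.noaa.gov/thredds/catalog/global/ETOPO2022/15s/15s_surface_elev_netcdf/catalog.html';

% Stage 1: made-up fragment standing in for the main catalog page.
catalog_html = ['<a href="catalog.html?dataset=demo/sample_tile_1.nc"><code>sample_tile_1.nc</code></a>' newline ...
                '<a href="catalog.html?dataset=demo/sample_tile_2.nc"><code>sample_tile_2.nc</code></a>'];
% Capture the query string of every link that ends in .nc.
C = regexp(catalog_html, '<a href=".*?(\?[^"]*\.nc)">', 'tokens');
% Appending each query string to the catalog URL gives the intermediate pages.
page_urls = strcat(url, vertcat(C{:}));

% Stage 2: made-up fragment standing in for one intermediate page,
% with two links of which only the second is labelled HTTPServer.
page_html = ['<a href="/thredds/dodsC/demo/sample_tile_1.nc.html"><b>OPENDAP</b></a>' newline ...
             '<a href="/thredds/fileServer/demo/sample_tile_1.nc"> <b>HTTPServer</b></a>'];
% Only the href that is immediately followed by the HTTPServer label is captured.
tok = regexp(page_html, '<a href="([^"]*)">\s*<b>HTTPServer', 'tokens', 'once');

% Keep everything before the third '/' of the main URL (scheme and host),
% then append the server-relative download path.
idx = find(url == '/', 3);
nc_url = [url(1:idx(end)-1) tok{1}];
disp(nc_url)
```

Here page_urls holds two intermediate-page URLs, and nc_url combines the host of the main URL with the path that precedes the HTTPServer label. The OPENDAP link is ignored because it is not followed by that label.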
Chad Greene on 22 Nov 2023
Ooh, okay, that makes a lot of sense. Thanks @Voss!

Release: R2023b
