NetCDF or HDF5 or XYZ to provide time series data at the fingertips of the user

Question

per isakson, 5 May 2012


https://jp.mathworks.com/matlabcentral/answers/37524-netcdf-or-hdf5-or-xyz-to-provide-time-series-data-at-the-fingertips-of-the-user

Edited: per isakson, 16 May 2015


Question: Have I done my homework well enough to choose HDF5 and stop thinking about alternatives?

One more question: What problems with HDF5 have I overlooked? Will I face unpleasant surprises?

Currently, I store time series data from building automation systems (BAS) in large structures, often named X, in mat-files. Each time series is stored in one field. I call these time series Qty. A typical X has 1000 fields and is 100 MB or larger. I have used that "format" for more than ten years. However, I am looking for something better.

Goals: The user of a visualization tool shall have a huge amount of time series data at their fingertips. It shall be possible to read and write the data files with non-MATLAB applications.

What I have done so far:

Experimented with and used a system based on 128 KB memmapfiles. Each time series is stored in a series of memmapfiles. Some metadata is embedded in the filename. It required too much coding and I failed to make it fast enough. Skipped!
Studied some FEX contributions: Waterloo File and Matrix Utilities; HDS-Toolbox (RNEL-DB); and ... . I share their description of the problem and the goal, but ... and a bit too smart for my capacity.
Googled for NetCDF and HDF; decided to try NetCDF; ran an experiment with MATLAB's high-level API (ncwrite, ncread, ...); experienced very poor performance or worse.
Searched FEX for NetCDF and HDF5. There are 21 and 13 hits, respectively.
Ran a performance test. I used a structure, X, with 1346 fields, each holding a <66528x1 double> time series. The total size of X is 0.7 GB. R2012a, Windows 7, 64-bit. The test included writing the data of the X-structure to the file in question (with X2hdf) and reading the data back into a structure (with hdf2X). The corresponding functions for NetCDF are nearly identical, with "h5" replaced by "nc". With NetCDF, I used the format netcdf4_classic and "'Dimensions', { 'qty', len_time }", i.e. a fixed and limited length.

    Execution time in seconds
    --------------------------------------
    Method              write       read       
    HDF5                32.6        2.8
    NetCDF(1)           inf         inf
    save,load(2)        24.4        7.3
    fwrite,fread(3)     3.8         1.3
    read_hdf (FEX)                  3.3
    read_netcdf (FEX)               8.1
    matfile(4)          74          196
    --------------------------------------

The result with NetCDF is strange. "inf" stands for two orders of magnitude longer than the corresponding values for HDF5. NetCDF uses ...
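The question does not show X2hdf or hdf2X, so the following is only a minimal sketch of what such a round trip might look like with the high-level HDF5 functions h5create, h5write, h5info and h5read. The field names, the tiny data size and the one-dataset-per-field layout are assumptions made for illustration; they are not the functions used in the test above.

```matlab
% Hypothetical stand-in for the X structure: one field per time series.
X = struct('qty_a', rand(100,1), 'qty_b', rand(100,1));

fname = [tempname, '.h5'];            % scratch file, assumed writable
names = fieldnames(X);

% Write: one dataset per field, stored at the root of the file.
for k = 1:numel(names)
    dset = ['/', names{k}];
    h5create(fname, dset, size(X.(names{k})));
    h5write(fname, dset, X.(names{k}));
end

% Read back: list the root datasets and rebuild a structure from them.
Y = struct();
info = h5info(fname);
for k = 1:numel(info.Datasets)
    name = info.Datasets(k).Name;
    Y.(name) = h5read(fname, ['/', name]);
end

delete(fname);
```

With this layout a single time series can also be read on its own with one h5read call, which is the access pattern the visualization goal asks for.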

6 comments (4 older comments hidden)

Sean de Wolski, 9 May 2012

That's why it is slow. Using a structure, the _entire_ structure has to be read into memory.

per isakson, 9 May 2012

Loading the structure to memory takes 7.3 seconds. However, that is not included in the test of matfile. The structure is loaded beforehand and passed to the function X2matfile.


Answer 1

Sean de Wolski, 7 May 2012


https://jp.mathworks.com/matlabcentral/answers/37524-netcdf-or-hdf5-or-xyz-to-provide-time-series-data-at-the-fingertips-of-the-user#answer_47018

Have you looked at the matfile class in newer MATLAB releases? It lets you access variables, and pieces of variables, in a mat-file (HDF5) without loading the whole file.

To be efficient, this would require creating many variables, i.e. each time series would be its own variable, and you could store the metadata in the variable name as you described above. I know this is typically frowned upon (a1, a2, ..., an), but it would give you quick and easy access to what you need.

Just a thought. I may be completely off base, and I apologize if I am.
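As a rough illustration of the matfile suggestion, here is a small sketch in which each time series is its own variable in a version 7.3 mat-file. The variable names qty_a and qty_b and the data sizes are made up for the example.

```matlab
fname = [tempname, '.mat'];           % scratch file, assumed writable

% Two hypothetical time series, each saved as its own variable.
qty_a = rand(1000,1);
qty_b = rand(1000,1);
save(fname, 'qty_a', 'qty_b', '-v7.3');   % v7.3 mat-files are HDF5-based

m = matfile(fname);

% Read only rows 101..200 of one series; matfile needs an index
% for every dimension, so the column index is given explicitly.
part = m.qty_a(101:200, 1);

% Query the size of a variable without loading it.
[nRows, nCols] = size(m, 'qty_b');

clear m
delete(fname);
```

Whether this is fast enough with a thousand or more variables per file is exactly what the timing table in the question puts in doubt, so it would need to be measured on real data.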

6 comments (4 older comments hidden)

Oleg Komarov, 8 May 2012

Or you can create an m-by-2 matrix in which you concatenate several time series vertically. Then store a master file that records the start and end of each time series. This is basically the approach I would also use with fread/fwrite.

Sean de Wolski, 8 May 2012

Yes, Oleg's approach would work well; pad with NaNs for values you don't have. Store the metadata in a separate mat-file or cell array.
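One possible reading of Oleg's fread/fwrite idea, as a sketch: the series are appended to a single binary file and a small index of start and end rows is kept alongside. The [time, value] column layout, the series lengths and the in-memory index (instead of a separate master file) are assumptions.

```matlab
% Three hypothetical time series, each an n-by-2 matrix [time, value].
series = {rand(50,2), rand(80,2), rand(30,2)};

% Index of first and last row of each series in the concatenated data.
lens   = cellfun(@(s) size(s,1), series);
stop   = cumsum(lens);
start  = stop - lens + 1;

% Write all series back to back, row by row (t1 v1 t2 v2 ...).
fname = [tempname, '.bin'];
fid = fopen(fname, 'w');
for k = 1:numel(series)
    fwrite(fid, series{k}.', 'double');
end
fclose(fid);

% Read only series k: skip the rows before it (2 doubles of 8 bytes per row).
k = 2;
fid = fopen(fname, 'r');
fseek(fid, (start(k)-1) * 2 * 8, 'bof');
data = fread(fid, [2, lens(k)], 'double').';
fclose(fid);

delete(fname);
```

In practice the start and stop vectors would be saved together with the metadata, for example in the separate mat-file Sean mentions.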


Answer 2

T., 16 January 2013


https://jp.mathworks.com/matlabcentral/answers/37524-netcdf-or-hdf5-or-xyz-to-provide-time-series-data-at-the-fingertips-of-the-user#answer_71305

Edited: T., 16 January 2013

I have also done a lot of experiments with the performance of netCDF within MATLAB. Some findings:

The MATLAB high-level functions ncread and ncwrite have some performance issues by design: every command requires MATLAB to read the header of the netCDF file in order to determine the command to pass to the low-level functions netcdf.getVar, netcdf.putVar, etc.
The time it takes to read the header of a netCDF file is much greater for netCDF4 (which is HDF5) than for netCDF3, as netCDF3 is much simpler. Also, the complexity of the header increases with the number of variables in a file; tens is usually workable, hundreds gives very poor performance.

So to improve netCDF performance, try using version 3 if you can. Otherwise, try calling the low level functions netcdf.xxx instead of the high level functions.

What matlab would need (IMHO) is a high level, built in, object oriented, function to deal with netCDF files. In that function the netCDF file stays open, and the header is cached.

Here is some example code to illustrate the problem:

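% Time ncread as the number of variables in the file grows, for both formats.
for fmt = {'classic','netcdf4'}   % "fmt" avoids shadowing the built-in "format"
    fprintf(1,'\nFormat = %s\n',fmt{:});
    if exist('test.nc','file')
        delete('test.nc')   % start each format with a fresh file
    end
    nVars = 100;
    for jj = 0:4
        fprintf(1,'\nvariables = %d\n',nVars * (jj+1));
        % Add another batch of nVars variables (400x1 each) to the file.
        for ii = (1:nVars)+nVars * jj
            nccreate('test.nc',sprintf('var%03.0f',ii),...
                'Dimensions',{'r' 400 'c' 1},...
                'Format',fmt{:});
        end
        % Write data to two variables of the new batch.
        for ii = (1:50:nVars)+nVars * jj
            ncwrite('test.nc',sprintf('var%03.0f',ii),reshape(peaks(20),[],1));
        end
        % Time reading those two variables back.
        for ii = (1:50:nVars)+nVars * jj
            tic
            ncread('test.nc',sprintf('var%03.0f',ii));
            toc
        end
    end
end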
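For comparison, the low-level route recommended above might look roughly like the sketch below: the file is opened once, several reads go through the same handle, and the file is closed at the end. The file name, the number of variables and the classic format are assumptions chosen to keep the example small, and no timings are claimed for it.

```matlab
fname = [tempname, '.nc'];            % scratch file, assumed writable

% Create a classic-format file with a few 400-element variables.
ncid  = netcdf.create(fname, netcdf.getConstant('CLOBBER'));
dimid = netcdf.defDim(ncid, 'r', 400);
nVars = 5;
varid = zeros(1, nVars);
for ii = 1:nVars
    varid(ii) = netcdf.defVar(ncid, sprintf('var%03d', ii), 'NC_DOUBLE', dimid);
end
netcdf.endDef(ncid);                  % leave define mode before writing

data = reshape(peaks(20), [], 1);
netcdf.putVar(ncid, varid(1), data);
netcdf.close(ncid);

% Open once, then read through the same handle as often as needed.
ncid = netcdf.open(fname, 'NC_NOWRITE');
id   = netcdf.inqVarID(ncid, 'var001');
tic
vals = netcdf.getVar(ncid, id);
toc
netcdf.close(ncid);

delete(fname);
```

Keeping ncid open between reads is the part that the high-level ncread cannot do, since it opens the file and parses the header on every call.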


Answer 3

Malcolm Lidierth, 3 March 2013


https://jp.mathworks.com/matlabcentral/answers/37524-netcdf-or-hdf5-or-xyz-to-provide-time-series-data-at-the-fingertips-of-the-user#answer_77175

@Per

I suspect some of the problems with memmapfile might be related to using multiple 128 KB memmapfiles. Each requires system resources. The Waterloo File Utilities grew out of the sigTOOL project, where I had a similar issue. In that case, each channel was represented by a memmapfile object, but there might be many hundreds of channels. The "trick" I used was to dynamically instantiate the memmapfile instances only on demand (not when the file was first accessed) and to destroy them when not needed. That has allowed sigTOOL users to work with files of many GB.
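The on-demand idea can be sketched as follows. This is not sigTOOL code; the file layout (channels stored one after another as doubles), the sizes and the variable names are assumptions.

```matlab
% Hypothetical binary file holding nChannels channels stored back to back.
fname = [tempname, '.bin'];
nSamples  = 1000;
nChannels = 4;
data = rand(nSamples, nChannels);
fid = fopen(fname, 'w');
fwrite(fid, data, 'double');          % column-major: channel 1, then 2, ...
fclose(fid);

% Map one channel only when it is requested.
ch = 3;
m = memmapfile(fname, ...
    'Offset', (ch-1) * nSamples * 8, ...        % 8 bytes per double
    'Format', {'double', [nSamples 1], 'x'}, ...
    'Repeat', 1);
segment = m.Data.x(101:200);

% Release the map as soon as the data are no longer needed.
clear m
delete(fname);
```

Only one map exists at a time here, so the system resources tied up do not grow with the number of channels.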

With an HDF5 file, you can still use memory mapping by retrieving the byte offset to your data if:

The data are not chunked
The data are not compressed

I believe this is a limitation of the API rather than of the file format. You could use external mechanisms to break up large data files into separate components, leaving HDF5 unaware of the "chunking", and apply external compression before writing the data.
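A sketch of the byte-offset approach, assuming a contiguous, uncompressed dataset of native doubles: the offset is queried with the low-level H5D.get_offset function and then handed to memmapfile. The dataset name /qty and the data are hypothetical, and the sketch assumes that the stored byte order matches the machine's.

```matlab
fname = [tempname, '.h5'];            % scratch file, assumed writable
x = (1:1000).';

% Without a ChunkSize argument, h5create makes a contiguous dataset.
h5create(fname, '/qty', size(x));
h5write(fname, '/qty', x);            % storage is allocated once data are written

% Ask the library where the raw data start in the file.
fid  = H5F.open(fname, 'H5F_ACC_RDONLY', 'H5P_DEFAULT');
dset = H5D.open(fid, '/qty');
offset = double(H5D.get_offset(dset));
H5D.close(dset);
H5F.close(fid);

% Map the raw bytes directly, bypassing the HDF5 library for reads.
m = memmapfile(fname, ...
    'Offset', offset, ...
    'Format', {'double', [numel(x) 1], 'x'}, ...
    'Repeat', 1);
y = m.Data.x;

clear m
delete(fname);
```

For a chunked or compressed dataset there is no single offset to map, which is the limitation listed above.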

My solution in the dev version of sigTOOL is to use a folder, not a file, for the data. Each folder has a few cross-referenced files, allowing me to mix *.mat, *.bin, *.hdf5, *.xml, etc. It's ugly perhaps, and raises sync issues, but it allows me to take advantage of the best format for different data sets without tying me to their limitations.

Regards ML
