MATLAB Answers

Access to data without reading all of it

Asked by Filip on 18 May 2019
Latest activity: edited by per isakson on 22 May 2019
Hi all,
I have two data sets. One is 10 GB and the other one is 2 TB. They are both txt files.
Say that in the small data set I have the variables timestamp, ID and x. In the big one, I have timestamp, ID and y. The numbers of unique IDs and timestamps are much higher in the big data set.
For each observation in the small data, I want to find the row with the same millisecond and ID in the big data and then copy the value of y to the small data.
Is it possible to find corresponding rows without reading 2TB of data?
Thanks

  5 Comments

per isakson
20 May 2019
Now I regard your two data sets as tables and think that you are looking for code to make an inner join (SQL lingo). I can think of a few different approaches:
  • Read the data sets into a relational database and use SQL. Pro: conceptually simple. Con: relational databases are typically not efficient with time series. (No Matlab.)
  • Loop over all the lines of the 10GB-dataset and search the 2TB-dataset. The search must be fast: binary search or, maybe better, interpolation search. In the File Exchange there is Fast Binary Search ("Search algorithm to search for values in a sorted array. Written in C (.mex) for speed.") and more, but no interpolation-search code. The search of the 2TB-dataset could be based on fgetl() together with fseek(), or on memmapfile().
  • Use the data type table together with the method innerjoin(). innerjoin() is implemented in m-code and I have no idea about its performance.
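For illustration, a minimal sketch of that third approach, assuming both files fit in memory as tables and that the key variables are named Tsec and ID (the file names and variable names are assumptions):
d1 = readtable( 'small.txt' );                          % timestamp (Tsec), ID, x
d2 = readtable( 'big.txt' );                            % timestamp (Tsec), ID, y
joined = innerjoin( d1, d2, 'Keys', {'Tsec','ID'} );    % keeps rows with matching Tsec and ID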
Questions:
  • Are you looking for a code to use once and forget?
  • Are the time series of the separate text files of the 2TB-dataset regular (constant timestep)?
  • Are you foreseeing other analyses that use the 2TB-dataset?
Filip
20 May 2019
Responses to Questions:
  • No.
  • No.
  • Yes.
Looks like I will follow this path: creating a database and using SQL queries.
I am slightly confused about the comments about the search algorithm. Currently I have a trial version of the code for a subset of the data sets, and I use Matlab's find function together with arrayfun. Is the find function in Matlab not very efficient? Say my small data is d1 and the big one is d2. I have:
d1.y = arrayfun(@(z) d2.y(find(z>=d2.Tsec, 1,'last' )), d1.Tsec );
so that I get the y value from d2 at the last row where the timestamp in d1 is greater than or equal to the timestamp in d2. I have to do this after I take subsets of the data with the same ID.
But I do not want to read all 2TB of data since this is not a "use once and forget" code like you asked. Looks like my best bet is to create a database at this point.
Thank you all for help.
Jan
22 May 2019
The find function is very efficient, but not designed for sorted data.
A B-tree (https://en.wikipedia.org/wiki/B-tree) would be an efficient way to store sorted data which does not fit into RAM.

2 Answers

per isakson
Answered on 22 May 2019
Edited: per isakson on 22 May 2019
Accepted Answer

"I am slightly confused about the comments about search algorithm. [...] Is find function in Matlab not very efficient?"
I'm convinced that the Matlab function find() is efficient given the requirements. However, for the special case of a large sorted vector there are faster search algorithms. I made a little demo which shows that simple m-code implementations of interpolation_search and binary_search are more than an order of magnitude faster than find() when searching a large sorted vector.
%%
A = sort( floor( cumsum( 1+9*rand(1e9,1) ) ) ); % sample data
key = A( 2E8 );
%%
tic; [ixIS,nIS] = interpolation_search( A, key ); toc
tic; [ixBS,nBS] = binary_search( A, key ); toc
tic; ixFL = find( A==key, 1, 'last' ); toc
tic; ixFF = find( A==key, 1, 'first' ); toc
%%
[ key ; A([ixIS,ixBS,ixFL,ixFF]) ]
which outputs in the command window:
Elapsed time is 0.002190 seconds.
Elapsed time is 0.001413 seconds.
Elapsed time is 0.916372 seconds.
Elapsed time is 0.608332 seconds.
ans =
1.0e+09 *
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
>>
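The m-code implementations of the two search functions are not shown here. For reference, a minimal sketch of what binary_search() could look like; the [index, iteration count] return signature is an assumption based on how it is called above:
function [ix, n] = binary_search( A, key )
% Minimal binary search over a sorted vector A. Returns an index ix with
% A(ix) == key (empty if the key is absent) and the number of iterations n.
% This is a sketch, not the FEX .mex implementation mentioned above.
    lo = 1;
    hi = numel( A );
    ix = [];
    n  = 0;
    while lo <= hi
        n   = n + 1;
        mid = floor( (lo + hi) / 2 );
        if A(mid) < key
            lo = mid + 1;
        elseif A(mid) > key
            hi = mid - 1;
        else
            ix = mid;        % any matching index; duplicates may exist
            return
        end
    end
end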

  0 Comments



Jan
Answered on 19 May 2019
Edited: per isakson on 20 May 2019

It is a very bad idea to create a huge data set with a size of 2 TB as text files. Do the files have a fixed width for all lines? That would at least allow a binary search in each file. If you search linearly through 2 TB of text files millions of times, you might wait years to get all results. With a modern database a processing time on the order of minutes is more likely.
Please blame the person who decided to use text files.
But now you have these files. Of course you can use a binary search: find the central element of each file. If the row width is fixed, this is trivial. If not, start at the central byte and search for the next (or previous) line break. If the found value is higher than the one you are searching for, continue with the first half of the file, otherwise with the second half. This requires only log2(nData) file accesses, where nData is the number of records.
If the input data is sorted, you can also use this to refine the search by choosing the previously found position as the starting point.
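A minimal sketch of this idea in m-code, assuming a file with fixed-width lines of lineWidth bytes (including the line break), sorted by a numeric timestamp in the first column; the function name, record layout and sscanf format are assumptions:
function ix = file_binary_search( fileName, lineWidth, key )
% Binary search in a sorted text file with fixed-width lines.
% Returns the one-based record number whose first column equals key,
% or [] if no such record exists.
    fid = fopen( fileName, 'r' );
    cleaner = onCleanup( @() fclose( fid ) );       % close the file on exit
    fseek( fid, 0, 'eof' );
    nLines = floor( ftell( fid ) / lineWidth );     % number of records in the file
    lo = 0;                                         % zero-based record indices
    hi = nLines - 1;
    ix = [];
    while lo <= hi
        mid = floor( (lo + hi) / 2 );
        fseek( fid, mid*lineWidth, 'bof' );         % jump directly to record mid
        value = sscanf( fgetl( fid ), '%f', 1 );    % timestamp assumed in column 1
        if value < key
            lo = mid + 1;
        elseif value > key
            hi = mid - 1;
        else
            ix = mid + 1;                           % one-based record number
            return
        end
    end
end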

  0 Comments
