MATLAB Answers

Access to data without reading all of it

Asked by Filip on 18 May 2019
Latest activity: edited by per isakson on 22 May 2019
Hi all,
I have two data sets. One is 10 GB and the other one is 2 TB. They are both txt files.
Say that in the small data set I have the variables timestamp, ID and x. In the big one, I have timestamp, ID and y. The numbers of unique IDs and timestamps are much higher in the big data set.
For each observation in the small data, I want to find the row with the same millisecond and ID in the big data and then copy the value of y to the small data.
Is it possible to find corresponding rows without reading 2TB of data?
Thanks

  5 Comments

per isakson
20 May 2019
Now I regard your two data sets as tables and think that you are looking for code to make an inner join (SQL lingo). I can think of a few different approaches:
  • Read the data sets into a relational database and use SQL. Pro: conceptually simple. Con: relational databases are typically not efficient with time series. (No Matlab.)
  • Loop over all the lines of the 10GB-dataset and search the 2TB-dataset. The search must be fast: binary search or, maybe better, interpolation search. In the File Exchange there is Fast Binary Search ("Search algorithm to search for values in a sorted array. Written in C (.mex) for speed.") and more, but no interpolation-search code. The search of the 2TB-dataset could be based on fgetl() together with fseek(), or on memmapfile().
  • Use the data type table together with the method innerjoin(). innerjoin() is implemented in m-code and I have no idea about its performance.
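For illustration, a minimal sketch of that third approach, assuming both files fit in memory as tables and that the key variables are named Tsec and ID (the file names and variable names are assumptions):
d1 = readtable( 'small.txt' );                          % timestamp (Tsec), ID, x
d2 = readtable( 'big.txt' );                            % timestamp (Tsec), ID, y
joined = innerjoin( d1, d2, 'Keys', {'Tsec','ID'} );    % keeps rows with matching Tsec and ID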
Questions:
  • Are you looking for a code to use once and forget?
  • Are the time series of the separate text files of the 2TB-dataset regular (constant timestep)?
  • Are you foreseeing other analyses that use the 2TB-dataset?
Filip
20 May 2019
Responses to Questions:
  • No.
  • No.
  • Yes.
Looks like I will follow this path: creating a database and using SQL queries.
I am slightly confused about the comments about the search algorithm. Currently I have a trial version of the code for a subset of the data sets, and I use Matlab's find function together with arrayfun. Is the find function in Matlab not very efficient? Say my small data is d1 and the big one is d2. I have:
d1.y = arrayfun(@(z) d2.y(find(z>=d2.Tsec, 1,'last' )), d1.Tsec );
so that I get the y value from d2 at the last row where the timestamp in d1 is greater than or equal to the timestamp in d2. I have to do this after I take subsets of the data with the same ID.
But I do not want to read all 2TB of data since this is not a "use once and forget" code like you asked. Looks like my best bet is to create a database at this point.
Thank you all for help.
Jan
22 May 2019
The find function is very efficient, but not designed for sorted data.
A B-tree (https://en.wikipedia.org/wiki/B-tree) would be an efficient way to store sorted data which does not fit into RAM.

2 Answers

per isakson
Answered on 22 May 2019
Edited: per isakson on 22 May 2019
Accepted Answer

"I am slightly confused about the comments about search algorithm. [...] Is find function in Matlab not very efficient?"
I'm convinced that the Matlab function find() is efficient given the requirements. However, for the special case of a large sorted vector there are faster search algorithms. I made a little demo which shows that simple m-code implementations of interpolation_search and binary_search are more than an order of magnitude faster than find() when searching a large sorted vector.
%%
A = sort( floor( cumsum( 1+9*rand(1e9,1) ) ) ); % sample data
key = A( 2E8 );
%%
tic; [ixIS,nIS] = interpolation_search( A, key ); toc
tic; [ixBS,nBS] = binary_search( A, key ); toc
tic; ixFL = find( A==key, 1, 'last' ); toc
tic; ixFF = find( A==key, 1, 'first' ); toc
%%
[ key ; A([ixIS,ixBS,ixFL,ixFF]) ]
which outputs in the command window:
Elapsed time is 0.002190 seconds.
Elapsed time is 0.001413 seconds.
Elapsed time is 0.916372 seconds.
Elapsed time is 0.608332 seconds.
ans =
1.0e+09 *
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
1.100021924000000
>>
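The m-code implementations of the two search functions are not shown here. For reference, a minimal sketch of what binary_search() could look like; the [index, iteration count] return signature is an assumption based on how it is called above:
function [ix, n] = binary_search( A, key )
% Minimal binary search over a sorted vector A. Returns an index ix with
% A(ix) == key (empty if the key is absent) and the number of iterations n.
% This is a sketch, not the FEX .mex implementation mentioned above.
    lo = 1;
    hi = numel( A );
    ix = [];
    n  = 0;
    while lo <= hi
        n   = n + 1;
        mid = floor( (lo + hi) / 2 );
        if A(mid) < key
            lo = mid + 1;
        elseif A(mid) > key
            hi = mid - 1;
        else
            ix = mid;        % any matching index; duplicates may exist
            return
        end
    end
end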

  0 Comments



Jan
Answered on 19 May 2019
Edited: per isakson on 20 May 2019

It is a very bad idea to create a huge data set with a size of 2 TB as text files. Do the files have a fixed width for all lines? That would at least allow a binary search in each file. If you search linearly through 2 TB of text files millions of times, you might wait years to get all results. With a modern database a processing time on the order of minutes is more likely.
Please blame the person who decided to use text files.
But now you have these files. Of course you can use a binary search: find the central element of each file. If the row width is fixed, this is trivial. If not, start at the central byte and search for the next (or previous) line break. If the found value is higher than the one you are searching for, continue with the first half of the file, otherwise with the second half. This requires only log2(nData) file accesses, where nData is the number of records.
If the input data is sorted, you can also use this to refine the search by choosing the previously found position as the starting point.
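A minimal sketch of this idea in m-code, assuming a file with fixed-width lines of lineWidth bytes (including the line break), sorted by a numeric timestamp in the first column; the function name, record layout and sscanf format are assumptions:
function ix = file_binary_search( fileName, lineWidth, key )
% Binary search in a sorted text file with fixed-width lines.
% Returns the one-based record number whose first column equals key,
% or [] if no such record exists.
    fid = fopen( fileName, 'r' );
    cleaner = onCleanup( @() fclose( fid ) );       % close the file on exit
    fseek( fid, 0, 'eof' );
    nLines = floor( ftell( fid ) / lineWidth );     % number of records in the file
    lo = 0;                                         % zero-based record indices
    hi = nLines - 1;
    ix = [];
    while lo <= hi
        mid = floor( (lo + hi) / 2 );
        fseek( fid, mid*lineWidth, 'bof' );         % jump directly to record mid
        value = sscanf( fgetl( fid ), '%f', 1 );    % timestamp assumed in column 1
        if value < key
            lo = mid + 1;
        elseif value > key
            hi = mid - 1;
        else
            ix = mid + 1;                           % one-based record number
            return
        end
    end
end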

  0 Comments
