Fast move in datastore (TabularTextDatastore)
古いコメントを表示
Hi,
I'm using TabularTextDatastore to iteratively read from several huge (~GB) text files. Is there a way how to move in a datastore say ~millions of rows without actually reading the data (some kind of offset)?
Note:
a) One step reading is impractical due the memory requirements.
b) Iterative reading of smaller portions requires some time...
c) Need to check data while actually reading due to this problem https://www.mathworks.com/matlabcentral/answers/260298-how-to-solve-problem-with-smaller-number-of-records-read-from-datastore
Thanks for your help.
Adam
採用された回答
その他の回答 (2 件)
Aaditya Kalsi
2017 年 10 月 26 日
If where you want to seek to in the datastore is approximate, there may be a way to do this using PARTITION:
% divide the datastore into 1000 parts and pick the 4th
subds = partition(ds,1000,4);
Granted that this is not exact, but it may be what you are looking for.
1 件のコメント
AnaelG
2018 年 2 月 9 日
Not sure it applies here but this worked for me to at least be able to skip to the Nth file.
ds= tabularTextDatastore(files...);
ds.ReadSize= 'file';
numFiles= length(ds.Files);
tableN= read(ds.partition('Files', N ))
Adam Koutný
2017 年 10 月 27 日
1 件のコメント
Jeremy Hughes
2017 年 10 月 27 日
Hmmm, you might try using tall arrays. This is newer than datastore, and I don't know which release you're using.
>> ds = tabularTextDatastore(files...);
>> T = tall(ds)
>> subT = T(T.Var1 > 3 & T.Var1 < 6,:)
>> gather(subT)
Tall arrays let you operate on the datastore in many of the same ways you can work on in-memory arrays. If you have Parallel Computing Toolbox, you can execute your calculations on multiple workers. You can also use the same SelectedVariableNames optimization.
I believe this will help your workflow. Also, look into TIMETABLE which has additional features for working with timebased data. Based on your description, it sounds like you might be able to do what you're looking for with something like this:
ds = tabularTextDatastore('airlinesmall.csv')
ds.SelectedVariableNames = {'Year','Month','DayofMonth'}
T = tall(ds)
TT = table2timetable(T,'RowTimes',datetime(T.Year,T.Month,T.DayofMonth))
subTT = TT(timerange('16-Oct-1987','21-Oct-1987'),:)
gather(subTT)
Hope this helps,
Jeremy
カテゴリ
ヘルプ センター および File Exchange で Deep Learning Toolbox についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!