How to find rows in table quickly based on time ranges specified in another table?

Question

Jori Selen 2023 年 10 月 30 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2040491-how-to-find-rows-in-table-quickly-based-on-time-ranges-specified-in-another-table

編集済み: Jori Selen 2023 年 10 月 31 日

I have a table T that contains a time start, time end and an ID. I have another table S that contains a timestamp and an ID. T typically has 50k rows, while S has ~1 million rows. Simplified example code for tables T and S:

rng(1);
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
head(T)
         timeStart                timeEnd           id
    ____________________    ____________________    __

    01-Jan-2023 00:00:00    02-Jan-2023 00:00:00    1 
    01-Jan-2023 00:02:52    02-Jan-2023 00:02:52    2 
    01-Jan-2023 00:05:45    02-Jan-2023 00:05:45    3 
    01-Jan-2023 00:08:38    02-Jan-2023 00:08:38    4 
    01-Jan-2023 00:11:31    02-Jan-2023 00:11:31    5 
    01-Jan-2023 00:14:24                     NaT    6 
    01-Jan-2023 00:17:16    02-Jan-2023 00:17:16    7 
    01-Jan-2023 00:20:09    02-Jan-2023 00:20:09    8 
size(T)
ans = 1×2
       50001           3
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
head(S)
         timeStamp           id  
    ____________________    _____

    01-Jan-2023 00:00:00     8167
    01-Jan-2023 00:00:08    13177
    01-Jan-2023 00:00:17    22150
    01-Jan-2023 00:00:25     8933
    01-Jan-2023 00:00:34    22715
    01-Jan-2023 00:00:43    15585
    01-Jan-2023 00:00:51      396
    01-Jan-2023 00:01:00    23237
size(S)
ans = 1×2
     1000001           2

For each row in T, I would like to find all the rows in S that has a timestamp within the (time start, time end) range and has the same ID as in T. Rows in T that do not have an timeEnd can be ignored. I see a few approaches:

Use isbetween() in a for loop around all the rows of T for the time-based check, but that was slow and really does not scale well with the number of rows in S (and linearly in the number of rows in T).
I could achieve this by converting S to a timetable and using timeranges constructed from T.timeStart and T.timeEnd together with a check on the T.id and S.id columns, but that seemed to be just as slow as approach 1.
I can convert all datetimes to unix time (using posixtime) and instead work with those as real numbers. This seems to be much faster than approaches 1 and 2.
I can take approach 3 and instead of using a for loop, compare all time ranges against all timestamps and create one giant (1 million by 50k) logical matrix indicating whether a row in S is within the timerange of each row of T. This matrix is sparse. Like all previous approaches, this does not scale well at all. To make memory use manageable, I can create a number of smaller (X by 50k) sparse matrices and then vertcat them at the end.

Does anyone see a good approach that scales well and keeps computation time manageable?

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Voss 2023 年 10 月 30 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2040491-how-to-find-rows-in-table-quickly-based-on-time-ranges-specified-in-another-table#answer_1343761

編集済み: Voss 2023 年 10 月 30 日

MATLAB Online で開く

% input data construction (copied):
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
id = (1:numel(timeStart))';
T = table(timeStart, timeEnd, id);
rng(1);
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
% find which row of T contains each id in S:
[ism,idx] = ismember(S.id,T.id);
% make sure each id in S exists in T:
assert(all(ism));
% make a new table (S_new) that has the timeStamp from S and everything
% from T, but the rows of T are ordered according to the id in S:
S_new = [S(:,1), T(idx,:)];
head(S_new)
         timeStamp               timeStart                timeEnd            id  
    ____________________    ____________________    ____________________    _____

    01-Jan-2023 00:00:00    11-Feb-2023 16:50:52    12-Feb-2023 16:50:52    20852
    01-Jan-2023 00:00:08    14-Mar-2023 00:46:04    15-Mar-2023 00:46:04    36017
    01-Jan-2023 00:00:17    01-Jan-2023 00:14:24    02-Jan-2023 00:14:24        6
    01-Jan-2023 00:00:25    31-Jan-2023 05:34:04    01-Feb-2023 05:34:04    15117
    01-Jan-2023 00:00:34    15-Jan-2023 16:10:33    16-Jan-2023 16:10:33     7338
    01-Jan-2023 00:00:43    10-Jan-2023 05:36:57    11-Jan-2023 05:36:57     4618
    01-Jan-2023 00:00:51    19-Jan-2023 15:01:26    20-Jan-2023 15:01:26     9314
    01-Jan-2023 00:01:00    04-Feb-2023 13:20:38    05-Feb-2023 13:20:38    17279
size(S_new)
ans = 1×2
     1000001           4
% now keep only rows in S_new where timeStamp is between timeStart and timeEnd:
to_keep = S_new.timeStamp >= S_new.timeStart & S_new.timeStamp <= S_new.timeEnd;
S_new = S_new(to_keep,:);
head(S_new)
         timeStamp               timeStart                timeEnd           id 
    ____________________    ____________________    ____________________    ___

    01-Jan-2023 02:58:24    01-Jan-2023 01:49:26    02-Jan-2023 01:49:26     39
    01-Jan-2023 03:50:24    01-Jan-2023 02:26:52    02-Jan-2023 02:26:52     52
    01-Jan-2023 06:14:15    01-Jan-2023 04:42:14    02-Jan-2023 04:42:14     99
    01-Jan-2023 06:59:11    01-Jan-2023 02:38:24    02-Jan-2023 02:38:24     56
    01-Jan-2023 07:14:52    01-Jan-2023 06:17:16    02-Jan-2023 06:17:16    132
    01-Jan-2023 07:31:00    01-Jan-2023 06:31:40    02-Jan-2023 06:31:40    137
    01-Jan-2023 07:58:04    01-Jan-2023 07:32:09    02-Jan-2023 07:32:09    158
    01-Jan-2023 08:03:59    01-Jan-2023 01:52:19    02-Jan-2023 01:52:19     40
size(S_new)
ans = 1×2
       10033           4

2 件のコメント
なしを表示なしを非表示

Jori Selen 2023 年 10 月 31 日

Thanks for the nice solution. I've tried implementing it for my real case and ran into an issue. I should have been a bit more precise in my example: id's in the table T are also not unique and are reused. However, no 2 rows will have the same id and have overlap in their (timeStart, timeEnd) intervals. I will adapt the example.

Jori Selen 2023 年 10 月 31 日

In addition, some timeEnd fields are missing. That is not a problem for your solution, but I've added it to the example as well.

サインインしてコメントする。

Answer 2

Jori Selen 2023 年 10 月 31 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/2040491-how-to-find-rows-in-table-quickly-based-on-time-ranges-specified-in-another-table#answer_1344211

編集済み: Jori Selen 2023 年 10 月 31 日

MATLAB Online で開く

I've come up with the solution below, but it still takes ~50 seconds on my machine with approximately all time spent on the sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id'); line. It should be possible to create something considerably faster, but I don't see how.

rng(1);
% Create table T
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
nT = rows(T);
% Create table S
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
nS = rows(S);
% Determine which rows in T and S have the same ID
MAX_ELEMENTS_PER_GROUP = 1e9;
nElementsPerGroup = ceil(MAX_ELEMENTS_PER_GROUP / nT);
nGroups = ceil(nS / nElementsPerGroup);
sameID = cell(nGroups, 1);
for iGroup = 1:nGroups % constructed incrementally to not run into memory issues
    idxStart = (iGroup - 1)*nElementsPerGroup + 1;
    idxEnd = min(iGroup*nElementsPerGroup, nS);
    sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id');
end
sameID = vertcat(sameID{:}); % sparse logical matrix with nS rows and nT columns
% For each row in T, determine which rows of S have the same ID and have a
% timeStamp within the [timeStart, timeEnd] timerange
idx = 1:nS;
idxAssociatedRowsInS = cell(nT, 1);
for iT = 1:nT
    idxSameID = idx(sameID(:, iT));
    withinTimeRange = isbetween(S.timeStamp(idxSameID), T.timeStart(iT), T.timeEnd(iT));
    idxAssociatedRowsInS{iT} = idxSameID(withinTimeRange);
end
T.idxAssociatedRowsInS = idxAssociatedRowsInS; 
% any other way of storing the same information in T (or a joined table) is also fine

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

How to find rows in table quickly based on time ranges specified in another table?

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

回答 (2 件)

2 件のコメント
なしを表示なしを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

How to find rows in table quickly based on time ranges specified in another table?

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

回答 (2 件)

2 件のコメント なしを表示なしを非表示

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

2 件のコメント
なしを表示なしを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示