フィルターのクリア

How to find rows in table quickly based on time ranges specified in another table?

5 ビュー (過去 30 日間)
Jori Selen
Jori Selen 2023 年 10 月 30 日
編集済み: Jori Selen 2023 年 10 月 31 日
I have a table T that contains a time start, time end and an ID. I have another table S that contains a timestamp and an ID. T typically has 50k rows, while S has ~1 million rows. Simplified example code for tables T and S:
rng(1);
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
head(T)
timeStart timeEnd id ____________________ ____________________ __ 01-Jan-2023 00:00:00 02-Jan-2023 00:00:00 1 01-Jan-2023 00:02:52 02-Jan-2023 00:02:52 2 01-Jan-2023 00:05:45 02-Jan-2023 00:05:45 3 01-Jan-2023 00:08:38 02-Jan-2023 00:08:38 4 01-Jan-2023 00:11:31 02-Jan-2023 00:11:31 5 01-Jan-2023 00:14:24 NaT 6 01-Jan-2023 00:17:16 02-Jan-2023 00:17:16 7 01-Jan-2023 00:20:09 02-Jan-2023 00:20:09 8
size(T)
ans = 1×2
50001 3
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
head(S)
timeStamp id ____________________ _____ 01-Jan-2023 00:00:00 8167 01-Jan-2023 00:00:08 13177 01-Jan-2023 00:00:17 22150 01-Jan-2023 00:00:25 8933 01-Jan-2023 00:00:34 22715 01-Jan-2023 00:00:43 15585 01-Jan-2023 00:00:51 396 01-Jan-2023 00:01:00 23237
size(S)
ans = 1×2
1000001 2
For each row in T, I would like to find all the rows in S that has a timestamp within the (time start, time end) range and has the same ID as in T. Rows in T that do not have an timeEnd can be ignored. I see a few approaches:
  1. Use isbetween() in a for loop around all the rows of T for the time-based check, but that was slow and really does not scale well with the number of rows in S (and linearly in the number of rows in T).
  2. I could achieve this by converting S to a timetable and using timeranges constructed from T.timeStart and T.timeEnd together with a check on the T.id and S.id columns, but that seemed to be just as slow as approach 1.
  3. I can convert all datetimes to unix time (using posixtime) and instead work with those as real numbers. This seems to be much faster than approaches 1 and 2.
  4. I can take approach 3 and instead of using a for loop, compare all time ranges against all timestamps and create one giant (1 million by 50k) logical matrix indicating whether a row in S is within the timerange of each row of T. This matrix is sparse. Like all previous approaches, this does not scale well at all. To make memory use manageable, I can create a number of smaller (X by 50k) sparse matrices and then vertcat them at the end.
Does anyone see a good approach that scales well and keeps computation time manageable?

回答 (2 件)

Voss
Voss 2023 年 10 月 30 日
編集済み: Voss 2023 年 10 月 30 日
% input data construction (copied):
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
id = (1:numel(timeStart))';
T = table(timeStart, timeEnd, id);
rng(1);
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
% find which row of T contains each id in S:
[ism,idx] = ismember(S.id,T.id);
% make sure each id in S exists in T:
assert(all(ism));
% make a new table (S_new) that has the timeStamp from S and everything
% from T, but the rows of T are ordered according to the id in S:
S_new = [S(:,1), T(idx,:)];
head(S_new)
timeStamp timeStart timeEnd id ____________________ ____________________ ____________________ _____ 01-Jan-2023 00:00:00 11-Feb-2023 16:50:52 12-Feb-2023 16:50:52 20852 01-Jan-2023 00:00:08 14-Mar-2023 00:46:04 15-Mar-2023 00:46:04 36017 01-Jan-2023 00:00:17 01-Jan-2023 00:14:24 02-Jan-2023 00:14:24 6 01-Jan-2023 00:00:25 31-Jan-2023 05:34:04 01-Feb-2023 05:34:04 15117 01-Jan-2023 00:00:34 15-Jan-2023 16:10:33 16-Jan-2023 16:10:33 7338 01-Jan-2023 00:00:43 10-Jan-2023 05:36:57 11-Jan-2023 05:36:57 4618 01-Jan-2023 00:00:51 19-Jan-2023 15:01:26 20-Jan-2023 15:01:26 9314 01-Jan-2023 00:01:00 04-Feb-2023 13:20:38 05-Feb-2023 13:20:38 17279
size(S_new)
ans = 1×2
1000001 4
% now keep only rows in S_new where timeStamp is between timeStart and timeEnd:
to_keep = S_new.timeStamp >= S_new.timeStart & S_new.timeStamp <= S_new.timeEnd;
S_new = S_new(to_keep,:);
head(S_new)
timeStamp timeStart timeEnd id ____________________ ____________________ ____________________ ___ 01-Jan-2023 02:58:24 01-Jan-2023 01:49:26 02-Jan-2023 01:49:26 39 01-Jan-2023 03:50:24 01-Jan-2023 02:26:52 02-Jan-2023 02:26:52 52 01-Jan-2023 06:14:15 01-Jan-2023 04:42:14 02-Jan-2023 04:42:14 99 01-Jan-2023 06:59:11 01-Jan-2023 02:38:24 02-Jan-2023 02:38:24 56 01-Jan-2023 07:14:52 01-Jan-2023 06:17:16 02-Jan-2023 06:17:16 132 01-Jan-2023 07:31:00 01-Jan-2023 06:31:40 02-Jan-2023 06:31:40 137 01-Jan-2023 07:58:04 01-Jan-2023 07:32:09 02-Jan-2023 07:32:09 158 01-Jan-2023 08:03:59 01-Jan-2023 01:52:19 02-Jan-2023 01:52:19 40
size(S_new)
ans = 1×2
10033 4
  2 件のコメント
Jori Selen
Jori Selen 2023 年 10 月 31 日
Thanks for the nice solution. I've tried implementing it for my real case and ran into an issue. I should have been a bit more precise in my example: id's in the table T are also not unique and are reused. However, no 2 rows will have the same id and have overlap in their (timeStart, timeEnd) intervals. I will adapt the example.
Jori Selen
Jori Selen 2023 年 10 月 31 日
In addition, some timeEnd fields are missing. That is not a problem for your solution, but I've added it to the example as well.

サインインしてコメントする。


Jori Selen
Jori Selen 2023 年 10 月 31 日
編集済み: Jori Selen 2023 年 10 月 31 日
I've come up with the solution below, but it still takes ~50 seconds on my machine with approximately all time spent on the sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id'); line. It should be possible to create something considerably faster, but I don't see how.
rng(1);
% Create table T
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
nT = rows(T);
% Create table S
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
nS = rows(S);
% Determine which rows in T and S have the same ID
MAX_ELEMENTS_PER_GROUP = 1e9;
nElementsPerGroup = ceil(MAX_ELEMENTS_PER_GROUP / nT);
nGroups = ceil(nS / nElementsPerGroup);
sameID = cell(nGroups, 1);
for iGroup = 1:nGroups % constructed incrementally to not run into memory issues
idxStart = (iGroup - 1)*nElementsPerGroup + 1;
idxEnd = min(iGroup*nElementsPerGroup, nS);
sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id');
end
sameID = vertcat(sameID{:}); % sparse logical matrix with nS rows and nT columns
% For each row in T, determine which rows of S have the same ID and have a
% timeStamp within the [timeStart, timeEnd] timerange
idx = 1:nS;
idxAssociatedRowsInS = cell(nT, 1);
for iT = 1:nT
idxSameID = idx(sameID(:, iT));
withinTimeRange = isbetween(S.timeStamp(idxSameID), T.timeStart(iT), T.timeEnd(iT));
idxAssociatedRowsInS{iT} = idxSameID(withinTimeRange);
end
T.idxAssociatedRowsInS = idxAssociatedRowsInS;
% any other way of storing the same information in T (or a joined table) is also fine

カテゴリ

Help Center および File ExchangeData Type Conversion についてさらに検索

製品


リリース

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by