## Cleaning of large files of data

Abm

### Abm (view profile)

さんによって質問されました 2019 年 6 月 7 日

### dpb (view profile)

さんによって コメントされました 2019 年 6 月 11 日
dpb

### dpb (view profile)

さんの 回答が採用されました
Hello all,
I am working with large files of data of solar irradiance and electricity consumption in order to study the reliability of a microgrid. The data contains minutely samples for one year and there are so many missing values in each file, so I would like to clean it before analyize it but dont know how. I am a beginner in matlab and appreciate all help you can offer me.
One possible way stands in the litterature about dealing with this issue is to group the data into 24-hours subset, each willl coresponding to a calender day. any 24-h subset with more than four hours of missing data, consecutive or discarded. The missing values for the remaining 24-h subset can be synthesized through linear interpolation. Then I want to convert the minutly data to hourly averages.
The most important thing for me that both solar irradiance and load data har data i samma tid so that I can analyize them.
So much thanks to alla feedback.
Best regards.

Abm

### Abm (view profile)

on 8 Jun 2019
Thank you @dpb,
I see now, So maybe
if I use :
Data_h=retime(Data_m,'hourly','mean')
to calculate the hourly mean value, which still have some NAN values, then maybe write a code to loop through every 24-hours of the Data_h and check if it have more than 4 missing value then delete the day, and if it have less than 4 missing values, then do linear interpolation, am I right?
dpb

### dpb (view profile)

on 9 Jun 2019
See my above logic looking at the difference in time vector after removing isnan() elements--any day in which that difference >=4 is to be removed entirely; all others can be retained from the full file and interpolated.
That's doable without looping I believe.
Abm

### Abm (view profile)

on 9 Jun 2019
@dpb, thank you so much! could you suggest a way to write it in matlab? I am still confused!

サインイン to comment.

## 2 件の回答

on 10 Jun 2019
Edited by dpb

### dpb (view profile)

on 10 Jun 2019

Start something like
ttclean=rmissing(tt); % the subset with no missing data only
dt=diff(ttclean.Time); % the difference in timestampe in clean data
is4hrgap=any(dt>=hours(4)); % are there any gaps of 4hr or more?
If the above is false (0), you can use the previous result with linear interpolation; you've proved there is no gap of 4 hours or more in the data.
OTOH, if the above is true (1), then you need to find which days are those--this can be done by retrieving the individual elements of the logical vector
ixGap=find(dt>=hours(4))+1; % the indices in clean series of gaps
From this index, return the year,month and day of those elements and then remove those days from the original interpolated series. (If you remove and then re-interpolate with retime, it will just fill in all the elements again so you have to remove them afterwards).
gapDays=unique(dateshift(ttclean.Time(ixGap)); % the unique days
Those days should be removed from the interpolated result above as Walter showed.

Abm

### Abm (view profile)

on 11 Jun 2019
Hi guys,
When I run the line for unique days
Data_h_mean=retime(Data_m,'hourly','mean');% data converted from minutly to hourly
Dataclean=rmmissing(Data_h_mean);
dt=diff(Dataclean.Time) ;
is4hrgap=any(dt>=hours(4))
ixGap=find(dt>=hours(4))+1
size(Dataclean)
gapDays=unique(dateshift(Dataclean.Time(ixGap))); % the unique days
I get this error:
Error using datetime/dateshift (line 75)
Second input must be 'start', 'end', or 'dayofweek'.
Error in test_func (line 52)
gapDays=unique(dateshift(Dataclean.Time(ixGap))); % the unique
days
so I started to delete the lines manually, like this:
Dataclean([the unwanted rows],:)=[]
but it take so much time sine they are so many, so I wonder, is there any easier way to delete the undesired days from the data??
Thanks in advance !
Abm

### Abm (view profile)

on 11 Jun 2019
I tried this method also, but it only delete one hour and not the entire day which I wanted:
todelete=Dataclean.Time(ixGap)
Dataclean(todelete,:)=[];
size(Dataclean)
so, any suggestion on how to delete the many days from the hourly data?
dpb

### dpb (view profile)

on 11 Jun 2019
Oh...I was just typing in the edit window and left off the 'start' parameter in computing the day times. (I would have thought that would have been apparent to you when you looked up the documentation for dateshift to see what the error was and saw the missing parameter that the point of the exercise is to get the day(s) that have gaps in a list, though).
At that point, then the set member functions should be useful to find the intersection of the time data between that list and the included times -- remember to compare you'll have to also look at matching the day only part in the time table time stamp by also doing a dateshift on them; days won't match to day+time.

サインイン to comment.

### Cris LaPierre (view profile)

Answer by Cris LaPierre

### Cris LaPierre (view profile)

on 9 Jun 2019

Consider looking through this example in MATLAB. It might give you some ideas of how to handle the missing data.

#### 0 件のコメント

サインイン to comment.

Translated by