What Is Data Cleaning?
Data cleaning is the process of modifying data to remove or correct information in preparation for analysis. A common belief among practitioners is that 80% of analysis time is spent on this data cleaning phase. But why?
When data is collected, there are often various challenges to address. Data sets may contain missing points or outliers, or need to be merged with other data sets. Engineering and scientific data often has specific requirements, such as managing high-frequency timestamps, signal processing, and data labeling. You need to make decisions on how to deal with these data cleaning tasks.
This might sound painful, but it doesn’t have to be. MATLAB® provides many apps and functions for data cleaning tasks to make this phase faster and more informative so you can focus on your analysis and problem solving. For example, use MATLAB to:
- Explore time-series data, then find and fix problems in it, with the Data Cleaner app.
- Synchronize and smooth data, and remove or fill missing data and outliers, using Live Editor tasks that let you experiment with individual data cleaning methods.
- Call functions such as smoothdata and fillmissing, with many options for managing the data and convenient function hints.
- Quickly perform domain-specific data cleaning with the Signal Analyzer, Signal Labeler, and Image Labeler apps.
All of the apps and Live Editor tasks automatically generate MATLAB code to document and automate your interactive work.
Maybe you’ve heard it called “data wrangling” or “data munging,” referring to these different data cleaning steps required to prepare for analysis. Consider data for a system of weather sensors. The sensors could fail temporarily, leaving missing data points or outliers during that time. Different sensors are often recorded at different time steps, so the data sets must be synchronized and interpolated where the times don’t match. These are just two examples, but there may be many more steps and decisions before you consider the data “clean.”
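The weather sensor scenario can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the sensor readings, variable names, and sample times are invented, and linear interpolation is one of several reasonable choices for both the fill and the synchronization.

```matlab
% Hypothetical readings from two weather sensors sampled at different times.
t0 = datetime(2024,1,1,0,0,0);
t1 = t0 + minutes([0 10 20 30])';   % sensor 1 reports every 10 minutes
t2 = t0 + minutes([0 15 30])';      % sensor 2 reports every 15 minutes

temperature = [20; NaN; 22; 23];    % NaN marks a temporary sensor failure
humidity    = [50; 56; 62];

TT1 = timetable(temperature, 'RowTimes', t1);
TT2 = timetable(humidity,    'RowTimes', t2);

% Fill the gap left by the failed reading using linear interpolation
% over the row times.
TT1 = fillmissing(TT1, 'linear');

% Combine both sensors on the union of their time stamps, interpolating
% linearly wherever one sensor has no reading at a given time.
TT = synchronize(TT1, TT2, 'union', 'linear');
disp(TT)
```

The result has one row per distinct time stamp from either sensor. Whether interpolated values are acceptable depends on the application; for some analyses it may be safer to leave gaps marked as missing.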
Common data cleaning tasks include:
- Filling or removing missing data and outliers
- Smoothing and detrending
- Identifying outliers, changepoints, and extrema
- Joining multiple data sets
- Time-based data cleaning, including sorting, shifting, and synchronizing
- Grouping and binning data
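Several of the tasks listed above map onto individual MATLAB function calls. The following is a minimal sketch rather than a recommended workflow: the data values, table variable names, and bin edges are all made up for illustration, and each function is called with one plausible set of options.

```matlab
% Hypothetical signal with an upward trend and one spike at index 5.
x = (1:10)';
y = [1.0; 2.1; 2.9; 4.2; 50; 6.1; 6.9; 8.0; 9.1; 10.0];

% Identify outliers. By default, isoutlier flags values more than three
% scaled median absolute deviations from the median.
isOut = isoutlier(y);

% Fill outliers by linear interpolation between neighboring points.
yClean = filloutliers(y, 'linear');

% Detrend: subtract the best-fit straight line.
yDetrended = detrend(yClean);

% Identify extrema: logical mask of local maxima in the raw signal.
isPeak = islocalmax(y);

% Bin the samples by x, then compute the mean of the cleaned data per bin.
edges   = [0 5 10];
bin     = discretize(x, edges);              % bin index for each sample
binMean = accumarray(bin, yClean, [], @mean);

% Join two data sets on their shared key variable (SensorID).
readings = table([1; 2; 3], [20.1; 21.4; 19.8], ...
    'VariableNames', {'SensorID', 'Temperature'});
sites = table([1; 2; 4], ["North"; "South"; "East"], ...
    'VariableNames', {'SensorID', 'Site'});
joined = innerjoin(readings, sites);         % keeps only matching SensorIDs
```

Each call has alternatives worth comparing on real data, such as a different outlier detection method or an outer join that keeps unmatched rows.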
Mathematical algorithms are used to remedy these challenges. For example, you could fill the missing data points with the nearest neighbor or linear interpolation. Live Editor tasks and functions such as smoothdata in MATLAB will help you explore common data cleaning methods and see the results immediately to make these decisions faster.
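To make the difference between fill methods concrete, here is a small sketch comparing nearest-neighbor and linear filling, followed by a moving-average smooth. The vectors and the window length of 3 are arbitrary choices for illustration.

```matlab
% Hypothetical series with two consecutive missing points.
y = [1 2 NaN NaN 8 9];

% Nearest neighbor: each gap takes the value of the closest valid sample.
yNearest = fillmissing(y, 'nearest');   % [1 2 2 8 8 9]

% Linear interpolation: gaps lie on the line between the valid neighbors.
yLinear = fillmissing(y, 'linear');     % [1 2 4 6 8 9]

% Smooth a hypothetical noisy series with a 3-point moving average.
noisy  = [1 5 3 7 5 9];
smooth = smoothdata(noisy, 'movmean', 3);
```

Nearest-neighbor filling preserves values that were actually observed but introduces steps, while linear filling produces a smoother result containing values that were never measured. Which is preferable depends on the data and the analysis that follows.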
Machine and Deep Learning
There are often additional steps in data cleaning when creating predictive models. Consider object detection in images. The objects may need to be labeled in the images before developing an algorithm to classify them. Then the data must be organized appropriately depending on the type of algorithm (machine learning, deep learning), possibly using fewer data points, or “features,” which represent the objects. Even after training a model, you often assess feature importance, possibly repeating the process with different data cleaning steps to improve the classifications.
In general, the data goes through a pipeline like this:
- Data labeling
- General data cleaning
- Feature selection
- Train and test predictive model
- Tune and iterate on previous steps
- Deploy model to production
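The middle steps of this pipeline can be sketched as follows. This example assumes Statistics and Machine Learning Toolbox is installed and uses its fisheriris sample data, which is already labeled, so the labeling and deployment steps are omitted. The decision tree model, the 70/30 split, and the choice to keep the two most important features are illustrative assumptions, not a prescribed method.

```matlab
% Requires Statistics and Machine Learning Toolbox.
load fisheriris              % meas: 150x4 features, species: 150x1 labels

% General data cleaning: drop rows with missing feature values and
% drop the corresponding labels.
[X, removed] = rmmissing(meas);
Y = species(~removed);

% Hold out 30% of the observations for testing.
rng(0)                       % fixed seed so the split is reproducible
c = cvpartition(Y, 'HoldOut', 0.3);
XTrain = X(training(c), :);  YTrain = Y(training(c));
XTest  = X(test(c), :);      YTest  = Y(test(c));

% Train an initial model and estimate the importance of each feature.
mdl = fitctree(XTrain, YTrain);
importance = predictorImportance(mdl);

% Feature selection: keep the two most important features.
[~, order] = sort(importance, 'descend');
selected = order(1:2);

% Iterate: retrain on the reduced feature set and evaluate on test data.
mdlReduced = fitctree(XTrain(:, selected), YTrain);
YPred = predict(mdlReduced, XTest(:, selected));
accuracy = mean(strcmp(YPred, YTest));
fprintf('Test accuracy with %d features: %.2f\n', numel(selected), accuracy);
```

In practice this loop is repeated with different cleaning steps, feature sets, and model types, comparing test accuracy each time before deploying.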
MATLAB provides apps and functions throughout this workflow. You can label classes for images, signals, audio, and video.
There are often more specific data cleaning needs, based on your domain, type of data, and application. For example, Statistics and Machine Learning Toolbox™, Signal Processing Toolbox™, Predictive Maintenance Toolbox™, Text Analytics Toolbox™, Computer Vision Toolbox™, and Audio Toolbox™ all include functionality and apps specific to data cleaning and wrangling for these formats and applications.
For more information, see the resources below.