Clustering or cluster analysis is an unsupervised learning method used in machine learning and data analysis that organizes your data so that data points in the same group (or cluster) are more similar to each other than to those in other groups. Clustering helps to make sense of large and complex data sets by uncovering patterns and trends or making predictions on unlabeled data.
How Clustering Works
Clustering involves several key steps including data preparation, defining a similarity measure, choosing the right clustering algorithm, and evaluating and refining the clusters.
Clustering works by measuring the similarity between data points and grouping them so that each point is more similar to the other points in its own cluster than to points in any other cluster. The concept of “similarity” varies depending on the context and the data, and it’s a fundamental aspect of unsupervised learning. Various similarity measures can be used, including Euclidean distance, cosine distance, correlation, and probabilistic measures.
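To make the idea of a similarity measure concrete, here is a minimal sketch in plain Python of three of the measures named above. The function names are illustrative choices, not part of any particular toolbox, and the sketch assumes nonzero, nonconstant input vectors of equal length.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # One minus the cosine of the angle between the vectors.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def correlation_distance(a, b):
    # One minus the Pearson correlation: cosine distance after
    # subtracting each vector's mean. Assumes nonconstant vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_distance(centered_a, centered_b)

p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]  # same direction as p, twice the magnitude
print(euclidean_distance(p, q))    # clearly nonzero: the points are far apart
print(cosine_distance(p, q))       # approximately 0: same direction
print(correlation_distance(p, q))  # approximately 0: perfectly correlated
```

The example shows why the choice of measure matters: the same two points are far apart by Euclidean distance but essentially identical by cosine and correlation distance.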
Types of Clustering Algorithms
Clustering algorithms fall into two broad groups:
- Hard clustering: When each data point belongs to only one cluster, such as the popular k-means method
- Soft clustering: When each data point can belong to more than one cluster, such as in Gaussian mixture models
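The difference between the two groups can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a full algorithm: the cluster centers are given rather than learned, and the soft memberships use the fuzzy c-means membership formula (with an assumed fuzziness exponent `m`) in place of a Gaussian mixture model. All function names are hypothetical.

```python
import math

def hard_assignment(point, centers):
    # Hard clustering: the point belongs only to its nearest center.
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

def soft_memberships(point, centers, m=2.0):
    # Soft clustering: one membership degree per center, summing to 1.
    # Uses the fuzzy c-means membership formula; m > 1 is an assumed
    # fuzziness exponent (larger m gives softer memberships).
    distances = [math.dist(point, c) for c in centers]
    if 0.0 in distances:
        # The point coincides with a center: split full membership
        # among the coinciding centers to avoid dividing by zero.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(hard_assignment(point, centers))   # index of the single nearest center
print(soft_memberships(point, centers))  # degrees of membership in both clusters
```

The hard assignment discards how close the point was to the other center, while the soft memberships keep that information as a weight.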
Each clustering algorithm takes a different approach to grouping data, and the methods vary significantly in their mechanics and ideal use cases. The most common types of clustering algorithms used in machine learning are:
- Hierarchical clustering builds a multilevel hierarchy of clusters by creating a cluster tree.
- k-means clustering partitions data into k distinct clusters based on the distance to the centroid of a cluster.
- Gaussian mixture models form clusters as a mixture of multivariate normal density components.
- Density-based spatial clustering of applications with noise (DBSCAN) groups points that are close to each other in areas of high density, keeping track of outliers in low-density regions. It can handle arbitrary nonconvex shapes.
- Self-organizing maps use neural networks that learn the topology and distribution of the data.
- Spectral clustering transforms input data into a graph-based representation where the clusters are better separated than in the original feature space. The number of clusters can be estimated by studying the eigenvalues of the graph Laplacian.
- Hidden Markov models can be used to discover patterns in sequences, such as genes and proteins in bioinformatics.
- Fuzzy c-means (FCM) groups data into N clusters, with every data point in the data set belonging to every cluster to a certain degree.
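As an illustration of how one of these algorithms operates, the following is a minimal sketch of k-means (the standard assign-then-update iteration) in plain Python. It is not the implementation used by any particular toolbox. The `initial_centroids` argument is an assumption made here to keep the example deterministic; practical implementations usually pick starting centroids randomly or with a seeding scheme.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    # Assumption: the caller supplies the k starting centroids.
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)     # one cluster index per point
print(centroids)  # final centroid coordinates
```

The result depends on the starting centroids, which is why k-means is often run several times from different starting points.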
Clustering for Unsupervised Learning
Unsupervised learning is a type of machine learning algorithm used to draw inferences from unlabeled data without human intervention. Clustering is the most common unsupervised learning method. It applies clustering algorithms to explore data and find hidden patterns or groupings in data without any prior knowledge of group labels. Using these groups and patterns, clustering helps to extract useful insights from unlabeled data and reveal inherent structures within it.
Why Clustering Is Important
Clustering is a significant area of artificial intelligence. It plays an important role in various domains by offering valuable insights into data and uncovering patterns and relationships that are not immediately obvious. For unlabeled data, where the inherent relationship between the data points is hidden but required for revealing useful insights, clustering helps in discovering those relationships and organizing the unlabeled data into meaningful groups.
By grouping similar items, clustering reduces data complexity so that you can focus on the behavior of the groups rather than getting overwhelmed by individual data points. This makes clustering useful for exploratory data analysis and semisupervised learning. In the latter, clustering is used as a preprocessing step before supervised learning to reduce the amount of data to be processed by a machine learning model and improve the predictive modeling accuracy.
Clustering is also frequently used in applications such as anomaly detection, image segmentation, and pattern recognition. More specifically, clustering can be applied in the following areas to identify patterns and sequences:
- Clusters can represent the data instead of the raw signal in data compression methods.
- Clusters indicate regions of images and lidar point clouds in segmentation algorithms.
- Clustering can assist in identifying outliers or anomalies within a data set.
- In medical imaging, clustering algorithms can be used to separate images into regions of interest, such as for differentiating between healthy tissue and tumors or segmenting the brain into white matter, gray matter, and cerebrospinal fluid.
- Clustering is used in geographic information systems (GISs) to analyze satellite imagery or aerial photographs to identify urban sprawl or land use patterns, or to monitor changes in urban areas over time.
- Genetic clustering and sequence analysis are used in bioinformatics.
Clustering with MATLAB
Using MATLAB® with Statistics and Machine Learning Toolbox™, you can identify patterns and features by applying clustering methods of your choice and dividing your data into groups or clusters. With Image Processing Toolbox™, you can perform clustering on image data.
Data Preparation
For accurate and efficient clustering results, it is vital to preprocess the data and handle missing values and outliers. You can clean and preprocess your data programmatically using built-in functions or interactively using the Data Cleaner app.
Clustering Algorithms
MATLAB supports all popular clustering algorithms, which you can apply with built-in functions, such as the kmeans function. You can use the Cluster Data Live Editor task to interactively perform k-means and hierarchical clustering. Using the task, you can automatically generate MATLAB code for your live script.
You can also perform nearest-neighbors clustering in Simulink by using the KNN Search block. The block accepts a query point and returns the k nearest-neighbor points in the observational data using a nearest-neighbor searcher object.
Visualize and Evaluate Clustering Results
When the data does not contain natural divisions that indicate the appropriate number of clusters, you can use different evaluation criteria, such as gap or silhouette, to determine how well the data fits into a particular number of clusters. You can also visualize clusters to inspect clustering results. For example, you can use a dendrogram plot for clustering visualization.
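To show what the silhouette criterion measures, here is a language-neutral sketch written in plain Python. It is not MATLAB code and not the toolbox's evaluation function. It assumes Euclidean distance, at least two clusters, and no cluster made up entirely of identical points. The function name and the convention of scoring single-point clusters as 0 are choices made for this example.

```python
import math

def silhouette_score(points, labels):
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a single-point cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    # Average over all points; values near 1 indicate well-separated clusters.
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(good, bad)  # the natural grouping scores higher than the mixed one
```

Computing this score for several candidate numbers of clusters and picking the highest is one common way to choose the number of clusters.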
Clustering for Images
You can perform image segmentation (using the imsegkmeans function) and volume segmentation (using the imsegkmeans3 function) on images by clustering regions of pixels based on similarities in color or shape. You can create a segmented labeled image using a specific clustering algorithm. For example, in medical imaging you can detect and label pixels in an image or voxels of a 3D volume that represent a tumor in a patient’s brain or other organs. By leveraging MATLAB tools, you can process and analyze images for a wide range of applications, from disease diagnosis to land use classification.