# Abstract

In unsupervised machine learning, models are trained using data that consists only of *feature* values without any known *labels*. The models determine relationships between the features of the observations in the training data.

# Clustering

- Identifies similarities between observations based on their features and groups them into discrete clusters.
- The label is the cluster to which the observation is assigned based only on its features.
- Examples:
  - Flowers grouped by their size, number of leaves, and number of petals.
  - Customers grouped by demographic attributes and purchasing behavior.

Clustering is similar to multi-class classification except that the classes are unknown. In some cases, clustering is used to determine classes that are then used for training a classification model.

# Training Clustering Models

## K-Means algorithm

K-Means clustering is achieved with these steps:

1. The feature ($x$) values are vectorized to define $n$-dimensional coordinates (where $n$ is the number of features).
   - Using two features from the flower example: number of leaves ($x_1$) and number of petals ($x_2$).
   - The resulting feature vector is $[x_1, x_2]$.
2. You decide how many clusters to use; this number is $k$. Then $k$ points are plotted at random coordinates. These points become the center of each cluster (called *centroids*).
3. Each data point is assigned to its nearest centroid.
4. Each centroid is moved to the center of the data points assigned to it, that is, to the mean of those points' coordinates.
5. After the centroid is moved, some data points may now be closer to a *different* centroid. Those data points are reassigned to clusters based on the new closest centroid.
6. Steps 4 and 5 are repeated until the clusters become stable or a predetermined maximum number of iterations is reached.
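
The procedure above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters, and the flower measurements are all hypothetical, and the starting centroids are sampled from the data points themselves (an assumption of this sketch; the notes describe random coordinates).

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Choose k starting centroids. Sampling existing data points is an
    # assumption of this sketch, not something the notes prescribe.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iters):
        # Assign every point to the index of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster (the clusters are stable).
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )

    return centroids, assignments


# Hypothetical flower data: (number of leaves, number of petals).
flowers = [(0, 5), (1, 4), (1, 6), (5, 1), (6, 2), (6, 0),
           (9, 8), (10, 9), (9, 10)]
centroids, labels = kmeans(flowers, k=3)
for point, label in zip(flowers, labels):
    print(point, "-> cluster", label)
```

The result depends on the starting centroids: K-Means can settle on a stable but poor grouping, which is why it is commonly run several times with different seeds.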

# Cluster Model Evaluation

Because the training data has no known labels to compare the cluster assignments against, a clustering model is evaluated by how well the resulting clusters are separated from one another.

Metrics:

- *Average distance to cluster center*: how close, on average, each point in the cluster is to the centroid of that cluster.
- *Average distance to other center*: how close, on average, each point in the cluster is to the centroids of all other clusters.
- *Maximum distance to cluster center*: the furthest distance between a point in the cluster and its centroid.
- *Silhouette*: a value between -1 and 1 that summarizes the ratio of distance between points in the same cluster and points in different clusters (the closer to 1, the better the cluster separation).
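
As a rough illustration, these metrics can be computed by hand for a tiny data set. The helper names (`centroid`, `cluster_metrics`, `silhouette`) and the sample points are hypothetical. Two assumptions are made here: "average distance to other center" is read as the mean distance from each point to every other cluster's centroid, and the silhouette uses the standard per-point definition $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in the same cluster and $b$ is the smallest mean distance to the points of any other cluster.

```python
import math


def centroid(cluster):
    """Mean position of a list of equal-length point tuples."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def cluster_metrics(clusters):
    """Return one dict of distance metrics per cluster."""
    centers = [centroid(c) for c in clusters]
    results = []
    for i, cluster in enumerate(clusters):
        # Distances from each point to its own cluster's centroid.
        own = [math.dist(p, centers[i]) for p in cluster]
        # Distances from each point to every other cluster's centroid.
        other = [
            math.dist(p, centers[j])
            for p in cluster
            for j in range(len(centers))
            if j != i
        ]
        results.append({
            "avg_to_center": sum(own) / len(own),
            "avg_to_other_centers": sum(other) / len(other),
            "max_to_center": max(own),
        })
    return results


def silhouette(clusters):
    """Mean silhouette coefficient over all points."""
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singletons
                continue
            # a: mean distance to the other points in the same cluster
            # (p's distance to itself is 0, so divide by len - 1).
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the points of another cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical, well-separated clusters of 2-D points.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
metrics = cluster_metrics(clusters)
score = silhouette(clusters)
print(metrics[0])
print(round(score, 3))
```

For clusters this far apart, the silhouette lands close to 1; overlapping clusters would pull it toward 0 or below.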