Unsupervised Learning

11 minute read

Details for this project are available on GitHub.

Objective:

Six different algorithms are implemented. The first two are clustering algorithms – k-means clustering and Expectation Maximization – and the last four are dimensionality reduction algorithms – PCA, ICA, Randomized Projections, and Random Forest. The experiments are split into four main parts. All algorithms are evaluated using Scikit-learn in Python.

  • Part 1:
    • Applies clustering on both datasets
  • Part 2:
    • Applies dimensionality reduction algorithms on both datasets
    • Reproduces clustering experiments with dimensionality reduction on both datasets
  • Part 3:
    • Applies dimensionality reduction algorithms to the Faulty Plates dataset
    • Runs a Neural Network on the reduced data
  • Part 4:
    • Applies clustering algorithms and uses the clusters as new features for the Neural Network on the Faulty Plates dataset

Datasets used:

Faulty Steel Plates

  • 1,941 instances; 27 attributes; 7 labels
  • Steel Plates are classified into 7 different faulty categories - Pastry, Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps, and Other_Faults
  • Other_Faults ~ 35%, Bumps ~ 20%, K_Scratch ~ 20%
  • All features are numerical

Breast Cancer

  • 569 instances, 31 attributes, binary label
  • Features are measurements calculated from a digitized image of a breast mass
  • Classifies whether a breast mass is malignant or benign
  • 357 Benign, 212 Malignant
  • All features are real valued

Why are the datasets interesting?

Faulty Steel Plates
Identifying faulty steel plates:

  • Improve safety and reduce costs (return fees)
  • Reduce the number of defective plates used and in circulation
  • Applicable to evaluating other types of defective metals

Breast Cancer
Identifying breast cancer:

  • Breast Cancer affects about 200,000 women a year in the U.S.
  • About 12% of U.S. women develop breast cancer in their lifetime
  • Early identification of malignant breast mass cells using machine learning is very beneficial for prevention and treatment, and can potentially save lives.

Algorithms Used:

  1. Clustering (k-means and EM)
  2. Neural Network

Dimensionality Reduction

  1. Principal Component Analysis (PCA)
  2. Independent Component Analysis (ICA)
  3. Random Projections (RP)
  4. Random Forest (RF)

Part 1: Clustering

Clustering is an unsupervised learning technique that groups a set of observations so that observations within a group are more similar to each other than to those in other groups.

K-Means:

  • Takes in a parameter k, for the number of clusters, and randomly generates k means
  • k clusters are formed by associating each observation with the closest mean
  • Squared Euclidean distance is used as the distance measure
  • The center of each of the clusters becomes the new mean
  • These steps are iterated until convergence is reached
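
A minimal sketch of this loop with scikit-learn's KMeans; the bundled WDBC breast cancer dataset is used as a stand-in, and the scaling and parameter choices are assumptions rather than the exact settings of these experiments:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled WDBC breast cancer dataset
X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# k clusters, randomly initialized means, re-run 10 times to avoid a bad start
kmeans = KMeans(n_clusters=7, init="random", n_init=10, random_state=0).fit(X)

print(kmeans.inertia_)          # within-cluster sum of squared distances (SSE)
print(kmeans.cluster_centers_)  # final cluster means after convergence
```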

Expectation-Maximization:

  • Iterative method using maximum likelihood to find the cluster means
  • Alternates between a soft clustering (Expectation) and computing the means of a soft cluster (Maximization)
  • Expectation calculates the likelihood that the observation is in a certain cluster based on the mean
  • Maximization computes the means from likelihoods, using weighted averages of the data points
  • Each step maximizes the likelihood of the distribution until convergence.
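
The EM counterpart can be sketched with scikit-learn's GaussianMixture (same stand-in data; the covariance type and other settings are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled WDBC breast cancer dataset
X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# E-step soft-assigns points to components, M-step re-estimates the means
gmm = GaussianMixture(n_components=7, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.means_)                # cluster means found by maximum likelihood
print(gmm.predict_proba(X)[:5])  # soft cluster memberships from the E-step
print(gmm.score(X))              # average per-sample log-likelihood
```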

To find a good k:

  • K-means:
    • Sum of Squared Distances within clusters vs # of clusters
      • The elbow point, beyond which SSE stops decreasing sharply, can be used to determine k
  • Expectation-Maximization:
    • Log-likelihood
      • Calculates how likely the data is to have been generated by the estimated parameters
      • A higher likelihood means the data is better explained by the estimated parameters (with a tradeoff against overfitting)
  • Both:
    • Silhouette Score:
      • Takes into account both intra and inter cluster distances
        • Explains how similar an observation is to its own cluster compared to other clusters.
        • Range is from -1 to 1
          • 1 is the best value, meaning the observation matches well with its own cluster and is far from other clusters
          • 0 indicates overlapping clusters
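
A rough sketch of sweeping k and computing all three criteria (the range of k and the stand-in data are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # stand-in data
X = StandardScaler().fit_transform(X)

# For each candidate k, record SSE (k-means), log-likelihood (EM), and silhouette
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k:2d}  SSE={km.inertia_:9.1f}  "
          f"log-likelihood={gmm.score(X):8.2f}  "
          f"silhouette(km)={silhouette_score(X, km.labels_):.3f}  "
          f"silhouette(em)={silhouette_score(X, gmm.predict(X)):.3f}")
```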

Evaluation of clusters:

  1. Accuracy
    • Measures the percentage of predicted labels that match the true labels
  2. Adjusted Mutual Information Score
    • Measures the similarity between two labelings of the same data (here, cluster assignments vs. true labels)
    • Takes into account chance
    • Range is from 0 to 1
      • 1 means that the labels agree
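
One way to compute both metrics for a clustering result; the majority-vote mapping from clusters to labels below is an assumed scoring scheme, not necessarily the exact one used here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, adjusted_mutual_info_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in data with true labels y
X = StandardScaler().fit_transform(X)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Map each cluster to the majority true label inside it, then score as a classifier
predicted = np.empty_like(clusters)
for c in np.unique(clusters):
    predicted[clusters == c] = np.bincount(y[clusters == c]).argmax()

print("accuracy:", accuracy_score(y, predicted))
print("adjusted MI:", adjusted_mutual_info_score(y, clusters))
```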

Faulty Steel Plates: Clustering

Evaluation:

  • K-means: SSE
    • Elbow at k=9
  • EM: Log-Likelihood
    • Spike at k=7 and k=9
  • Silhouette Score
    • Spike at k=9 for k-means, but dip for EM
    • k=6 experiences a spike for both clustering methods
  • Accuracy and Adjusted Mutual Information
    • Peaks at k=9 for both

k=7 makes sense since there are seven different labels for the faulty steel plates dataset. k=9 also makes sense because there is an “other” category that could contain two distinct fault types that are both labeled as “other”.

Breast Cancer: Clustering

Evaluation:

  • K-means: SSE
    • Elbow at k=7
  • EM: Log-Likelihood
    • Spike at k=7
  • Silhouette Score
    • Spike at k=7 for both
  • Accuracy and Adjusted Mutual Information
    • Accuracy highest at k=7 and AdjMI leveled out somewhat at k=7

k=7 could make sense because there could be seven different types of breast mass cells, each of which is identified as either malignant or benign.

Part 2: Dimensionality Reduction and Clustering

Principal Component Analysis (PCA):

  • Principal component analysis applies an orthogonal transformation to find linear combinations of the features (principal components) that maximize variance
  • PCA is used to reduce a large set of features into a subset that still contains most of the information
  • To determine dimensional reduction for PCA, examine:
    • The variance explained by the components
    • Distribution of eigenvalues
    • Elbow method can be used to evaluate the number of principal components to choose
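
A minimal sketch of inspecting these quantities with scikit-learn's PCA; the stand-in data and the 7-component cut (which follows the Breast Cancer section further down) are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # stand-in data
X = StandardScaler().fit_transform(X)

pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance explained
print(pca.explained_variance_)                   # eigenvalue distribution

# Keep only the leading components once the curve levels off (elbow)
X_reduced = PCA(n_components=7).fit_transform(X)
print(X_reduced.shape)
```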

Faulty Steel Plates: PCA

[Figure: PCA explained variance for the Faulty Plates dataset]

  • Most of the variance can be explained by the first 12 principal components
  • Original dataset has 27 features, now reduced to 12 principal components

Faulty Steel Plates: PCA Clustering

Using 12 principal components for clustering:

  • K-means: SSE
    • Slight elbow at k=7
    • Reduces SSE from 24,000 to 14,000, doing well at reducing within-cluster error
  • EM: Log-Likelihood
    • Slight spike at k=7
  • Silhouette Score
    • Low silhouette score for EM shows that it doesn’t do as well as k-means in separating the clusters. Possible overlapping clusters.
    • Slight spike for k=7 for both
  • Accuracy and Adjusted Mutual Information
    • Shows that EM has relatively good prediction accuracy compared to k-means

Good cluster at k=7

Breast Cancer: PCA

[Figure: PCA explained variance for the Breast Cancer dataset]

  • Most of the variance is explained by the first 7 principal components
  • Original dataset has 30 features, now reduced to 7 principal components

Breast Cancer: PCA Clustering

Using 7 principal components for clustering:

  • K-means: SSE
    • Elbow method doesn’t show a clear elbow point
    • SSE shows a reduction of about 1/3 from the original clustering error, meaning PCA reduces within-cluster error, though inter-cluster separation is worse
  • EM: Log-Likelihood
    • Slight spike at k=6
  • Silhouette Score
    • Low silhouette score for EM shows that it doesn’t do as well as k-means in separating the clusters
    • Slight spike for k=6 for k-means
  • Accuracy and Adjusted Mutual Information
    • PCA clustering doesn’t do as well accuracy wise compared to regular clustering

Good cluster at k=6

Independent Component Analysis (ICA):

  • Independent component analysis tries to decompose data into independent non-Gaussian components
  • Maximizes mutual information between the original data and the independent components
  • The sub-components are assumed to be non-Gaussian and independent of each other

Evaluation:

  • The number of independent components to choose can be evaluated by their kurtosis values, since kurtosis measures Gaussianity and ICA tries to maximize non-Gaussianity
  • A kurtosis near 3 indicates a Gaussian distribution, so the number of components is chosen where the mean absolute kurtosis of the components is highest (i.e., furthest from Gaussian)
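
A sketch of this kurtosis-based selection with FastICA; note that scipy's kurtosis defaults to excess kurtosis (0 for a Gaussian rather than 3), so the convention below is an assumption:

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # stand-in data
X = StandardScaler().fit_transform(X)

# For each candidate dimensionality, measure how non-Gaussian the recovered
# components are via their mean absolute excess kurtosis (0 for a Gaussian)
for n in range(2, 16):
    S = FastICA(n_components=n, random_state=0, max_iter=1000).fit_transform(X)
    print(f"n_components={n:2d}  mean |kurtosis| = {np.mean(np.abs(kurtosis(S))):.2f}")
```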

Faulty Steel Plates: ICA

[Figure: ICA kurtosis vs. number of components for the Faulty Plates dataset]

  • Kurtosis peaks at 4 components
  • Original dataset has 27 features and now 4 independent components are selected, a huge reduction in dimensionality

Faulty Steel Plates: ICA Clustering

Using 4 independent components for clustering:

  • K-means: SSE
    • Elbow at k=5
    • The within cluster SSE decreases significantly compared to the original dataset
  • EM: Log-Likelihood
    • Slight spike at k=7
  • Silhouette Score
    • Low silhouette score for EM shows that it doesn’t do as well as k-means in separating the clusters. Possible overlapping clusters.
    • Slight spike for k=7 for both
  • Accuracy and Adjusted Mutual Information
    • Shows that EM has relatively good prediction accuracy compared to k-means
    • Peaks at k=7 for both

Good cluster at k=7

Breast Cancer: ICA

[Figure: ICA kurtosis vs. number of components for the Breast Cancer dataset]

  • Kurtosis keeps increasing, meaning non-gaussianity increases as the number of independent components increases
  • Original dataset has 30 features, but 25 independent components is enough since it starts leveling off

Breast Cancer: ICA Clustering

Using 25 independent components for clustering:

  • K-means: SSE
    • Elbow method doesn’t show a clear dip
    • SSE decreases from the original clustering
  • EM: Log-Likelihood
    • Clear peaks at k=6 and k=8
  • Silhouette Score
    • Produces similar cluster distances for k-means and EM
    • Slight spike for k=6 for k-means
  • Accuracy and Adjusted Mutual Information
    • ICA clustering has less accuracy than PCA and the original dataset
    • EM has highest accuracy at k=7
    • K-means has a peak at k=7

Good clusters at k=6 and k=7

Random Projection (RP):

  • Reduces dimensions randomly using a Gaussian distribution
  • Benefits of random projection are that it’s computationally efficient and works well even at low dimensions
  • May perform pretty poorly based on one random generation, so 10 iterations are run and averaged for evaluation.
  • Goal of random projection:
    • Average Pairwise Distance Correlation: preserve the pairwise distances between any two samples of the dataset, so we want to maximize the variance preserved and the average pairwise distance correlation
    • Average Reconstruction Error: minimize the reconstruction error, which is the squared distance between the original data and the estimate
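
A sketch of both criteria for a Gaussian random projection, averaged over 10 random seeds as described above; the reconstruction via the pseudo-inverse of the projection matrix is an assumed implementation detail:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import GaussianRandomProjection

X, _ = load_breast_cancer(return_X_y=True)  # stand-in data
X = StandardScaler().fit_transform(X)

orig_dists = pdist(X)
for n in (2, 5, 7, 10, 15):
    corrs, errors = [], []
    for seed in range(10):  # average over 10 random projections
        rp = GaussianRandomProjection(n_components=n, random_state=seed)
        X_proj = rp.fit_transform(X)
        # Correlation between original and projected pairwise distances
        corrs.append(np.corrcoef(orig_dists, pdist(X_proj))[0, 1])
        # Reconstruct via the pseudo-inverse of the projection matrix
        X_rec = X_proj @ np.linalg.pinv(rp.components_).T
        errors.append(np.mean((X - X_rec) ** 2))
    print(f"n={n:2d}  distance corr={np.mean(corrs):.3f}  "
          f"reconstruction error={np.mean(errors):.4f}")
```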

Faulty Steel Plates: RP

[Figure: Random Projection pairwise distance correlation and reconstruction error for the Faulty Plates dataset]

  • Average Pairwise Distance Correlation starts leveling off around 7 dimensions
  • The Average Reconstruction error keeps decreasing for higher dimensions
    • Tradeoff between overfitting and computational time

Faulty Steel Plates: RP Clustering

Using 7 dimensions for clustering:

  • K-means: SSE
    • Slight elbow at k=5
    • The within cluster SSE decreases significantly compared to the original dataset
  • EM: Log-Likelihood
    • Slight spike at k=9
  • Silhouette Score
    • Spike at k=7 for both
  • Accuracy and Adjusted Mutual Information
    • Accuracy for EM shows that k=9 is a good cluster
    • Accuracy for k-means shows that k=7 is a good cluster
    • Accuracy for EM with RP at k=9 is the highest so far among all dimensionality reduction algorithms

Good cluster at k=7

Breast Cancer: RP

[Figure: Random Projection pairwise distance correlation and reconstruction error for the Breast Cancer dataset]

  • Average Pairwise Distance Correlation starts leveling off around 7 dimensions
  • Average reconstruction error keeps decreasing for higher dimensions

Breast Cancer: RP Clustering

Using 7 dimensions for clustering:

  • K-means: SSE
    • Elbow at k=5
    • SSE decreases from the original clustering due to dim reduction
  • EM: Log-Likelihood
    • Peak around k=8
  • Silhouette Score
    • On average has improved for k-means from the original clustering
    • Peak at k=8 for EM
    • Peak at k=5 for k-means
  • Accuracy and Adjusted Mutual Information
    • Accuracy is much lower than the original dataset
    • Peaks at k=5

Good cluster at k=5

Random Forest (RF):

  • Strong learner that is an ensemble of weak learner decision trees
  • Feature selection for Random Forest is based on feature importance
  • Feature importance is measured by Gini importance, which is the total decrease in node impurity (weighted by the probability of reaching that node), averaged over all trees of the ensemble.
    • The higher the Gini importance value, the more important the feature
    • Elbow method used to evaluate the number of features to use
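
A sketch of this feature-selection step; the number of trees and the cutoff of 10 features are assumptions (10 matches the Breast Cancer result further down):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data

# Fit the forest and rank features by Gini importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]
print(rf.feature_importances_[ranked])  # sorted importances; look for the elbow

# Keep only the top-ranked features for the later clustering / NN experiments
X_reduced = X[:, ranked[:10]]
print(X_reduced.shape)
```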

Faulty Steel Plates: RF

[Figure: Random Forest Gini feature importance for the Faulty Plates dataset]

  • Faulty Plates dataset has a lot of important features, with a sharp decline in importance after 25 features
  • Decrease in dimensionality from 27 to 25

Faulty Steel Plates: RF Clustering

Using 25 dimensions for clustering:

  • K-means: SSE
    • Doesn’t really have an elbow point
    • Doesn’t decrease within cluster error in k-means as much as the other algorithms
  • EM: Log-Likelihood
    • Peak at k=7
  • Silhouette Score
    • k=7 for k-means
    • k=8 for EM
  • Accuracy and Adjusted Mutual Information
    • Accuracy shows that k=9 and k=4 are good clusters

Good clusters at k=4 and k=9

Breast Cancer: RF

[Figure: Random Forest Gini feature importance for the Breast Cancer dataset]

  • 10 important features, because Gini Importance starts to level off after that
  • Features reduced from 30 to 10 using Random Forest

Breast Cancer: RF Clustering

Using 10 dimensions for clustering:

  • K-means: SSE
    • Elbow at k=7
  • EM: Log-Likelihood
    • Leveling out around k=9
  • Silhouette Score
    • Shows a consistent decrease for both
    • As the number of clusters gets larger, the cluster distances for both start decreasing, making it hard to tell what a good k is
  • Accuracy and Adjusted Mutual Information
    • Accuracy shows that k=3, 6, and 10 are good clusters

Good clusters at k=3, 6, and 10

Part 3: Dimensionality Reduction Algorithms with Neural Network

  • The Faulty Plates dataset is used to evaluate the dimensionality reduction algorithms as preprocessing for the Neural Network, compared to the baseline Neural Network
  • The baseline Neural Network accuracy is 71.9%, with hyperparameters of learning rate = 0.1, momentum = 0.3, and one hidden layer with 14 neurons.

[Figure: Neural Network accuracy with dimensionality-reduced inputs vs. the benchmark]

  • Compares accuracy of different reduction algorithms on the Neural Network to the benchmark Neural Network
  • Accuracy shows that PCA and Random Projection perform well on the NN at higher components
  • With around 15 components, the accuracy is close to the benchmark
  • Dimensions are much lower than the original 27 features, so it simplifies model complexity
  • Compute times are similar between the benchmark, random projection, and PCA.
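
A sketch of this pipeline, assuming scikit-learn's MLPClassifier with SGD approximates the baseline network (learning rate 0.1, momentum 0.3, one hidden layer of 14 neurons); the bundled breast cancer data stands in for the Faulty Plates CSV:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Faulty Plates data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def make_nn():
    # Baseline hyperparameters: lr = 0.1, momentum = 0.3, one hidden layer of 14
    return MLPClassifier(hidden_layer_sizes=(14,), solver="sgd",
                         learning_rate_init=0.1, momentum=0.3,
                         max_iter=1000, random_state=0)

baseline = make_pipeline(StandardScaler(), make_nn()).fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

reduced = make_pipeline(StandardScaler(), PCA(n_components=15), make_nn())
reduced.fit(X_train, y_train)
print("PCA (15 components) accuracy:", reduced.score(X_test, y_test))
```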

Part 4: Cluster Features with Neural Network

  • Clusters are used as features for the neural network
  • Baseline is the same as above
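
A sketch of using one-hot-encoded cluster assignments as the network's only input features (stand-in data again; k = 6 follows the result noted below):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Faulty Plates data
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cluster the training data, then replace the original features with
# one-hot encodings of each observation's cluster assignment
k = 6
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
F_train = np.eye(k)[km.labels_]
F_test = np.eye(k)[km.predict(X_test)]

nn = MLPClassifier(hidden_layer_sizes=(14,), solver="sgd",
                   learning_rate_init=0.1, momentum=0.3,
                   max_iter=1000, random_state=0)
nn.fit(F_train, y_train)
print("cluster-feature accuracy:", nn.score(F_test, y_test))
```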

[Figure: Neural Network accuracy with cluster features]

  • When clusters are used as features, the network doesn’t do a good job of predicting the labels
  • Performs far below benchmark accuracy of 71.9%
  • Seems like accuracy is best for EM and k-means when k=6

Conclusion:

The number of dimensions chosen by each dimensionality reduction algorithm in Part 2, along with computational times, were as follows:

[Table: chosen dimensions and computational times for each dimensionality reduction algorithm]

  • Computational times were similar for PCA and ICA for the dimensions chosen
  • Random Projection did best in computational time and at lower dimensions
  • Random Forest used as feature selection takes too much computational time

Most of the dimensionality reduction algorithms do help in reducing SSE within clusters. PCA and ICA do not increase the intra-cluster distance, but Random Projection and Random Forest do.