
Explore the effectiveness of data classification with PCA

Principal component analysis (PCA) is a great tool for data scientists. It can be used to reduce the dimensionality of the feature space and to generate uncorrelated features. As we'll see, it can also help you gain insight into the classification power of your data. We'll walk through how to use PCA in this way. Python code snippets are provided, and the full project can be found on GitHub^1.

What is PCA?

Let's start with the theory. I won't go into too much detail, as there are plenty of great resources out there if you want to understand how PCA works^2^3. The important thing to know is that PCA is a dimensionality reduction algorithm: it reduces the number of features used to train a model. It achieves this by constructing principal components (PCs) from the original features.

The PCs are constructed so that the first PC, PC1, explains as much of the variation in the features as possible. PC2 then explains as much of the remaining variation as possible, and so on. As a result, PC1 and PC2 together can often explain a large portion of the total feature variation. Another way to think about it is that the first two PCs usually summarize the features well. This is important because it allows us to visualize the classification power of the data on a two-dimensional plane.
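As a minimal illustration of this idea, the sketch below fits PCA to some randomly generated, highly correlated toy data (not the dataset used in this article) and prints the proportion of variance explained by each PC using scikit-learn's explained_variance_ratio_ attribute.

import numpy as np
from sklearn.decomposition import PCA

# Toy example: 5 features that are all noisy copies of one underlying signal
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(500, 1)) for _ in range(5)])

# Fit PCA and inspect how much variance each PC explains (the ratios sum to 1)
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # PC1 dominates because the features are highly correlated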


The dataset

Okay, let's dive into a practical example. We will use PCA to explore the breast cancer dataset^4, which we load with the code below. The target variable is the result of the breast cancer test – malignant or benign. Each test samples a number of cells, and 10 different measurements are taken from each cell, including things like cell radius and cell symmetry. To get the final list of 30 features, these measurements are aggregated in 3 ways: we calculate the mean, standard error, and largest ("worst") value of each measurement. In Figure 1, we take a closer look at two of these features – the mean symmetry and the worst smoothness of the cells.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

data = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
data['y'] = cancer['target']           
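As a quick sanity check (not part of the original snippet), we can confirm the size of the data frame and the class encoding used by scikit-learn.

# 569 samples, 30 features plus the target column 'y'
print(data.shape)

# The 30 feature names: 10 measurements x {mean, error, worst}
print(cancer['feature_names'])

# Class encoding in scikit-learn: 0 = malignant, 1 = benign
print(data['y'].value_counts())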

In Figure 1, we can see that these two features help distinguish between the two classes. That is, benign tumors tend to be more symmetrical and smoother. There is still a lot of overlap, though, so a model using only these two features would not perform well. We could create charts like this to understand the predictive power of each individual feature, but with 30 features that is a lot of charts to analyze, and they would still tell us nothing about the predictive power of the dataset as a whole. This is where PCA comes in.
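The plotting code for Figure 1 is not included in the article, but a sketch along the following lines (assuming matplotlib) would produce a similar chart.

import matplotlib.pyplot as plt

# Scatter plot of two raw features, coloured by class
plt.figure(figsize=(10, 10))
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
plt.scatter(data['mean symmetry'], data['worst smoothness'],
            c=colour, edgecolors='#000000')
plt.xlabel('mean symmetry', size=20)
plt.ylabel('worst smoothness', size=20)
plt.show()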


Figure 1: A scatter plot using two features

PCA – the entire dataset

Let's start by performing PCA on the entire dataset, using the code below. First, we scale the features so that they all have a mean of 0 and a variance of 1. This matters because PCA works by maximizing the variance explained by the PCs, and some features have a higher variance simply because of their scale. For example, the variance of a distance measured in centimetres will be higher than that of the same distance measured in kilometres. Without scaling, PCA would be "overwhelmed" by the features with large variances.

Once the scaling is done, we fit the PCA model and transform the features into PCs. Since we have 30 features, there can be up to 30 PCs, but for our visualization we are only interested in the first two. The result is shown in Figure 2, a scatter plot of PC1 against PC2. We can now see two distinct clusters, which are much clearer than in Figure 1.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#Scale the features (excluding the target column) to zero mean and unit variance
features = data.drop('y', axis=1)
scaler = StandardScaler()
scaler.fit(features)
scaled = scaler.transform(features)

#Obtain principal components
pca = PCA().fit(scaled)

pc = pca.transform(scaled)
pc1 = pc[:, 0]
pc2 = pc[:, 1]

#Plot principal components
plt.figure(figsize=(10, 10))

colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
plt.scatter(pc1, pc2, c=colour, edgecolors='#000000')
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel('PC1', size=20)
plt.ylabel('PC2', size=20)

This graph can be used to visualize the predictive strength of the data. In this case, it suggests that using the entire dataset will allow us to distinguish between malignant and benign tumors. There are still some outliers (i.e., points that do not sit clearly within either cluster), but this does not necessarily mean we will make wrong predictions in these cases. We should keep in mind that not all of the feature variance is captured by the first two PCs, so a model trained on the full set of features can still produce better predictions.


Figure 2: PCA scatter plot using all features

At this point, we should mention a caveat of this approach. PC1 and PC2 can often explain a large portion of the variance in the features, but this is not always true. In some cases, the PCs are a poor summary of the features. This means that, even if your data can separate the classes well, you may not get clear clusters like the ones in Figure 2.

We can use a PCA scree plot to determine whether this is likely to be an issue. The code below creates the scree plot for this analysis, shown in Figure 3. It is a bar chart where the height of each bar is the proportion of variance explained by the corresponding PC. We see that PC1 and PC2 together explain only about 20% of the feature variance, and yet we still get two distinct clusters. This underscores the predictive power of the data.

var = pca.explained_variance_ratio_[0:10] #proportion of variance explained
labels = ['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10']

plt.figure(figsize=(15,7))
plt.bar(labels,var)
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')

Figure 3: Scree plot
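As a complementary numerical check (not shown in the original article), we can also print the cumulative proportion of variance explained by the first few PCs, using the pca object fitted above.

# Cumulative proportion of variance explained by the first 10 PCs
cumulative = np.cumsum(pca.explained_variance_ratio_[0:10])
for i, c in enumerate(cumulative, start=1):
    print("PC1-PC{}: {:.1%}".format(i, c))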

PCA – Feature groups

We can also use this procedure to compare different sets of features. For example, suppose we have two groups of features: group 1 contains all the features based on cell symmetry and smoothness, while group 2 contains all the features based on perimeter and concavity. We can use PCA to visualize which group is better suited for making predictions.

group_1 = ['mean symmetry', 'symmetry error', 'worst symmetry',
           'mean smoothness', 'smoothness error', 'worst smoothness']

group_2 = ['mean perimeter', 'perimeter error', 'worst perimeter',
           'mean concavity', 'concavity error', 'worst concavity']

We start by creating the two feature groups, then perform PCA on each group separately. This gives us two sets of PCs, and we use PC1 and PC2 of each group to represent that group. The results of this process can be seen in Figure 4.
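The code used to produce Figure 4 is not shown in the article. A minimal sketch along these lines, reusing the scaling-then-PCA pattern from above, should give a similar pair of plots.

# Run PCA separately on each feature group and plot PC1 vs PC2
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

for ax, g, title in zip(axes, [group_1, group_2], ['Group 1', 'Group 2']):
    scaled_g = StandardScaler().fit_transform(data[g])
    pc_g = PCA().fit_transform(scaled_g)
    ax.scatter(pc_g[:, 0], pc_g[:, 1], c=colour, edgecolors='#000000')
    ax.set_title(title, size=20)
    ax.set_xlabel('PC1', size=20)
    ax.set_ylabel('PC2', size=20)

plt.show()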

For group 1, we can see some separation, but there is still a lot of overlap. In contrast, group 2 shows two distinct clusters. From these plots, we would therefore expect the group 2 features to be the better predictors: a model trained on the group 2 features should achieve higher accuracy than one trained on the group 1 features. Let's test this hypothesis.


Figure 4: PCA scatter plot using feature groups

We use the code below to train a logistic regression model on each feature group. In each case, we use 70% of the data to train the model and the remaining 30% to test it. The test-set accuracy for group 1 is 74%, compared to 97% for group 2. So the group 2 features are indeed the better predictors, which is exactly what the PCA results suggested.

from sklearn.model_selection import train_test_split
import sklearn.metrics as metric
import statsmodels.api as sm

#Fit a separate logistic regression model for each feature group
group = [group_1, group_2]

for i, g in enumerate(group):

    x = data[g]
    x = sm.add_constant(x)
    y = data['y']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                        random_state=101)

    model = sm.Logit(y_train, x_train).fit() #fit logistic regression model

    predictions = np.around(model.predict(x_test))
    accuracy = metric.accuracy_score(y_test, predictions)

    print("Accuracy of Group {}: {}".format(i+1, accuracy))

---
Optimization terminated successfully.
         Current function value: 0.458884
         Iterations 7
Accuracy of Group 1: 0.7368421052631579
Optimization terminated successfully.
         Current function value: 0.103458
         Iterations 10
Accuracy of Group 2: 0.9707602339181286           

To sum up, we have seen how PCA can be used to gain a better understanding of your data before you start modeling. It gives you an idea of the classification accuracy you can expect and helps you build intuition about which features are predictive. This can give you an edge in feature selection.

As mentioned above, this approach is not foolproof. It should be used in conjunction with other data exploration graphs and summary statistics. For classification problems, these may include information values and boxplots. In general, it's a good idea to look at the data from as many different angles as possible before you start modeling.

References

[^2]: Matt Brems, A One-Stop Shop for Principal Component Analysis (2017), https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

[^3]: L. Pachter, What is principal component analysis? (2014), https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/

[^4]: UCI, Breast Cancer Wisconsin (Diagnostic) Dataset (2020), http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)