Hi All,

I just posted the results of PCA on the "Rock Type Classification" problem. In addition to the t-SNE algorithm, we can use the PCA technique to visualize the separation of the data points in 2D as well as in 3D.

PCA projects the data into a low-dimensional space (2D or 3D) that is easy to plot. The advantage is that we can see how the data points are spread with respect to the features in our dataset. However, keep in mind that PCA is a linear projection, so it only captures linear structure and works best when the groups are (roughly) linearly separable.

PCA coupled with the cluster labels generated by the clustering step helps us understand the data much better. The distance between points can also be interpreted with respect to the features.
Since our data appears to be linearly separable, PCA is a reasonable choice for our problem.
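
For anyone who wants to see what the projection does under the hood, here is a minimal sketch using plain NumPy. It is not the code from the pull request and it does not use the `pca` package; the function name `pca_project` and the random input data are assumptions made up for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature so that differences in units do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T          # coordinates to plot in 2D/3D
    explained = (S ** 2) / np.sum(S ** 2)     # fraction of variance per component
    return scores, Vt[:n_components], explained[:n_components]

# Synthetic stand-in data (NOT the rock type dataset): 200 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)  # make two features correlated

scores, components, explained = pca_project(X, n_components=2)
print(scores.shape)       # one 2D point per sample
print(explained.sum())    # share of total variance kept by the 2D view
```

The explained-variance share is worth checking before trusting a 2D plot: if the first two components keep only a small part of the variance, the picture can be misleading.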

You can view the pull request here: https://github.com/Integradas/RockTypeClassification/pull/2

Dependencies - pip install pca

Posting the results I got from the PCA. Interpretation below:


1. Towards the direction of an arrow: Moving in the direction of a feature's arrowhead means that the values of that feature increase. For example, the Rock Type 1 data points lie in the direction of DensityPorosity, which means that Rock Type 1 has high DensityPorosity values.

2. Opposite to the direction of the arrow: Moving opposite to the direction of a feature's arrowhead means that the data point has low values for that feature. For example, Rock Type 1 has low values for Resistivity.

3. Feature arrows orthogonal to each other: Orthogonal arrows mean that the features are uncorrelated (at least in the plane shown). For example, Shale Volume is not related to Sonic.

4. Two feature arrows in the same direction: This means the features are highly positively correlated, for example Shale Volume and GammaRay. Since such features carry largely redundant information, consider dropping one of them for classification problems (if we use the labels created by this unsupervised learning to predict rock types for future data points).

5. Two feature arrows in opposite directions: This means the features are negatively correlated. For example, Resistivity is negatively correlated with DensityPorosity.
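
The arrow rules above can be checked numerically. Below is a rough NumPy sketch, not the code behind the plots in the pull request: the data is synthetic, built from two made-up latent factors so that it mimics the relationships described in points 3 to 5, and the helper `arrow_cosine` is a hypothetical name. The cosine of the angle between two arrows approximates the correlation between the two features, provided the first two components capture most of the variance.

```python
import numpy as np

# Synthetic data (an assumption for illustration, NOT real well-log values):
# two independent latent factors drive five features.
rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)

def noise():
    return 0.1 * rng.normal(size=n)

features = ["GammaRay", "ShaleVolume", "Sonic", "DensityPorosity", "Resistivity"]
X = np.column_stack([a + noise(), a + noise(), b + noise(), b + noise(), -b + noise()])

# Standardize, then take the SVD; the rows of Vt are the principal axes.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Biplot arrows: loadings on PC1/PC2 scaled by the component standard deviations,
# so each coordinate is the correlation between a feature and a component.
arrows = (Vt[:2].T * S[:2]) / np.sqrt(n - 1)   # shape (5, 2)

def arrow_cosine(f1, f2):
    """Cosine of the angle between the 2D arrows of two features."""
    u = arrows[features.index(f1)]
    v = arrows[features.index(f2)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(arrow_cosine("ShaleVolume", "GammaRay"))         # same direction: close to +1
print(arrow_cosine("Resistivity", "DensityPorosity"))  # opposite direction: close to -1
print(arrow_cosine("ShaleVolume", "Sonic"))            # orthogonal: close to 0
```

A cosine near +1 corresponds to point 4, near -1 to point 5, and near 0 to point 3.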

I'd love to hear what you think of this approach.