Supervised learning#
We have the human-operator labels for our data, so we can train models to predict those labels instead of using them only to check our model, as we did before.
Examples of supervised learning include linear regression, support vector machines, neural networks (including deep learning), and more.
The simplest possible supervised learning model for our classification problems is a k-nearest-neighbors classifier (kNN).
kNN assigns labels based on the labels of a sample’s k nearest neighbors. Let’s try it out!
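To make the idea concrete before we load the real data, here is a hand-rolled sketch with made-up points and labels: measure the distance from a query point to every labeled sample, then take a majority vote among the k closest.

import numpy as np

# Five labeled points in 2-D (hypothetical values).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
labels = np.array(["clean", "clean", "yellow", "yellow", "clean"])

query, k = np.array([0.15, 0.15]), 3

# Euclidean distances, indices of the k nearest, then a majority vote.
distances = np.linalg.norm(points - query, axis=1)
nearest = np.argsort(distances)[:k]
values, counts = np.unique(labels[nearest], return_counts=True)
values[np.argmax(counts)]  # -> 'clean'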
import pandas as pd

# Load the per-image RGB values and visual classes; drop incomplete rows.
df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)
We already know that the size is mostly random, so let's drop it here and keep only the color channels.
# The RGB channels are the features; the yellowing class is the target.
feature_columns = ["r", "g", "b"]
X = df[feature_columns].values
y = df["yellowing"].values
from sklearn import neighbors

# Fit a kNN classifier (k=5 by default) and predict on the same data.
knn = neighbors.KNeighborsClassifier()
knn.fit(X, y)

prediction = knn.predict(X)
prediction
from sklearn import metrics

# Fraction of predictions that match the operator labels.
metrics.accuracy_score(y, prediction)
Quite the improvement over our k-means attempt. Aren't we forgetting anything, though? Yes, we should always standardize the data!
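A quick aside on why that matters for a distance-based model like kNN, with made-up numbers: when one feature spans a much larger range than the others, it dominates the Euclidean distance and effectively silences the rest.

import numpy as np

# Feature 1 spans [0, 1]; feature 2 spans [0, 1000] (hypothetical values).
a = np.array([0.1, 500.0])
b = np.array([0.9, 510.0])

# Unscaled, the distance is driven almost entirely by feature 2.
np.linalg.norm(a - b)  # ~10.03; the 0.8 gap in feature 1 barely registers

# On comparable scales, feature 1 contributes again.
np.linalg.norm(np.array([0.1, 0.5]) - np.array([0.9, 0.51]))  # ~0.80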
from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Refit and re-score on the standardized features.
knn = neighbors.KNeighborsClassifier()
knn.fit(X_scaled, y)
prediction_scaled = knn.predict(X_scaled)
metrics.accuracy_score(y, prediction_scaled)
The lower score suggests that we were overfitting before standardizing. Still, ~78% is much better than our k-means result.
import seaborn as sns

# Attach the predictions and inspect them visually against the RGB features.
redux = df[["r", "g", "b", "yellowing"]]
redux = redux.assign(knn=prediction_scaled)
sns.pairplot(redux, hue="knn", vars=feature_columns)
How can we stop forgetting to standardize the data? Well, scikit-learn is awesome and has our back: we can build data-processing pipelines that keep all the steps of our model in a single object. Pipelines are very robust and may contain custom steps if your data requires them (we sketch one after the basic pipeline below).
from sklearn import pipeline

# Bundle scaling and classification into a single estimator.
classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier.fit(X, y)
prediction_pipeline = classifier.predict(X)
metrics.accuracy_score(y, prediction_pipeline)
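If the data ever required a custom step, say clipping the color channels into the valid 0-255 range before scaling (a hypothetical example; our data may not need it), preprocessing.FunctionTransformer can wrap any function as a pipeline stage:

import numpy as np

# Hypothetical custom step: clip the color channels into the valid 0-255 range.
clipper = preprocessing.FunctionTransformer(lambda values: np.clip(values, 0, 255))

classifier_custom = pipeline.make_pipeline(
    clipper,
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier_custom.fit(X, y)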
Validation#

So far we scored the classifier on the same data it was trained on, which gives an optimistic number. Let's hold out part of the data to test on samples the model has never seen.
from sklearn import model_selection

# Hold out 25% of the samples (the default) as a test set.
split = model_selection.train_test_split(X, y)
X_train, X_test, y_train, y_test = split
X_train.shape, X_test.shape

# Retrain on the training set only and score on the unseen test set.
classifier.fit(X_train, y_train)
metrics.accuracy_score(y_test, classifier.predict(X_test))

# 5-fold cross-validation: five different train/test splits, five scores.
scores = model_selection.cross_val_score(classifier, X, y)
scores

scores.mean()
Our accuracy dropped when we performed a train/test split. Why did that happen? A first guess is that our model may be “data hungry”: we just don't have enough samples in each class to predict them well.
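One way to probe the “data hungry” guess is a learning curve: train the pipeline on growing fractions of the data and watch the validation score. Here is a minimal sketch using our classifier pipeline from above; if the scores are still climbing at the full dataset size, more data would likely help.

import numpy as np

# Cross-validated scores at increasing training-set sizes.
sizes, train_scores, test_scores = model_selection.learning_curve(
    classifier, X, y, train_sizes=np.linspace(0.2, 1.0, 5)
)
sizes, test_scores.mean(axis=1)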
df["yellowing"].value_counts().plot.barh(title="yellowing")
The data is unbalanced! What can we do next?
- Try to balance the current data;
- Collect more data and see if the dataset balances itself out;
- Choose a technique that is more robust to unbalanced data, like Decision Trees (DT); a sketch follows below.
PS: Check this awesome paper on DTs.
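Here is a minimal sketch of that third option, swapping kNN for scikit-learn's DecisionTreeClassifier (the fixed random_state is only an illustrative choice, for reproducibility):

from sklearn import model_selection, tree

# Decision trees split on one feature at a time, so they don't require scaling.
dt = tree.DecisionTreeClassifier(random_state=42)
model_selection.cross_val_score(dt, X, y).mean()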