Supervised learning#
We have the human-operator labels for our data, so we can train models to predict those labels instead of using them only to check our model, as we did before.
Examples of supervised learning include linear regression, support vector machines, neural networks (including deep learning), and more.
The simplest possible supervised learning model for our classification problems is a k-nearest-neighbors classifier (kNN).
kNN assigns labels based on the labels of a sample’s k nearest neighbors. Let’s try it out!
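To make the idea concrete before we load the real data, here is a hand-rolled sketch with made-up points and labels: measure the distance from a query point to every labeled sample, then take a majority vote among the k closest.

import numpy as np

# Five labeled points in 2-D (hypothetical values).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
labels = np.array(["clean", "clean", "yellow", "yellow", "clean"])

query, k = np.array([0.15, 0.15]), 3

# Euclidean distances, indices of the k nearest, then a majority vote.
distances = np.linalg.norm(points - query, axis=1)
nearest = np.argsort(distances)[:k]
values, counts = np.unique(labels[nearest], return_counts=True)
values[np.argmax(counts)]  # -> 'clean'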
import pandas as pd

# Load the per-image RGB values and visual classes; drop incomplete rows.
df = pd.read_csv("pellets-visual-classes-rgb.csv", index_col="image").dropna()
df["yellowing index"] = df["yellowing index"].astype(int)
We already know that the size is mostly random, so let's drop it here and keep only the color channels.
# The RGB channels are the features; the yellowing class is the target.
feature_columns = ["r", "g", "b"]
X = df[feature_columns].values
y = df["yellowing"].values
from sklearn import neighbors

# Fit a kNN classifier (k=5 by default) and predict on the same data.
knn = neighbors.KNeighborsClassifier()
knn.fit(X, y)

prediction = knn.predict(X)
prediction
from sklearn import metrics

# Fraction of predictions that match the operator labels.
metrics.accuracy_score(y, prediction)
Quite the improvement over our k-means attempt. Aren't we forgetting anything, though? Yes, we should always standardize the data!
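A quick aside on why that matters for a distance-based model like kNN, with made-up numbers: when one feature spans a much larger range than the others, it dominates the Euclidean distance and effectively silences the rest.

import numpy as np

# Feature 1 spans [0, 1]; feature 2 spans [0, 1000] (hypothetical values).
a = np.array([0.1, 500.0])
b = np.array([0.9, 510.0])

# Unscaled, the distance is driven almost entirely by feature 2.
np.linalg.norm(a - b)  # ~10.03; the 0.8 gap in feature 1 barely registers

# On comparable scales, feature 1 contributes again.
np.linalg.norm(np.array([0.1, 0.5]) - np.array([0.9, 0.51]))  # ~0.80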
from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Refit and re-score on the standardized features.
knn = neighbors.KNeighborsClassifier()
knn.fit(X_scaled, y)
prediction_scaled = knn.predict(X_scaled)
metrics.accuracy_score(y, prediction_scaled)
The lower score suggests that we were overfitting before standardizing. Still, ~78% is much better than our k-means result.
import seaborn as sns

# Attach the predictions and inspect them visually against the RGB features.
redux = df[["r", "g", "b", "yellowing"]]
redux = redux.assign(knn=prediction_scaled)
sns.pairplot(redux, hue="knn", vars=feature_columns)
How can we stop forgetting to standardize the data? Well, scikit-learn is awesome and has our back: we can build data-processing pipelines that keep all the steps of our model in a single object. Pipelines are very robust and may contain custom steps if your data requires them (we sketch one after the basic pipeline below).
from sklearn import pipeline

# Bundle scaling and classification into a single estimator.
classifier = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier.fit(X, y)
prediction_pipeline = classifier.predict(X)
metrics.accuracy_score(y, prediction_pipeline)
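If the data ever required a custom step, say clipping the color channels into the valid 0-255 range before scaling (a hypothetical example; our data may not need it), preprocessing.FunctionTransformer can wrap any function as a pipeline stage:

import numpy as np

# Hypothetical custom step: clip the color channels into the valid 0-255 range.
clipper = preprocessing.FunctionTransformer(lambda values: np.clip(values, 0, 255))

classifier_custom = pipeline.make_pipeline(
    clipper,
    preprocessing.StandardScaler(),
    neighbors.KNeighborsClassifier(),
)
classifier_custom.fit(X, y)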
Validation#

So far we scored the classifier on the same data it was trained on, which gives an optimistic number. Let's hold out part of the data to test on samples the model has never seen.
from sklearn import model_selection

# Hold out 25% of the samples (the default) as a test set.
split = model_selection.train_test_split(X, y)
X_train, X_test, y_train, y_test = split
X_train.shape, X_test.shape

# Retrain on the training set only and score on the unseen test set.
classifier.fit(X_train, y_train)
metrics.accuracy_score(y_test, classifier.predict(X_test))

# 5-fold cross-validation: five different train/test splits, five scores.
scores = model_selection.cross_val_score(classifier, X, y)
scores

scores.mean()
Our accuracy dropped when we performed a train/test split. Why did that happen? A first guess is that our model may be “data hungry”: we just don't have enough samples in each class to predict them well.
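One way to probe the “data hungry” guess is a learning curve: train the pipeline on growing fractions of the data and watch the validation score. Here is a minimal sketch using our classifier pipeline from above; if the scores are still climbing at the full dataset size, more data would likely help.

import numpy as np

# Cross-validated scores at increasing training-set sizes.
sizes, train_scores, test_scores = model_selection.learning_curve(
    classifier, X, y, train_sizes=np.linspace(0.2, 1.0, 5)
)
sizes, test_scores.mean(axis=1)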
df["yellowing"].value_counts().plot.barh(title="yellowing")
The data is unbalanced! What can we do next?
- Try to balance the current data;
- Collect more data and see if the dataset balances itself out;
- Choose a technique that is more robust to unbalanced data, like Decision Trees (DT); a sketch follows below.
PS: Check this awesome paper on DTs.
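Here is a minimal sketch of that third option, swapping kNN for scikit-learn's DecisionTreeClassifier (the fixed random_state is only an illustrative choice, for reproducibility):

from sklearn import model_selection, tree

# Decision trees split on one feature at a time, so they don't require scaling.
dt = tree.DecisionTreeClassifier(random_state=42)
model_selection.cross_val_score(dt, X, y).mean()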