
How it works...
The encoder creates additional features for each categorical variable and returns a sparse matrix. The result is sparse by construction: each row of the new features is 0 everywhere except in the single column corresponding to that row's category, so it makes sense to store the data in a sparse matrix. The cat_encoder is now a standard fitted scikit-learn transformer, which means it can be reused:
cat_encoder.transform(np.ones((3, 1))).toarray()
array([[ 0., 1., 0.],
       [ 0., 1., 0.],
       [ 0., 1., 0.]])
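If you skip the toarray() call, you can inspect the sparse storage directly. As a minimal check (assuming cat_encoder has been fit as earlier in this recipe), the transform output is a SciPy sparse matrix with exactly one nonzero entry per row:
import numpy as np

sparse_result = cat_encoder.transform(np.ones((3, 1)))
print(type(sparse_result))   # a scipy.sparse matrix class
print(sparse_result.nnz)     # 3: one nonzero entry per row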
In the previous chapter, we turned a classification problem into a regression problem. Here, the encoded target has three columns:
- The first column is 1 if the flower is a Setosa and 0 otherwise
- The second column is 1 if the flower is a Versicolor and 0 otherwise
- The third column is 1 if the flower is a Virginica and 0 otherwise
Thus, we could use any of these three columns to create a regression similar to the one in the previous chapter. For example, regressing on the first column predicts the degree of Setosa-ness of a flower as a real number; the matching statement in classification, binary classification on that same column, is whether or not a flower is a Setosa.
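As a minimal sketch of that single-column regression (assuming X and y are the iris features and target from earlier in the recipe, with Setosa encoded as class 0; the names here are illustrative):
import numpy as np
from sklearn.linear_model import Ridge

# The first one-hot column: 1.0 for a Setosa, 0.0 otherwise
y_setosa = (y == 0).astype(np.float64)
setosa_reg = Ridge()
setosa_reg.fit(X, y_setosa)
print(setosa_reg.predict(X[:5]))   # real-valued degree of Setosa-ness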
scikit-learn has the capacity for this type of multi-output regression; compare it with multiclass classification, where a single categorical target is predicted instead. Let's try a simple example.
Import the Ridge regularized linear regression model. It tends to be well behaved because of its regularization. Instantiate a ridge regressor class:
from sklearn.linear_model import Ridge
ridge_inst = Ridge()
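For reference, ridge regression minimizes the usual squared error plus an L2 penalty on the coefficient vector w, which is what keeps it well behaved:
minimize ||y - Xw||^2 + alpha * ||w||^2
The regularization strength alpha defaults to 1.0.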
Now import a multi-output regressor that takes the ridge regressor instance as an argument:
from sklearn.multioutput import MultiOutputRegressor
multi_ridge = MultiOutputRegressor(ridge_inst, n_jobs=-1)
As earlier in this recipe, transform the target variable y into a three-column target variable, y_multi, with OneHotEncoder(). If X and y were part of a pipeline, the pipeline would transform the training and testing sets separately, which is preferable:
from sklearn import preprocessing
cat_encoder = preprocessing.OneHotEncoder()
y_multi = cat_encoder.fit_transform(y.reshape(-1, 1)).toarray()
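As a minimal sketch of that leakage-free idea without building a full pipeline (the variable names here are illustrative, not part of the recipe), fit the encoder on the training targets only and reuse it on the test targets:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

y_tr, y_te = train_test_split(y, random_state=7)
enc = preprocessing.OneHotEncoder()
y_tr_multi = enc.fit_transform(y_tr.reshape(-1, 1)).toarray()   # fit on training targets only
y_te_multi = enc.transform(y_te.reshape(-1, 1)).toarray()       # reuse on test targets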
Create training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_multi, stratify=y, random_state=7)
Fit the multi-output estimator:
multi_ridge.fit(X_train, y_train)
MultiOutputRegressor(estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                                     max_iter=None, normalize=False,
                                     random_state=None, solver='auto',
                                     tol=0.001),
                     n_jobs=-1)
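Under the hood, MultiOutputRegressor fits one clone of the base estimator per target column; after fitting, the individual regressors are available on the estimators_ attribute:
print(len(multi_ridge.estimators_))   # 3: one Ridge model per flower column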
Predict the multi-output target on the testing set:
y_multi_pre = multi_ridge.predict(X_test)
y_multi_pre[:5]
array([[ 0.81689644,  0.36563058, -0.18252702],
       [ 0.95554968,  0.17211249, -0.12766217],
       [-0.01674023,  0.36661987,  0.65012036],
       [ 0.17872673,  0.474319  ,  0.34695427],
       [ 0.8792691 ,  0.14446485, -0.02373395]])
Use the binarize function from the previous recipe to turn each real number into 0 or 1:
from sklearn import preprocessing
y_multi_pred = preprocessing.binarize(y_multi_pre, threshold=0.5)
y_multi_pred[:5]
array([[ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 0., 0., 1.],
       [ 0., 0., 0.],
       [ 1., 0., 0.]])
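Note that thresholding each column independently can leave a row with no 1 at all, as in the fourth row above, or with more than one. If you want exactly one predicted class per row instead, one alternative (not part of the original recipe) is the argmax across the columns:
import numpy as np

y_class_pred = np.argmax(y_multi_pre, axis=1)
print(y_class_pred[:5])   # [0 0 2 1 0] for the predictions shown above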
We can measure the overall multi-output performance with the roc_auc_score function:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_multi_pre)
0.91987179487179482
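With a two-dimensional y_test, roc_auc_score defaults to a macro average, that is, the unweighted mean of the per-column AUC scores. You can verify this by scoring the columns separately:
column_aucs = [roc_auc_score(y_test[:, i], y_multi_pre[:, i]) for i in range(3)]
print(sum(column_aucs) / 3.0)   # matches the overall score above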
Or, we can do it flower type by flower type, column by column:
from sklearn.metrics import accuracy_score

print("Multi-Output Scores for the Iris Flowers: ")
for column_number in range(3):
    print("Accuracy score of flower " + str(column_number),
          accuracy_score(y_test[:, column_number], y_multi_pred[:, column_number]))
    print("AUC score of flower " + str(column_number),
          roc_auc_score(y_test[:, column_number], y_multi_pre[:, column_number]))
    print("")
Multi-Output Scores for the Iris Flowers: 
Accuracy score of flower 0 1.0
AUC score of flower 0 1.0

Accuracy score of flower 1 0.73684210526315785
AUC score of flower 1 0.76923076923076927

Accuracy score of flower 2 0.97368421052631582
AUC score of flower 2 0.99038461538461542
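Finally, to make the comparison with multiclass classification concrete (a hypothetical aside, not part of the original recipe), recover the integer labels from the one-hot columns with argmax and fit an ordinary classifier on the same split:
import numpy as np
from sklearn.linear_model import LogisticRegression

y_train_labels = np.argmax(y_train, axis=1)   # back to integer class labels
y_test_labels = np.argmax(y_test, axis=1)

clf = LogisticRegression()
clf.fit(X_train, y_train_labels)
print("Multiclass accuracy:", clf.score(X_test, y_test_labels))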