"Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks." [Geron2017]
%matplotlib inline
import numpy as np
np.random.seed(42)  # to make sure our results match the book exactly
First, let's load the iris dataset from the scikit-learn library.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
Let's select the columns we are interested in and print them.
X = iris.data[:, 2:] # only focus on petal length and width
Y = iris.target
feature_names = iris.feature_names[2:]
print("given:",feature_names,
"\npredict whether:", iris.target_names)
Plot the data set and have a look at the two selected features.
# use matplotlib as you did in previous labs
Without splitting the dataset into training and test sets as we did in previous labs, let's use the whole data set to train the decision tree.
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,Y)
A decision tree classifier has many hyperparameters. You can see from the output above the parameters that will be used for predictions. There are two criteria you can use with decision trees in scikit-learn; these metrics are calculated at each node of the decision tree.

`criterion='gini'` is a measure of how often a randomly chosen element from the set would be incorrectly labeled. Formally, it is computed by:
$$
I_G(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k
$$
where $J$ denotes the number of classes and $p_i$ is the fraction of items labeled with class $i$. For a concrete example, have a look at https://stats.stackexchange.com/a/339514

`criterion='entropy'` is a measure of entropy, which is used in thermodynamics as a measure of molecular disorder; entropy is zero when the molecules are well ordered. These two metrics are used for deciding the splits while training a decision tree.
Your answer here
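As a quick numeric illustration of the two criteria (an addition to the lab, not part of the original text), here is a minimal sketch that computes the Gini impurity and the entropy of a node from its class counts, using the formula above:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)  # algebraically equal to sum_i p_i * sum_{k != i} p_k

def entropy(counts):
    """Entropy (in bits) of a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log2(p))

# example node holding 0, 49 and 5 samples of the three classes
print(gini([0, 49, 5]))     # ~0.168
print(entropy([0, 49, 5]))  # ~0.445
```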
You can export the decision tree as a `dot` file from scikit-learn. You can convert the `dot` file to a `png` image by installing `graphviz`.
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
out_file="iris_tree.dot",
feature_names=feature_names,
class_names=iris.target_names,
rounded=True,
filled=True
)
# Make sure you have installed graphviz (the exclamation mark runs shell commands)
!apt install graphviz
# Convert dot file to png file.
!dot -Tpng iris_tree.dot -o iris_tree.png
from IPython.display import Image
Image(filename='iris_tree.png')
To see a better visualization example of decision trees, have a look at this page.
There is a brand new visualization library for decision trees called `dtreeviz`, from the creators of ANTLR (a parser generator). You can find some other examples in their repository for better visualizations. Follow the steps below:
# install the package
!pip install dtreeviz
# (optional)
!apt-get install msttcorefonts -qq
from dtreeviz.trees import dtreeviz
import matplotlib as mpl
mpl.rcParams['axes.facecolor'] = 'white'
viz = dtreeviz(tree_clf,
X,
Y,
target_name='flower type',
feature_names=feature_names,
class_names=list(iris.target_names),
fancy=True,
orientation ='TD')
# uncomment the next line to display the visualization
# viz
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
CUSTOM_CMAP = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
# helper function to plot the boundaries
def plot_decision_boundary(clf, x, y):
    color_map = ["yo", "bs", "g^"]
    # plot the training samples of each class with its own marker style
    for target_index, target_name in enumerate(iris.target_names):
        plt.plot(x[:, 0][y == target_index],  # first feature on the X axis (samples of this class)
                 x[:, 1][y == target_index],  # second feature on the Y axis (samples of this class)
                 color_map[target_index],
                 label=target_name)
    # build a 100x100 grid over the feature ranges, predict each grid point and shade by predicted class
    x1s = np.linspace(np.min(x[:, 0]), np.max(x[:, 0]), 100)
    x2s = np.linspace(np.min(x[:, 1]), np.max(x[:, 1]), 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    x_test = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(x_test).reshape(x1.shape)
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=CUSTOM_CMAP)
plot_decision_boundary(tree_clf, X, Y)
plt.xlabel(feature_names[0]) # petal length (cm)
plt.ylabel(feature_names[1]) # petal width (cm)
plt.show()
Change the `max_depth` of the decision tree classifier and observe the changes in the decision boundaries. What would you set `max_depth` to?

If the helper function (`plot_decision_boundary`) was not available to you, how would you visualize the decision boundaries? Tip: try to create X's that range from $[1, 0]$ to $[7, 2.5]$ (a hedged sketch follows the tip cell below). You can use:
# check np.mgrid[minX1:maxX1:increment, minX2:maxX2:increment]
X = np.mgrid[0:10:1, -5:0:1].reshape(2,-1).T
X
# Tip: try each point in the space
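One possible answer, as a minimal sketch (the variable names and grid bounds are my own, not the official solution): predict the class of every grid point and colour it by the prediction.

```python
# sketch: colour each grid point by the class tree_clf predicts for it
petals = iris.data[:, 2:]  # re-select petal length/width (X was overwritten by the mgrid example above)
grid = np.mgrid[1:7:0.02, 0:2.5:0.02].reshape(2, -1).T
grid_pred = tree_clf.predict(grid)

plt.scatter(grid[:, 0], grid[:, 1], c=grid_pred, cmap=CUSTOM_CMAP, s=1)
plt.plot(petals[:, 0], petals[:, 1], "k.", alpha=0.5)  # overlay the training points
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.show()
```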
To estimate the probability that an instance belongs to each class, you can use `predict_proba`; to determine the class that an instance will be assigned to, use `predict`.
tree_clf.predict_proba([[5, 1.5]])
tree_clf.predict([[5, 1.5]])
"Constraining a model to make it simpler and reduce the risk of overfitting is called _regularization" [Geron2017, page 27] To avoid overfitting, you can limit the generation of a node by min_samples_leaf
(the minimum samples that a node must have to able to be splitted.).
from sklearn.datasets import make_moons
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)
deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)
plt.figure(figsize=(11, 4))
plt.subplot(121)
plt.xlabel(r"$x_1$", fontsize=18)
plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
plot_decision_boundary(deep_tree_clf1, Xm, ym)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plt.xlabel(r"$x_1$", fontsize=18)
plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
plot_decision_boundary(deep_tree_clf2, Xm, ym)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)
plt.show()
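To check that the restriction actually helps generalization rather than just changing the picture, here is a minimal sketch (my addition) that holds out a test set from the moons data and compares the two trees' test accuracy:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out part of the moons data and compare test accuracy with and without the leaf restriction
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, random_state=42)
for clf in (DecisionTreeClassifier(random_state=42),
            DecisionTreeClassifier(min_samples_leaf=4, random_state=42)):
    clf.fit(Xm_train, ym_train)
    print("min_samples_leaf =", clf.min_samples_leaf,
          "-> test accuracy:", accuracy_score(ym_test, clf.predict(Xm_test)))
```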
Decision trees can be used for regression tasks too. Instead of predicting a class, in regression tasks the aim is to predict a numeric value (such as the price of a car). Assume that we have this quadratic data set with some noise:
# Quadratic training set + noise
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
Y = y + np.random.randn(m, 1) / 10
plt.plot(X, Y, "bo")
plt.xlabel("$x_{1}$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.show()
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,Y)
Plot the predictions of the `max_depth=2` and `max_depth=3` regression trees (also try `min_samples_leaf=10`). To generate evenly spaced input values you can use `np.linspace(min, max, noOfPoints)`. A hedged sketch is given after the tips cell below.
# Tips for 6.2.2:
Xs = np.linspace(0, 1, 100)
Xs
# predict Y values for Xs and plot
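A minimal sketch of one way to do this (the second tree and the plot styling are my own choices, assuming the `tree_reg` fitted above):

```python
# fit a slightly deeper tree for comparison
tree_reg2 = DecisionTreeRegressor(max_depth=3)
tree_reg2.fit(X, Y)

# predict over evenly spaced x values and plot both models against the data
Xs = np.linspace(0, 1, 100)
plt.plot(X, Y, "bo", alpha=0.3, label="training data")
plt.plot(Xs, tree_reg.predict(Xs.reshape(-1, 1)), "r-", label="max_depth=2")
plt.plot(Xs, tree_reg2.predict(Xs.reshape(-1, 1)), "g--", label="max_depth=3")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)
plt.legend()
plt.show()
```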
Instead of using a single predictor, to improve our predictions we now use an ensemble: a group of predictors. It is as if you asked a number of experts for their opinion on a problem and then aggregated their answers.
Below is a brief comparison between *soft* voting and *hard* voting using three predictors.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, Y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
Here is a helper function that prints each classifier's accuracy on the test set.
from sklearn.metrics import accuracy_score
def test_clfs(*clfs):  # clf -> classifier
    for clf in clfs:
        clf.fit(X_train, Y_train)  # train the classifier
        Y_pred = clf.predict(X_test)
        print(clf.__class__.__name__ + ":", accuracy_score(Y_test, Y_pred))
Let's test *hard* voting first.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Don't worry about the warnings;
# the scikit-learn community will be fixing them in the next major version, 0.20.0
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42, probability=True)
voting_clf = VotingClassifier(estimators=[('lr', log_clf),
('rf', rnd_clf),
('svc', svm_clf)],
voting='hard')
test_clfs(log_clf, rnd_clf, svm_clf, voting_clf)
Now try *soft* voting and compare the results. Why do you think it is different? If you don't know the difference, have a look at your book, page 186.
Your answer here:
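A minimal sketch of the soft-voting version (same three base classifiers; note that `SVC` was created with `probability=True` above, which soft voting needs):

```python
# same ensemble, but average the predicted class probabilities instead of taking a majority vote
soft_voting_clf = VotingClassifier(estimators=[('lr', log_clf),
                                               ('rf', rnd_clf),
                                               ('svc', svm_clf)],
                                   voting='soft')
test_clfs(soft_voting_clf)
```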
Instead of giving the whole training set to each predictor in our ensemble, another approach to gain more accuracy is to give each predictor a different random subset of the training set. There are two ways of sampling these subsets: *bagging* (sampling with replacement) and *pasting* (sampling without replacement); a pasting sketch is shown after the bagging example below.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# define our decision tree classifier
tree_clf = DecisionTreeClassifier(random_state=42)
# 500 copies of the predictor, each trained on 100 samples drawn from the training set (with replacement)
# n_jobs=-1 for utilizing all cores
bag_clf = BaggingClassifier(tree_clf,
n_estimators=500,
max_samples=100,
bootstrap=True,
n_jobs=-1,
random_state=42)
# fit the bagging classifier
bag_clf.fit(X_train, Y_train)
tree_clf.fit(X_train, Y_train)
Y_pred_bag = bag_clf.predict(X_test)
Y_pred_tree = tree_clf.predict(X_test)
from sklearn.metrics import accuracy_score
print("Bagging Classifier")
print(accuracy_score(Y_test, Y_pred_bag))
print("Decision Tree Classifier")
print(accuracy_score(Y_test, Y_pred_tree))
plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, Y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, Y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()
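For completeness, here is a hedged sketch of the second sampling strategy, *pasting* (my addition, not part of the original lab): the only change from the bagging classifier above is `bootstrap=False`, so the subsets are drawn without replacement.

```python
# pasting: same ensemble as above, but each subset is sampled WITHOUT replacement
paste_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                              n_estimators=500,
                              max_samples=100,
                              bootstrap=False,  # this single flag turns bagging into pasting
                              n_jobs=-1,
                              random_state=42)
paste_clf.fit(X_train, Y_train)
print("Pasting Classifier")
print(accuracy_score(Y_test, paste_clf.predict(X_test)))
```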
In Gradient Boosting, we start with a single predictor. Sequentially, we add new predictors to the ensemble, where each new predictor corrects its predecessor. In detail, the algorithm for 3 predictors in our ensemble is:

1. Train the first predictor on the training set.
2. Train the second predictor on the residual errors made by the first one (e.g. `y2 = y - predictor1.predict(X)`).
3. Train the third predictor on the residual errors made by the second one.

To make a prediction, sum the predictions of all `n` predictors.

Apply this on the quadratic training set as a regression task, using `DecisionTreeRegressor(max_depth=2)` as a weak predictor. Then use `GradientBoostingRegressor` and compare whether your results are similar (a manual sketch is given after the tips cell below). Tips:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, Y)  # make sure X, Y hold the quadratic data here (they were reused for the moons set above)
# Your implementation here
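A minimal sketch of the manual three-tree version (the names `Xq`, `Yq`, `gbrt_q` are my own; the quadratic data is re-created because `X`, `Y` were reused for the moons set above):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# re-create the quadratic data set
np.random.seed(42)
Xq = np.random.rand(200, 1)
Yq = (4 * (Xq - 0.5) ** 2 + np.random.randn(200, 1) / 10).ravel()

# 1st weak predictor: fit the data directly
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(Xq, Yq)

# 2nd weak predictor: fit the residual errors of the first
y2 = Yq - tree_reg1.predict(Xq)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(Xq, y2)

# 3rd weak predictor: fit the residual errors of the second
y3 = y2 - tree_reg2.predict(Xq)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(Xq, y3)

# the ensemble prediction is the sum of the three trees' predictions
X_new = np.linspace(0, 1, 100).reshape(-1, 1)
y_manual = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

# scikit-learn's implementation with the same settings, for comparison
gbrt_q = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt_q.fit(Xq, Yq)

plt.plot(Xq.ravel(), Yq, "bo", alpha=0.3)
plt.plot(X_new, y_manual, "r-", linewidth=2, label="manual (3 trees)")
plt.plot(X_new, gbrt_q.predict(X_new), "g--", linewidth=2, label="GradientBoostingRegressor")
plt.legend()
plt.show()
```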
At this point, we have demonstrated these concepts: decision trees for classification and regression, the Gini and entropy split criteria, regularization with `min_samples_leaf`, hard and soft voting ensembles, bagging, and gradient boosting.
Of course, there is some material that we have not been able to cover. In your free time, it would be worth having a look at: What is *Stacking*?