"Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks." [Geron2017]
%matplotlib inline
import numpy as np
np.random.seed(42) # fix the seed so the results are reproducible and match the book
First, let's load the iris dataset from the scikit-learn library.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
Let's select the columns we are interested in and print their names.
X = iris.data[:, 2:] # only focus on petal length and width
Y = iris.target
feature_names = iris.feature_names[2:]
print("given:",feature_names,
"\npredict whether:", iris.target_names)
Plot the data set and have a look at the two features that are selected.
# use matplotlib as you did in previous labs
Unlike in previous labs, we will not split the data set this time: let's train the decision tree on the whole data set.
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,Y)
A decision tree classifier has many hyperparameters. The output above shows the parameter values that will be used. There are two criteria that you can use with decision trees in scikit-learn. The chosen metric is calculated at each node of the decision tree.
criterion='gini' is a measure of how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set. Formally, it is computed by:
$$
I_G(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k
$$
where $J$ is the number of classes and $p_i$ is the fraction of items labeled with class $i$. For a concrete example, have a look at: https://stats.stackexchange.com/a/339514
criterion='entropy' is a measure of entropy, a concept that is used in thermodynamics as a measure of molecular disorder. Entropy=0 means the molecules are well ordered. For a tree node, entropy is 0 when the node contains instances of only one class.
These two metrics are used to decide the splits while training a decision tree.
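To make the two criteria concrete, here is a minimal sketch that computes both from the class counts of a node. The helper names `gini` and `entropy` and the example counts are illustrative assumptions, not part of scikit-learn. The Gini formula above simplifies to $1 - \sum_i p_i^2$, and the entropy is taken as $-\sum_i p_i \log_2 p_i$.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    # equivalent to sum_i p_i * sum_{k != i} p_k
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # by convention 0 * log(0) = 0, so empty classes are skipped
    return -np.sum(p * np.log2(p))

# Example class counts (assumed for illustration)
balanced_node = [50, 50, 50]  # three equally frequent classes: maximum impurity
mixed_node = [0, 49, 5]       # mostly one class
pure_node = [50, 0, 0]        # a single class: impurity is zero

for name, counts in [("balanced", balanced_node), ("mixed", mixed_node), ("pure", pure_node)]:
    print(name, "| gini:", round(gini(counts), 3), "| entropy:", round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is considered a good split.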
Your answer here
You can export the decision tree as a dot file from scikit-learn. You can then convert the dot file to a png image with graphviz.
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
out_file="iris_tree.dot",
feature_names=feature_names,
class_names=iris.target_names,
rounded=True,
filled=True
)
# Make sure you installed graphviz (exclamation mark is for shell commands)
!apt install graphviz
# Convert dot file to png file.
!dot -Tpng iris_tree.dot -o iris_tree.png
from IPython.display import Image
Image(filename='iris_tree.png')
To see a better visualization example of decision trees, have a look at this page.
There is a visualization library for decision trees called dtreeviz, from the creators of ANTLR (a parser generator). You can find more visualization examples in their repository. Follow the steps below:
# install the package
!pip install dtreeviz
# (optional)
!apt-get install msttcorefonts -qq
from dtreeviz.trees import dtreeviz
import matplotlib as mpl
mpl.rcParams['axes.facecolor'] = 'white'
viz = dtreeviz(tree_clf,
X,
Y,
target_name='flower type',
feature_names=feature_names,
class_names=list(iris.target_names),
fancy=True,
orientation ='TD')
# uncomment this
# viz
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
CUSTOM_CMAP = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
# helper function to plot the boundaries
def plot_decision_boundary(clf, x, y):
    color_map = ["yo", "bs", "g^"]
    # plot the instances of each class with their own marker
    for target_index, target_name in enumerate(iris.target_names):
        plt.plot(x[:, 0][y == target_index],  # first feature on the X axis (instances of this class only)
                 x[:, 1][y == target_index],  # second feature on the Y axis (instances of this class only)
                 color_map[target_index],
                 label=target_name)
    # build a 100 x 100 grid that covers the range of both features
    x1s = np.linspace(np.min(x[:, 0]), np.max(x[:, 0]), 100)
    x2s = np.linspace(np.min(x[:, 1]), np.max(x[:, 1]), 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    # predict the class of every grid point and shade the resulting regions
    x_test = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(x_test).reshape(x1.shape)
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=CUSTOM_CMAP)
plot_decision_boundary(tree_clf, X, Y)
plt.xlabel(feature_names[0]) # petal length (cm)
plt.ylabel(feature_names[1]) # petal width (cm)
plt.show()
Change the max_depth of the decision tree classifier and observe the changes in the decision boundaries. What would you set max_depth to?
If the helper function (plot_decision_boundary) was not available to you, how would you visualize the decision boundaries? Tip: try to create X's that range from $[1, 0]$ to $[7, 2.5]$. You can use:
# check np.mgrid[minX1:maxX1:increment, minX2:maxX2:increment]
X = np.mgrid[0:10:1, -5:0:1].reshape(2,-1).T
X
# Tip: try each point in the space
To estimate the probability that an instance belongs to each class, you can use predict_proba. To determine the class that an instance will be assigned to, use predict.
tree_clf.predict_proba([[5, 1.5]])
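As a rough illustration of where these probabilities come from: a decision tree finds the leaf that the instance falls into and returns the fraction of training instances of each class in that leaf. The sketch below checks this by hand. It retrains its own tree so that it is self-contained, and the names `X_petal`, `clf` and `sample` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_petal = iris.data[:, 2:]  # petal length and width, as above
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_petal, iris.target)

sample = np.array([[5.0, 1.5]])

# apply() returns the index of the leaf that each instance ends up in
leaf_id = clf.apply(sample)[0]
in_same_leaf = clf.apply(X_petal) == leaf_id

# class ratios among the training instances of that leaf
counts = np.bincount(iris.target[in_same_leaf], minlength=3)
manual_proba = counts / counts.sum()

proba = clf.predict_proba(sample)[0]
print("class ratios in the leaf:", manual_proba)
print("predict_proba:           ", proba)
print("predicted class:", iris.target_names[clf.predict(sample)[0]])
```

The class returned by predict is simply the one with the highest estimated probability.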
tree_clf.predict([[5, 1.5]])
"Constraining a model to make it simpler and reduce the risk of overfitting is called regularization." [Geron2017, page 27] To avoid overfitting, you can restrict how the tree grows, for example with min_samples_leaf (the minimum number of samples a leaf node must contain) or min_samples_split (the minimum number of samples a node must have before it can be split).
from sklearn.datasets import make_moons
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)
deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)
plt.figure(figsize=(11, 4))
plt.subplot(121)
plt.xlabel(r"$x_1$", fontsize=18)
plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
plot_decision_boundary(deep_tree_clf1, Xm, ym)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plt.xlabel(r"$x_1$", fontsize=18)
plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
plot_decision_boundary(deep_tree_clf2, Xm, ym)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)
plt.show()
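To put numbers next to the two plots, the following sketch refits both trees on the same moons data and reports their size. It is only an illustration under the same settings as above. The variable names are assumptions, and the exact depths and leaf counts depend on your scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=100, noise=0.25, random_state=53)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_moons, y_moons)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_moons, y_moons)

for name, clf in [("no restrictions", unrestricted), ("min_samples_leaf=4", regularized)]:
    # number of training instances that end up in each leaf
    leaf_sizes = np.bincount(clf.apply(X_moons))
    leaf_sizes = leaf_sizes[leaf_sizes > 0]
    print(name,
          "| depth:", clf.get_depth(),
          "| leaves:", clf.get_n_leaves(),
          "| smallest leaf:", leaf_sizes.min(),
          "| training accuracy:", clf.score(X_moons, y_moons))
```

The unrestricted tree keeps splitting until it classifies every training instance correctly, which is exactly the behavior that tends to overfit noisy data.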
Decision trees can be used for regression tasks too. In a regression task, the aim is to predict a numeric value (such as the price of a car) instead of a class. Assume that we have this quadratic data set with some noise:
# Quadratic training set + noise
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
Y = y + np.random.randn(m, 1) / 10
plt.plot(X, Y, "bo")
plt.xlabel("$x_{1}$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.show()
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,Y)
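A regression tree predicts a constant value per leaf: with the default squared-error criterion, that value is the mean target of the training instances in the leaf. The sketch below checks this on a freshly generated quadratic data set. The names `X_quad`, `y_quad` and `reg` are assumptions made for this example and do not reuse the variables above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Quadratic data with noise, generated the same way as above (illustrative copy)
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1))
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + rng.normal(scale=0.1, size=200)

reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_quad, y_quad)

leaf_ids = reg.apply(X_quad)       # leaf index of every training instance
predictions = reg.predict(X_quad)  # one constant value per leaf

# mean target value of the training instances in each leaf
leaf_means = {leaf: y_quad[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}

for leaf, mean in leaf_means.items():
    print("leaf", leaf, "| mean target:", round(mean, 3),
          "| prediction:", round(predictions[leaf_ids == leaf][0], 3))
```

With max_depth=2 there are at most four leaves, so the fitted curve is a step function with at most four levels.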
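Plot the predictions of the regression tree over a range of inputs (you can generate the inputs with np.linspace(min, max, noOfPoints)).
Compare max_depth=2 and max_depth=3 regression trees (also try min_samples_leaf=10).
# Tips for 6.2.2: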
Xs = np.linspace(0, 1, 100).reshape(-1, 1) # predict expects a 2D array: one row per instance
Xs
# predict Y values for Xs and plot
Instead of using a single predictor, we now use an ensemble, a group of predictors, to improve our predictions. It is as if you ask a number of experts for their opinion about a problem and aggregate their answers.
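There are two common ways to aggregate the answers of classifiers: hard voting counts the predicted classes and picks the majority, while soft voting averages the predicted class probabilities and picks the most probable class. The toy example below uses made-up probabilities (not the output of any real model) to show a case where the two disagree.

```python
import numpy as np

# Hypothetical class probabilities of three classifiers for a single instance
# (columns: class 0, class 1)
probas = np.array([
    [0.45, 0.55],  # classifier A: slightly prefers class 1
    [0.40, 0.60],  # classifier B: slightly prefers class 1
    [0.95, 0.05],  # classifier C: strongly prefers class 0
])

# Hard voting: every classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)
hard_winner = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard votes:", hard_votes, "-> class", hard_winner)
print("mean probabilities:", mean_proba, "-> class", soft_winner)
```

Here two lukewarm votes outnumber one confident vote under hard voting, but the confident classifier dominates the average under soft voting.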
A brief comparison between soft voting and hard voting using three predictors.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, Y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
Helper function for printing accuracies on test set
from sklearn.metrics import accuracy_score
def test_clfs(*clfs): # clf -> classifier
    for clf in clfs:
        clf.fit(X_train, Y_train) # train the classifier
        Y_pred = clf.predict(X_test)
        print(clf.__class__.__name__ + ":", accuracy_score(Y_test, Y_pred))
Let's test hard voting first.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Don't worry about the warnings,
# sci-kit community will be fixing it in the next major version 0.20.0
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42, probability=True)
voting_clf = VotingClassifier(estimators=[('lr', log_clf),
('rf', rnd_clf),
('svc', svm_clf)],
voting='hard')
test_clfs(log_clf, rnd_clf, svm_clf, voting_clf)
Try soft voting and compare the results. Why do you think the results are different? If you don't know the difference, have a look at your book, page 186.
Your answer here:
Instead of giving the whole training set to each predictor in our ensemble, another approach to gain accuracy is to train each predictor on a different random subset of the training set. There are two ways: bagging (sampling with replacement) and pasting (sampling without replacement).
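The difference between sampling with replacement (bagging) and without replacement (pasting) can be sketched with plain NumPy. This only illustrates the sampling step, not how BaggingClassifier is implemented internally, and the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 10     # size of a hypothetical training set
subset_size = 6  # number of instances given to each predictor

# Bagging: sample WITH replacement, so an instance can appear several times
bagging_idx = rng.choice(n_train, size=subset_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index is unique
pasting_idx = rng.choice(n_train, size=subset_size, replace=False)

print("bagging subset:", np.sort(bagging_idx))
print("pasting subset:", np.sort(pasting_idx))
```

In BaggingClassifier, the bootstrap parameter switches between the two: bootstrap=True samples with replacement and bootstrap=False samples without.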
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# define our decision tree classifier
tree_clf = DecisionTreeClassifier(random_state=42)
# 500 copies of the predictor, each trained on 100 samples drawn from the training set
# n_jobs=-1 for utilizing all cores
bag_clf = BaggingClassifier(tree_clf,
n_estimators=500,
max_samples=100,
bootstrap=True,
n_jobs=-1,
random_state=42)
# fit the bagging classifier
bag_clf.fit(X_train, Y_train)
tree_clf.fit(X_train, Y_train)
Y_pred_bag = bag_clf.predict(X_test)
Y_pred_tree = tree_clf.predict(X_test)
from sklearn.metrics import accuracy_score
print("Bagging Classifier")
print(accuracy_score(Y_test, Y_pred_bag))
print("Decision Tree Classifier")
print(accuracy_score(Y_test, Y_pred_tree))
plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, Y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, Y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()
In Gradient Boosting, we start with a single predictor. Then we sequentially add new predictors to the ensemble, each one correcting its predecessor. In detail:
The algorithm for an ensemble of 3 predictors:
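Train each new predictor on the residual errors of its predecessor (e.g. y2 = y - predictor1.predict(X)).
Generalize the algorithm to n predictors.
Use the quadratic training set as a regression task.
Use DecisionTreeRegressor(max_depth=2) as a weak predictor.
Then use GradientBoostingRegressor and compare whether your results are similar.
Tips: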
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
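# X and Y were reassigned to the moons data above: re-run the quadratic data set cell first
gbrt.fit(X, Y.ravel()) # ravel() turns the (m, 1) target into the 1D array that fit expects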
# Your implementation here
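If you get stuck, the residual-fitting idea can be sketched as follows. This is only one possible approach, shown on a hypothetical noisy sine data set rather than the lab's quadratic data, and the names `X_toy`, `y_toy` and `ensemble_predict` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the lab's data set): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

n_predictors = 3
predictors = []
training_mse = []
residual = y_toy.copy()  # the first predictor is trained on the targets themselves

for _ in range(n_predictors):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_toy, residual)                  # fit the current residual errors
    predictors.append(tree)
    residual = residual - tree.predict(X_toy)  # what is still left to explain
    training_mse.append(np.mean(residual ** 2))

def ensemble_predict(X_new):
    """The ensemble's prediction is the sum of the predictions of all trees."""
    return sum(tree.predict(X_new) for tree in predictors)

print("training MSE after each predictor:", np.round(training_mse, 4))
```

Each added tree explains part of what the previous ones missed, so the training error should not increase from one stage to the next.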
At this point, we demonstrated these concepts:
Of course, there is some material that we have not been able to cover. In your free time, it is worth having a look at:
Stacking?