
What is a decision tree?

  • Internal nodes $t$ (circles).
    • Label $A_t$ (splitting criterion).
    • For each outcome $A_t = j$, an edge to the child node $\op{child}(t, j)$.
  • Leaf nodes (squares).
    • Label $\op{class}(t)$ (decision).

How to classify?

  • Trace a path from the root to a leaf: at each internal node $t$, follow the edge matching the example's outcome of $A_t$; the label $\op{class}(t)$ of the leaf reached is the prediction.
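As a concrete illustration (not from the source notes), the traversal can be sketched in Python. The `Node` structure and its field names below are hypothetical:

```python
class Node:
    """Hypothetical decision-tree node: internal if it has children, else a leaf."""
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # splitting criterion A_t (internal nodes)
        self.children = children or {}  # outcome j -> child(t, j)
        self.label = label              # class(t) (leaf nodes)

def classify(node, x):
    """Trace from the root to a leaf following the edges matched by x."""
    while node.children:                         # internal node: follow an edge
        node = node.children[x[node.attribute]]
    return node.label                            # leaf node: return class(t)

# Example: a tiny tree with a single internal node.
tree = Node(attribute="outlook",
            children={"sunny": Node(label="no"), "rainy": Node(label="yes")})
print(classify(tree, {"outlook": "sunny"}))  # -> no
```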

How to build a decision stump?

  • A decision stump is a decision tree with depth $\leq 1$.
  • Choose a splitting attribute.
  • Use majority voting to determine $\op{class}(t)$.
  • Which decision stump is better? Left/right because of o__________.
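A minimal sketch of stump building, reusing the hypothetical `Node` class from the earlier example; `data` is assumed to be a list of `(x, y)` pairs where `x` is a dict of attribute values and `y` is the class:

```python
from collections import Counter

def majority_class(labels):
    """Majority voting: the most common class value gives class(t)."""
    return Counter(labels).most_common(1)[0][0]

def build_stump(data, attribute):
    """Depth-1 tree: split on `attribute`; label each child by majority vote."""
    children = {}
    for j in {x[attribute] for x, _ in data}:  # each outcome A = j
        labels_j = [y for x, y in data if x[attribute] == j]
        children[j] = Node(label=majority_class(labels_j))
    return Node(attribute=attribute, children=children)
```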

Binary splits for numeric attributes

  • C__________ m__-points as s________ points.
  • Which is/are the best split(s)? left/middle/right.
  • How to build a tree instead of a stump? R__________ split (d_____-and-c______).
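For numeric attributes, one common convention (assumed here) is to take the midpoints between consecutive distinct values as the candidate split points; a minimal sketch:

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values of a numeric attribute."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_splits([1.0, 2.0, 2.0, 4.0]))  # [1.5, 3.0]
```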

How to build a decision tree?

How to find good splitting attribute?

  • Given the data $D$ to split, choose the splitting attribute $A$ that minimizes the e____ of the decision stump by $A$.
  • What is the precise formula?
  • $D_j$: set $\Set{(\M{x}, y) \in D \mid A = j \text{ for } \M{x}}$ of tuples in $D$ with $A = j$.
  • $p_{k|j}$: fraction $\frac{\abs{\Set{(\M{x}, y) \in D_j \mid y = k}}}{\abs{D_j}}$ of tuples in $D_j$ belonging to class $k$.
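A minimal sketch of these two quantities in Python (same `data` convention as the earlier sketches):

```python
from collections import Counter

def class_fractions(data, attribute):
    """Return {j: {k: p_{k|j}}}, the per-outcome class fractions of the split."""
    fractions = {}
    for j in {x[attribute] for x, _ in data}:
        labels_j = [y for x, y in data if x[attribute] == j]  # class values in D_j
        fractions[j] = {k: c / len(labels_j) for k, c in Counter(labels_j).items()}
    return fractions
```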

Example

  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.

Further split on $\R{X}_3$

Issue of greedy algorithm

  • Locally optimal split may not be g_______ optimal.
  • Why is splitting on $\R{X}_1$ not good? Child nodes of $\R{X}_1$ are less p___.
  • Why does the misclassification rate fail? It neglects the distribution of the class values of m____________ instances.

How to remain greedy but not myopic?

  • Find better impurity measures than the misclassification rate.
  • How to measure impurity?
  • E.g., order the following distributions in ascending order of impurities:

    ___ (purest) < ___ < ___ < ___

  • Given a distribution $p_k$ of the class values of $D$, how to define a non-negative function of the $p_k$’s that respects the above ordering?
    • $1 - \max_k p_k$ works? Yes/No
    • $1 - \sum_k p_k$ works? Yes/No
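A quick numeric check of the two candidates on some assumed example distributions (the originals are not shown here); note that the class fractions always sum to one:

```python
# Class distributions ordered from purest to most impure (assumed examples).
dists = [(1.0, 0.0), (0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]

for p in dists:
    print(p, round(1 - max(p), 3), round(1 - sum(p), 3))
# 1 - max_k p_k: 0.0, 0.1, 0.3, 0.5 -- increases across these distributions
# 1 - sum_k p_k: 0.0 every time, since the fractions always sum to 1
```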

Gini Impurity Index
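Assuming the usual convention, the Gini impurity of $D$ with class distribution $p_k$ is

\begin{align} \op{Gini}(D) := g(p_0, p_1, \ldots) = 1 - \sum_k p_k^2. \end{align}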

Why does it work?

  • $g(p_0, p_1, \ldots) \geq 0$. Equality iff $\forall k, p_k \in \Set{0, 1}$. Why?
  • $g(p_0, p_1, \ldots, p_n) \leq 1 - 1/n$. Equality iff $p_k = \underline{\phantom{\frac{x}{x}}}$. Why?

Finding the best split using Gini impurity

  • Minimize the Gini impurity given an attribute $A$ (a standard form is given after this list):
  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.
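A standard form for the quantity minimized above, assuming the usual CART convention, is the children's Gini impurity weighted by their sizes:

\begin{align} \op{Gini}_A(D) := \sum_j \frac{\abs{D_j}}{\abs{D}} \op{Gini}(D_j) \end{align}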

An impurity measure from information theory

  • Shannon’s entropy
  • Measured in bits with base-2 logarithm. Why?
  • $0 \log 0$ is regarded as $\lim_{p \to 0} p \log p = 0$ even though $\log 0$ is undefined.
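Assuming the usual convention, Shannon's entropy of the class distribution $p_k$ of $D$ is

\begin{align} \op{Info}(D) := h(p_0, p_1, \ldots) = \sum_k p_k \log_2 \frac{1}{p_k}. \end{align}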

Why does it work?

  • $h(p_0, p_1, \ldots) \geq 0$. Equality iff $\forall k, p_k \in \Set{0, 1}$. Why?
  • $h(p_0, p_1, \ldots, p_n) \leq \log_2 n$. Equality iff $p_k = \underline{\phantom{\frac{x}{x}}}$. Why?

Finding the best split by conditional entropy

Minimize the entropy given $A$:
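Assuming the usual convention, this conditional entropy is the children's entropy weighted by their sizes:

\begin{align} \op{Info}_A(D) := \sum_j \frac{\abs{D_j}}{\abs{D}} \op{Info}(D_j) \end{align}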

  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.

Which impurity measure is used?

  • ID3 (Iterative Dichotomiser 3) maximizes

    \begin{align} \op{Gain}_A(D) := \op{Info}(D) - \op{Info}_A(D) && \text{(information gain or mutual information)} \end{align}
  • CART (Classification and Regression Tree) maximizes

    \begin{align} \Delta \op{Gini}_A(D) := \op{Gini}(D) - \op{Gini}_A(D) && \text{(drop in Gini impurity)} \end{align}
  • Is $\R{X}_4$ a good splitting attribute? Yes/No.
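Both criteria can be sketched in Python under the standard definitions above (same `data` convention as the earlier sketches; each is maximized over the candidate attributes):

```python
import math
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class fractions p_k of `labels`."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Info(D) = sum_k p_k log2(1/p_k); empty classes never appear in Counter."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_child_impurity(data, attribute, impurity):
    """Impurity after splitting on `attribute`, weighted by |D_j| / |D|."""
    n, total = len(data), 0.0
    for j in {x[attribute] for x, _ in data}:
        labels_j = [y for x, y in data if x[attribute] == j]
        total += len(labels_j) / n * impurity(labels_j)
    return total

def gain(data, attribute):
    """ID3's criterion: Info(D) - Info_A(D), to be maximized."""
    return entropy([y for _, y in data]) - weighted_child_impurity(data, attribute, entropy)

def gini_drop(data, attribute):
    """CART's criterion: Gini(D) - Gini_A(D), to be maximized."""
    return gini([y for _, y in data]) - weighted_child_impurity(data, attribute, gini)
```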

Bias towards attributes with many outcomes

  • An attribute with more outcomes tends to
    • reduce impurity more but
    • result in more comparisons.
  • Issue: such an attribute may not minimize the impurity per comparison.
  • Remedies?

Binary split also for nominal attributes

  • CART uses a s____________ $S$ to generate a binary split (whether $A \in S$).
  • The number of outcomes is therefore limited to ___.

Normalization by split information

  • C4.5/J48 allows m________ split but uses the information gain ratio:
    $\frac{\op{Gain}_A(D)}{\op{SplitInfo}_A(D)}$
    where $\op{SplitInfo}_A(D) = \sum_j \frac{\abs{D_j}}{\abs{D}} \log_2 \frac{1}{\abs{D_j}/\abs{D}}$.
  • $\op{SplitInfo}_A(D)$ is the entropy of __________ because __________________.
  • Attributes with many outcomes tend to have smaller/larger $\op{SplitInfo}_A(D)$.
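A minimal sketch of the normalization, reusing `gain` and the `data` convention from the earlier sketches:

```python
import math
from collections import Counter

def split_info(data, attribute):
    """SplitInfo_A(D): entropy of the split proportions |D_j| / |D|."""
    n = len(data)
    counts = Counter(x[attribute] for x, _ in data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def gain_ratio(data, attribute):
    """C4.5's criterion: information gain normalized by split information."""
    return gain(data, attribute) / split_info(data, attribute)  # undefined if A has one outcome
```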

How to avoid overfitting

  • P__-pruning: Limit the size of the tree as we build it. E.g.,
    • Ensure each node is supported by enough examples. (C4.5: minimum number of objects.)
    • Split only if we are confident enough about the improvement. (C4.5: confidence factor.)
  • P___-pruning: Reduce the size of the tree after we build it. E.g.,
    • Contract leaf nodes if complexity outweighs the risk. (CART: cost-complexity pruning)
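As a concrete, library-specific illustration (not from the source notes), scikit-learn exposes both styles: constructor limits such as `min_samples_leaf` act as pre-pruning, while `ccp_alpha` enables CART-style cost-complexity post-pruning:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: require each leaf to be supported by enough examples.
pre = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity path, then refit with a chosen alpha.
# Larger ccp_alpha penalizes complexity more heavily, contracting more leaves.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```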
