
What is a decision tree?

  • Internal nodes $t$ (circles).
    • Label $A_t$ (splitting criterion).
    • For each outcome $A_t = j$, an edge to the child node $\op{child}(t, j)$.
  • Leaf nodes (squares).
    • Label $\op{class}(t)$ (decision).

How to classify?

  • Trace a path from the root to a leaf: at each internal node $t$, follow the edge matching the example's outcome of $A_t$; the label $\op{class}(t)$ of the leaf reached is the prediction.
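As a concrete illustration (not from the source notes), the traversal can be sketched in Python. The `Node` structure and its field names below are hypothetical:

```python
class Node:
    """Hypothetical decision-tree node: internal if it has children, else a leaf."""
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # splitting criterion A_t (internal nodes)
        self.children = children or {}  # outcome j -> child(t, j)
        self.label = label              # class(t) (leaf nodes)

def classify(node, x):
    """Trace from the root to a leaf following the edges matched by x."""
    while node.children:                         # internal node: follow an edge
        node = node.children[x[node.attribute]]
    return node.label                            # leaf node: return class(t)

# Example: a tiny tree with a single internal node.
tree = Node(attribute="outlook",
            children={"sunny": Node(label="no"), "rainy": Node(label="yes")})
print(classify(tree, {"outlook": "sunny"}))  # -> no
```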

How to build a decision stump?

  • A decision stump is a decision tree with depth $\leq 1$.
  • Choose a splitting attribute.
  • Use majority voting to determine $\op{class}(t)$.
  • Which decision stump is better? Left/right because of o__________.
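A minimal sketch of stump building, reusing the hypothetical `Node` class from the earlier example; `data` is assumed to be a list of `(x, y)` pairs where `x` is a dict of attribute values and `y` is the class:

```python
from collections import Counter

def majority_class(labels):
    """Majority voting: the most common class value gives class(t)."""
    return Counter(labels).most_common(1)[0][0]

def build_stump(data, attribute):
    """Depth-1 tree: split on `attribute`; label each child by majority vote."""
    children = {}
    for j in {x[attribute] for x, _ in data}:  # each outcome A = j
        labels_j = [y for x, y in data if x[attribute] == j]
        children[j] = Node(label=majority_class(labels_j))
    return Node(attribute=attribute, children=children)
```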

Binary splits for numeric attributes

  • C__________ m__-points as s________ points.
  • Which is/are the best split(s)? left/middle/right.
  • How to build a tree instead of a stump? R__________ split (d_____-and-c______).
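For numeric attributes, one common convention (assumed here) is to take the midpoints between consecutive distinct values as the candidate split points; a minimal sketch:

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values of a numeric attribute."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_splits([1.0, 2.0, 2.0, 4.0]))  # [1.5, 3.0]
```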

How to build a decision tree?

How to find good splitting attribute?

  • Given the data $D$ to split, choose the splitting attribute $A$ that minimizes the e____ of the decision stump by $A$.
  • What is the precise formula?
  • $D_j$: set $\Set{(\M{x}, y) \in D \mid A = j \text{ for } \M{x}}$ of tuples in $D$ with $A = j$.
  • $p_{k|j}$: fraction $\frac{\abs{\Set{(\M{x}, y) \in D_j \mid y = k}}}{\abs{D_j}}$ of tuples in $D_j$ belonging to class $k$.
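A minimal sketch of these two quantities in Python (same `data` convention as the earlier sketches):

```python
from collections import Counter

def class_fractions(data, attribute):
    """Return {j: {k: p_{k|j}}}, the per-outcome class fractions of the split."""
    fractions = {}
    for j in {x[attribute] for x, _ in data}:
        labels_j = [y for x, y in data if x[attribute] == j]  # class values in D_j
        fractions[j] = {k: c / len(labels_j) for k, c in Counter(labels_j).items()}
    return fractions
```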

Example

  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.

Further split on $\R{X}_3$

Issue of greedy algorithm

  • Locally optimal split may not be g_______ optimal.
  • Why is splitting on $\R{X}_1$ not good? Child nodes of $\R{X}_1$ are less p___.
  • Why does the misclassification rate fail? It neglects the distribution of the class values of m____________ instances.

How to remain greedy but not myopic?

  • Find better impurity measures than the misclassification rate.
  • How to measure impurity?
  • E.g., order the following distributions in ascending order of impurities:

    ___ (purest) < ___ < ___ < ___

  • Given a distribution $p_k$ of the class values of $D$, how to define a non-negative function of the $p_k$’s that respects the above ordering?
    • $1 - \max_k p_k$ works? Yes/No
    • $1 - \sum_k p_k$ works? Yes/No
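A quick numeric check of the two candidates on some assumed example distributions (the originals are not shown here); note that the class fractions always sum to one:

```python
# Class distributions ordered from purest to most impure (assumed examples).
dists = [(1.0, 0.0), (0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]

for p in dists:
    print(p, round(1 - max(p), 3), round(1 - sum(p), 3))
# 1 - max_k p_k: 0.0, 0.1, 0.3, 0.5 -- increases across these distributions
# 1 - sum_k p_k: 0.0 every time, since the fractions always sum to 1
```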

Gini Impurity Index
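Assuming the usual convention, the Gini impurity of $D$ with class distribution $p_k$ is

\begin{align} \op{Gini}(D) := g(p_0, p_1, \ldots) = 1 - \sum_k p_k^2. \end{align}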

Why does it work?

  • $g(p_0, p_1, \ldots) \geq 0$. Equality iff $\forall k, p_k \in \Set{0, 1}$. Why?
  • $g(p_0, p_1, \ldots, p_n) \leq 1 - 1/n$. Equality iff $p_k = \underline{\phantom{\frac{x}{x}}}$. Why?

Finding the best split using Gini impurity

  • Minimize the Gini impurity given an attribute $A$ (a standard form is given after this list):
  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.
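A standard form for the quantity minimized above, assuming the usual CART convention, is the children's Gini impurity weighted by their sizes:

\begin{align} \op{Gini}_A(D) := \sum_j \frac{\abs{D_j}}{\abs{D}} \op{Gini}(D_j) \end{align}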

An impurity measure from information theory

  • Shannon’s entropy
  • Measured in bits with base-2 logarithm. Why?
  • $0 \log 0$ is regarded as $\lim_{p \to 0} p \log p = 0$ even though $\log 0$ is undefined.
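Assuming the usual convention, Shannon's entropy of the class distribution $p_k$ of $D$ is

\begin{align} \op{Info}(D) := h(p_0, p_1, \ldots) = \sum_k p_k \log_2 \frac{1}{p_k}. \end{align}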

Why does it work?

  • $h(p_0, p_1, \ldots) \geq 0$. Equality iff $\forall k, p_k \in \Set{0, 1}$. Why?
  • $h(p_0, p_1, \ldots, p_n) \leq \log_2 n$. Equality iff $p_k = \underline{\phantom{\frac{x}{x}}}$. Why?

Finding the best split by conditional entropy

Minimize the entropy given $A$:
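Assuming the usual convention, this conditional entropy is the children's entropy weighted by their sizes:

\begin{align} \op{Info}_A(D) := \sum_j \frac{\abs{D_j}}{\abs{D}} \op{Info}(D_j) \end{align}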

  • What is the best splitting attribute? $\underline{\R{X}_1/\R{X}_2/\text{same}}$.

Which impurity measure is used?

  • ID3 (Iterative Dichotomiser 3) maximizes

    \begin{align} \op{Gain}_A(D) := \op{Info}(D) - \op{Info}_A(D) && \text{(information gain or mutual information)} \end{align}
  • CART (Classification and Regression Tree) maximizes

    \begin{align} \Delta \op{Gini}_A(D) := \op{Gini}(D) - \op{Gini}_A(D) && \text{(drop in Gini impurity)} \end{align}
  • Is $\R{X}_4$ a good splitting attribute? Yes/No.
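Both criteria can be sketched in Python under the standard definitions above (same `data` convention as the earlier sketches; each is maximized over the candidate attributes):

```python
import math
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class fractions p_k of `labels`."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Info(D) = sum_k p_k log2(1/p_k); empty classes never appear in Counter."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_child_impurity(data, attribute, impurity):
    """Impurity after splitting on `attribute`, weighted by |D_j| / |D|."""
    n, total = len(data), 0.0
    for j in {x[attribute] for x, _ in data}:
        labels_j = [y for x, y in data if x[attribute] == j]
        total += len(labels_j) / n * impurity(labels_j)
    return total

def gain(data, attribute):
    """ID3's criterion: Info(D) - Info_A(D), to be maximized."""
    return entropy([y for _, y in data]) - weighted_child_impurity(data, attribute, entropy)

def gini_drop(data, attribute):
    """CART's criterion: Gini(D) - Gini_A(D), to be maximized."""
    return gini([y for _, y in data]) - weighted_child_impurity(data, attribute, gini)
```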

Bias towards attributes with many outcomes

  • An attribute with more outcomes tends to
    • reduce impurity more but
    • result in more comparisons.
  • Issue: such an attribute may not minimize the impurity per comparison.
  • Remedies?

Binary split also for nominal attributes

  • CART uses a s____________ $S$ to generate a binary split (whether $A \in S$).
  • The number of outcomes is therefore limited to ___.

Normalization by split information

  • C4.5/J48 allows m________ split but uses the information gain ratio:
    $\frac{\op{Gain}_A(D)}{\op{SplitInfo}_A(D)}$
    where $\op{SplitInfo}_A(D) = \sum_j \frac{\abs{D_j}}{\abs{D}} \log_2 \frac{1}{\abs{D_j}/\abs{D}}$.
  • $\op{SplitInfo}_A(D)$ is the entropy of __________ because __________________.
  • Attributes with many outcomes tend to have smaller/larger $\op{SplitInfo}_A(D)$.
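A minimal sketch of the normalization, reusing `gain` and the `data` convention from the earlier sketches:

```python
import math
from collections import Counter

def split_info(data, attribute):
    """SplitInfo_A(D): entropy of the split proportions |D_j| / |D|."""
    n = len(data)
    counts = Counter(x[attribute] for x, _ in data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def gain_ratio(data, attribute):
    """C4.5's criterion: information gain normalized by split information."""
    return gain(data, attribute) / split_info(data, attribute)  # undefined if A has one outcome
```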

How to avoid overfitting

  • P__-pruning: Limit the size of the tree as we build it. E.g.,
    • Ensure each node is supported by enough examples. (C4.5: minimum number of objects.)
    • Split only if we are confident enough about the improvement. (C4.5: confidence factor.)
  • P___-pruning: Reduce the size of the tree after we build it. E.g.,
    • Contract leaf nodes if complexity outweighs the risk. (CART: cost-complexity pruning)
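As a concrete, library-specific illustration (not from the source notes), scikit-learn exposes both styles: constructor limits such as `min_samples_leaf` act as pre-pruning, while `ccp_alpha` enables CART-style cost-complexity post-pruning:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: require each leaf to be supported by enough examples.
pre = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity path, then refit with a chosen alpha.
# Larger ccp_alpha penalizes complexity more heavily, contracting more leaves.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```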
