
Motivation

  • When is the decision equal to 1?
    1. If _____________________, then $\R{Y}=1$.
    2. Else $\R{Y}=0$.

Knowledge representation

  • Benefits of representing knowledge by rules (c.f. decision tree or NN):
    • M____________________________
    • I_____________________________
  • How to generate rules?

Generate rules from a decision tree

  • Each path from root to leaf corresponds to a rule:
    1. $\R{X}_1 = \underline{\phantom{x}} \Rightarrow \R{Y} = 0$
    2. $\R{X}_1 = \underline{\phantom{x}}, \R{X}_2 = \underline{\phantom{x}} \Rightarrow \R{Y} = 0$
    3. $\R{X}_1 = \underline{\phantom{x}}, \R{X}_2 = \underline{\phantom{x}} \Rightarrow \R{Y} = 1$
  • Does the ordering of these rules matter? Yes/No because __________________________________________________________________
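
As a concrete (hypothetical) illustration, the sketch below fits a tiny AND dataset and walks each root-to-leaf path to print a rule; the dataset and the use of scikit-learn are assumptions, not part of these notes.

```python
# Sketch: turn each root-to-leaf path of a fitted tree into a rule.
# The AND dataset and the use of scikit-learn are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy dataset: Y = X1 AND X2
y = [0, 0, 0, 1]

clf = DecisionTreeClassifier().fit(X, y)
t = clf.tree_

def paths(node=0, conds=()):
    """Yield (antecedent, class) for every root-to-leaf path."""
    if t.children_left[node] == -1:            # leaf node
        yield conds, clf.classes_[t.value[node][0].argmax()]
        return
    f, thr = t.feature[node], t.threshold[node]
    yield from paths(t.children_left[node], conds + ((f, "<=", thr),))
    yield from paths(t.children_right[node], conds + ((f, ">", thr),))

for conds, label in paths():
    ante = " AND ".join(f"X{f + 1} {op} {thr:.1f}" for f, op, thr in conds)
    print(f"{ante} => Y = {label}")
```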

Sequential covering

  • S________-and-c________ (c.f. divide-and-conquer):
    1. Learn a good rule.
    2. Remove covered instances and repeat step 1 until all instances are covered.
  • How to learn a good rule?
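
Before answering that, the outer loop itself can be written down as a minimal skeleton (plain Python; `learn_one_rule` and `covers` are placeholders, not library functions):

```python
# Skeleton of sequential covering (separate-and-conquer).
# learn_one_rule is a placeholder for any single-rule learner,
# e.g. the PART or confidence-greedy learners sketched below.
# Assumes each learned rule covers at least one remaining instance.
def sequential_covering(instances, learn_one_rule, covers):
    rules = []
    while instances:                       # until every instance is covered
        rule = learn_one_rule(instances)   # step 1: learn a good rule
        rules.append(rule)
        # step 2: separate out the covered instances, then repeat step 1
        instances = [x for x in instances if not covers(rule, x)]
    return rules
```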

PART (partial tree) decision list

  • PART (partial tree) decision list:
    1. Build a new decision tree (by C4.5) and extract the rule that maximizes coverage: the fraction of instances satisfying the antecedent.
    2. Remove covered instances and repeat 1 until all instances are covered.
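
A sketch of one PART iteration under two assumptions not in the notes: rules are `(antecedent_dict, label)` pairs, e.g. as read off the path-walking sketch above, and the C4.5 tree-building step is left abstract:

```python
# Sketch of one PART iteration: among the rules read off a freshly
# built (partial) tree, keep the one with maximum coverage and drop
# the instances it covers. build_tree_rules is a placeholder for the
# C4.5 step and is assumed to return at least one rule.
def coverage(antecedent, X):
    """Fraction of instances satisfying every test in the antecedent."""
    return sum(all(x[f] == v for f, v in antecedent.items())
               for x in X) / len(X)

def part_step(X, y, build_tree_rules):
    rules = build_tree_rules(X, y)          # paths of a new decision tree
    best = max(rules, key=lambda r: coverage(r[0], X))
    keep = [i for i, x in enumerate(X)
            if not all(x[f] == v for f, v in best[0].items())]
    return best, [X[i] for i in keep], [y[i] for i in keep]
```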

Example

  1. Rule 1: ________________

    1. $\R{X}_1 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage: } \underline{\phantom{xxx}}\%)$
    2. $\R{X}_1 = 1, \R{X}_2 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage: } \underline{\phantom{xxx}}\%)$
    3. $\R{X}_1 = 1, \R{X}_2 = 1 \Rightarrow \R{Y} = 1 \quad (\text{coverage: } \underline{\phantom{xxx}}\%)$
  2. Rule 2: ________________

    1. $\R{X}_2 = 0 \Rightarrow \R{Y} = 0 \quad (\text{coverage: } \underline{\phantom{xxx}}\%)$
    2. $\R{X}_2 = 1 \Rightarrow \R{Y} = 1 \quad (\text{coverage: } \underline{\phantom{xxx}}\%)$
  3. Default rule: $\R{Y} = \underline{\phantom{xxx}}$

  4. Issue: [Time complexity] _______________________________________
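
The coverage percentages above are left blank because they depend on the class distribution; on a hypothetical dataset with exactly one instance per (X1, X2) combination, they work out as in this check:

```python
# Coverage check on a hypothetical dataset with one instance per
# (X1, X2) combination; the notes leave the real distribution blank.
data = [{"X1": 0, "X2": 0}, {"X1": 0, "X2": 1},
        {"X1": 1, "X2": 0}, {"X1": 1, "X2": 1}]

def coverage(antecedent):
    return 100 * sum(all(x[f] == v for f, v in antecedent.items())
                     for x in data) / len(data)

print(coverage({"X1": 0}))             # X1=0       => Y=0: 50.0 %
print(coverage({"X1": 1, "X2": 0}))    # X1=1, X2=0 => Y=0: 25.0 %
print(coverage({"X1": 1, "X2": 1}))    # X1=1, X2=1 => Y=1: 25.0 %
```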

Generating rules directly

  1. Start with ZeroR and add conjuncts to improve confidence: the fraction of covered instances that are correctly classified (see the sketch after this list).

    • Rule 1: $\R{Y} = 0$
      • Confidence: $\underline{\phantom{xxx}}\%$
    • Rule 1 (refined): $\R{X}_1 = 0 \Rightarrow \R{Y} = 0$
      • Confidence: $\underline{\phantom{xxx}}\%$
  2. Repeatedly add new rules to cover remaining tuples

    • Rule 2: $\R{Y} = 0$
      • Confidence: $\underline{\phantom{xxx}}\%$
    • Rule 2 (refined): $\R{X}_2 = 0 \Rightarrow \R{Y} = 0$
      • Confidence: $\underline{\phantom{xxx}}\%$
    • Default rule: $\R{Y} = \underline{\phantom{xxx}}$
  • Decision list
    1. Rule 1: $\R{X}_1 = 0 \Rightarrow \R{Y} = 0$
    2. Rule 2: $\R{X}_2 = 0 \Rightarrow \R{Y} = 0$
    3. Default rule: $\R{Y} = 1$
  • Is the list best possible? Yes/No
    1. Time to detect positive class: $\underline{\phantom{xxx}}$
    2. Length of the list: $\underline{\phantom{xxx}}$
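
A minimal sketch of this greedy growth, assuming binary features and rows stored as dicts with the label under key "Y" (both assumptions, not from the notes):

```python
# Greedy rule growth: start from an empty antecedent (ZeroR-style)
# and keep adding the single test that most improves confidence.
# Assumes binary features and rows as dicts with the label under "Y".
def confidence(antecedent, data, label):
    covered = [x for x in data
               if all(x[f] == v for f, v in antecedent.items())]
    if not covered:
        return 0.0
    return sum(x["Y"] == label for x in covered) / len(covered)

def grow_rule(data, label, features):
    antecedent = {}
    while True:
        candidates = [(f, v) for f in features if f not in antecedent
                      for v in (0, 1)]
        if not candidates:
            return antecedent
        f, v = max(candidates,
                   key=lambda c: confidence({**antecedent, c[0]: c[1]},
                                            data, label))
        # stop when no conjunct strictly improves confidence
        if confidence({**antecedent, f: v}, data, label) <= \
                confidence(antecedent, data, label):
            return antecedent
        antecedent[f] = v
```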

Class-based ordering

  • Learn rules for positive class first:
    1. Rule 1:
      1. $\R{Y} = 1 \quad (\text{confidence: } \underline{\phantom{xxx}}\%)$
      2. $\R{X}_1 = \underline{\phantom{xxx}} \Rightarrow \R{Y} = 1 \quad (\text{confidence: } \underline{\phantom{xxx}}\%)$
      3. $\R{X}_1 = \underline{\phantom{xxx}}, \R{X}_2 = \underline{\phantom{xxx}} \Rightarrow \R{Y} = 1 \quad (\text{confidence: } \underline{\phantom{xxx}}\%)$
    2. Default rule: $\R{Y} = \underline{\phantom{xxx}}$
  • Will the above guarantee a short decision list in general? Yes/No because $\underline{\phantom{xxx}}$
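
Schematically, class-based ordering only grows rules for the positive class and lets a default rule absorb the rest (a sketch; `grow_rule` and `covers` are placeholders, as before):

```python
# Sketch of class-based ordering: grow rules for the positive class
# until all positives are covered, then append a default rule.
# grow_rule and covers are placeholders, as in the earlier sketches.
def class_based_list(instances, grow_rule, covers):
    rules = []
    while any(x["Y"] == 1 for x in instances):
        rule = grow_rule(instances, label=1)    # rules predict Y = 1 only
        rules.append(rule)
        instances = [x for x in instances if not covers(rule, x)]
    rules.append("default: Y = 0")              # everything left is Y = 0
    return rules
```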

First Order Inductive Learner (FOIL) gain

  • Add the conjunct that maximizes
    \begin{align} \op{FOIL\_Gain} &= p' \left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right) \end{align}
    • Change in the number of positives: $p \rightarrow p'$
    • Change in the number of negatives: $n \rightarrow n'$
  • $\R{Y} = 1 \;\rightarrow\; \R{X}_1 = 0 \Rightarrow \R{Y} = 1$:

    $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$

  • $\R{Y} = 1 \;\rightarrow\; \R{X}_1 = 1 \Rightarrow \R{Y} = 1$:

    $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$

  • First/Second is better.

  • $\R{X}_1 = 1 \Rightarrow \R{Y} = 1 \;\rightarrow\; \R{X}_1 = 1, \R{X}_2 = 0 \Rightarrow \R{Y} = 1$:

    $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$

  • $\R{X}_1 = 1 \Rightarrow \R{Y} = 1 \;\rightarrow\; \R{X}_1 = 1, \R{X}_2 = 1 \Rightarrow \R{Y} = 1$:

    $\op{FOIL\_Gain} = \underline{\phantom{\kern11em}}$

  • First/Second is better.

\begin{align}
\op{FOIL\_Gain} &= p' \left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right)\\
&= \underbrace{(p' + n')}_{\text{(1)}} \underbrace{\frac{p'}{p' + n'}}_{\text{(2)}} \underbrace{\left( \log \frac{p'}{p' + n'} - \log \frac{p}{p + n} \right)}_{\text{(3)}}
\end{align}
  • Heuristics:

    • (1) favors rules with large coverage/confidence.
    • (2)×(3) favors rules with large coverage/confidence given the same coverage/confidence.
    • (3) ensures FOIL_Gain\op{FOIL\_Gain} is positive if coverage/confidence increases.
  • [Challenge] Why not use information gain or gain ratio?
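
A direct transcription of the formula above (log base 2, the usual FOIL convention, is an assumption here since the notes write only log):

```python
# FOIL_Gain exactly as in the formula above; p, n are the positives
# and negatives covered before refining, p2, n2 after. Assumes p2 > 0
# so that the logarithm is defined.
from math import log2

def foil_gain(p, n, p2, n2):
    return p2 * (log2(p2 / (p2 + n2)) - log2(p / (p + n)))
```

Comparing two candidate refinements, as in the blanks above, amounts to calling this once per candidate and keeping the larger value.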

How to avoid overfitting?

  • Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
  • After each new rule, eliminate a conjunct (starting with the most recently added one) if it improves the following on a v_________ set:
    $\op{FOIL\_Prune} = \frac{p - n}{p + n}$
    or equivalently reduces
    $\op{error} = \frac{n}{p + n}.$
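
A sketch of the pruning step, with the same dict-based row representation assumed earlier; a rule is a list of (feature, value) conjuncts with the most recently added one last:

```python
# RIPPER-style pruning sketch: drop the most recently added conjunct
# while doing so improves FOIL_Prune on a held-out pruning set.
# Rows are dicts with the label under "Y", as in earlier sketches.
def foil_prune(conjuncts, prune_set, label):
    covered = [x for x in prune_set
               if all(x[f] == v for f, v in conjuncts)]
    if not covered:
        return float("-inf")
    p = sum(x["Y"] == label for x in covered)
    n = len(covered) - p
    return (p - n) / (p + n)

def prune(conjuncts, prune_set, label):
    while len(conjuncts) > 1 and \
            foil_prune(conjuncts[:-1], prune_set, label) > \
            foil_prune(conjuncts, prune_set, label):
        conjuncts = conjuncts[:-1]     # eliminate the last-added conjunct
    return conjuncts
```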
