load("datamining.mac")$
Decision tree induction
Distribution
empirical(lst) computes the empirical distribution of the list lst.
block(
[lst:[2,2,1,3,3,3]],
empirical(lst)
);
A pair is returned, where
- the first element is the list of unique values sorted in ascending order, and
- the second element is their fractional numbers of occurrences.
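For reference, the same computation can be sketched in Python; the name empirical_dist and the tuple-of-lists return shape are illustrative, not part of datamining.mac.

```python
from collections import Counter

def empirical_dist(lst):
    """Return (sorted unique values, fractional counts), mirroring empirical()."""
    counts = Counter(lst)
    values = sorted(counts)
    total = len(lst)
    return values, [counts[v] / total for v in values]

values, probs = empirical_dist([2, 2, 1, 3, 3, 3])
# values = [1, 2, 3], probs = [1/6, 2/6, 3/6]
```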
Information gain
An impurity measure for decision tree induction is the entropy, computed as entropy(ps) for a distribution ps given as a list of probability masses:
entropy(ps);
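For concreteness, the entropy of a distribution ps is -sum(p * log2(p)); a minimal Python sketch (the function here is illustrative, not the datamining.mac implementation):

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a distribution given as probability masses."""
    # Terms with p = 0 contribute 0 by convention and are skipped.
    return -sum(p * math.log2(p) for p in ps if p > 0)

entropy([0.5, 0.5])  # 1.0 bit for a fair coin
```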
The information gain ratios and related information quantities can be computed as follows:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
conds: ['X_1, 'X_2],
target: 'Y,
data, info
],
data: build_data(fns, gen, n),
[
data,
Info(data, target),
build_data_from_list(
['X, 'Info[X], 'Gain[X], 'SplitInfo[X], 'GainRatio[X]],
makelist(
map('simplify,
[X,
InfoX(data, target, X),
Gain(data, target, X),
SplitInfo(data, X),
GainRatio(data, target, X)]
),
X, conds
)
)
]
);
where
- Info(data, target) computes the information content (entropy) of target in data.
- InfoX(data, target, X) computes the conditional entropy of target given X.
- Gain(data, target, X) calculates the information gain of target with X.
- SplitInfo(data, X) calculates the split information (entropy) of X.
- GainRatio(data, target, X) calculates the information gain ratio of target with X.
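To make these definitions concrete, here is a hedged Python sketch of the same quantities for a dataset given as a list of dicts; all names (info, info_x, gain, split_info, gain_ratio) are illustrative stand-ins, not the Maxima functions in datamining.mac.

```python
import math
from collections import Counter

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def dist(values):
    """Empirical probability masses of a list of values."""
    n = len(values)
    return [c / n for c in Counter(values).values()]

def info(data, target):
    """Like Info(data, target): entropy of the target column."""
    return entropy(dist([row[target] for row in data]))

def info_x(data, target, x):
    """Like InfoX(data, target, X): conditional entropy of target given x."""
    n = len(data)
    groups = {}
    for row in data:
        groups.setdefault(row[x], []).append(row[target])
    return sum(len(g) / n * entropy(dist(g)) for g in groups.values())

def gain(data, target, x):
    """Like Gain(data, target, X): information gain of target with x."""
    return info(data, target) - info_x(data, target, x)

def split_info(data, x):
    """Like SplitInfo(data, X): entropy of the split induced by x."""
    return entropy(dist([row[x] for row in data]))

def gain_ratio(data, target, x):
    """Like GainRatio(data, target, X); assumes x is not constant."""
    return gain(data, target, x) / split_info(data, x)

data = [{"X_1": 0, "Y": 0}, {"X_1": 0, "Y": 0},
        {"X_1": 1, "Y": 1}, {"X_1": 1, "Y": 0}]
```

On this toy dataset, splitting on X_1 leaves one pure group and one maximally impure group, so the conditional entropy is 0.5 bits.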
describe(makelist)$
describe(map)$
Gini impurity
Another impurity measure is the Gini impurity, which is computed as gini(ps):
gini(ps);
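For reference, the Gini impurity of a distribution ps is 1 - sum(p^2); a minimal Python sketch (illustrative, not the datamining.mac implementation):

```python
def gini(ps):
    """Gini impurity of a distribution given as probability masses."""
    return 1 - sum(p * p for p in ps)

gini([0.5, 0.5])  # 0.5, the maximum for two classes
```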
The quantity related to the Gini impurity can be computed as follows:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
conds: ['X_1, 'X_2, chi('X_1<=0.5), chi('X_2>0.5)],
target: 'Y,
data
],
data: build_data(fns, gen, n),
[
data, Gini(data, target),
build_data_from_list(
['X, 'Gini[X], 'GiniDrop[X]],
makelist(
[X, GiniX(data, target, X), GiniDrop(data, target, X)],
X, conds
)
)
]
);
where
- Gini(data, target) computes the Gini impurity of target in data.
- GiniX(data, target, X) computes the conditional Gini impurity of target conditioned on X.
- GiniDrop(data, target, X) computes the drop in Gini impurity for a splitting criterion X.
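Analogously to the information-gain quantities, the conditional Gini impurity and its drop can be sketched in Python; the names gini_x and gini_drop are illustrative stand-ins for the Maxima functions, not their implementations.

```python
from collections import Counter

def gini(ps):
    return 1 - sum(p * p for p in ps)

def dist(values):
    """Empirical probability masses of a list of values."""
    n = len(values)
    return [c / n for c in Counter(values).values()]

def gini_x(data, target, x):
    """Like GiniX: weighted Gini impurity of target after splitting on x."""
    n = len(data)
    groups = {}
    for row in data:
        groups.setdefault(row[x], []).append(row[target])
    return sum(len(g) / n * gini(dist(g)) for g in groups.values())

def gini_drop(data, target, x):
    """Like GiniDrop: Gini impurity of target minus its conditional Gini impurity."""
    return gini(dist([row[target] for row in data])) - gini_x(data, target, x)

data = [{"X_1": 0, "Y": 0}, {"X_1": 0, "Y": 0},
        {"X_1": 1, "Y": 1}, {"X_1": 1, "Y": 0}]
```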
Rule-based classifier
FOIL gain
The following formula computes the FOIL gain
- from a rule covering p_0 positives and n_0 negatives
- to a rule covering p_1 positives and n_1 negatives.
foilgain(p_0,n_0,p_1,n_1);
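The standard FOIL gain formula is p_1 * (log2(p_1/(p_1+n_1)) - log2(p_0/(p_0+n_0))); a hedged Python sketch, assuming datamining.mac follows this standard definition:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL gain: positives covered by the new rule, times the change in
    log-precision from the old rule (p0, n0) to the new rule (p1, n1)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

foil_gain(4, 4, 3, 1)  # 3 * (log2(3/4) - log2(1/2)) = 3 * log2(3/2)
```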
To compute FOIL gain from data:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
cjts: ['X_1=1, 'X_2=1],
target: 'Y,
R, data
],
R: [ar(rest(cjts, -1),target=1), ar(cjts,target=1)],
data: build_data(fns, gen, n),
[data,
build_data_from_list(
["Original rule", "New rule", 'FOILGain],
[[R[1], R[2], FOILGain(data, target, cjts)]])]
);
FOILGain(data, target, cjts)
returns the FOIL gain from the original rule to the new rule, where
- the original rule has the conjuncts rest(cjts, -1), and
- the new rule has the conjuncts cjts,
and rest(cjts, -1) is the list of conjuncts in cjts except the last one.
FOIL prune
FOIL prune can be computed from the number p
of positives and the number n
of negatives covered by a rule.
foilprune(p,n);
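The standard FOIL prune metric is (p - n)/(p + n); a hedged Python sketch, assuming datamining.mac follows this standard definition:

```python
def foil_prune(p, n):
    """FOIL prune of a rule covering p positives and n negatives."""
    return (p - n) / (p + n)

foil_prune(3, 1)  # 0.5
```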
To compute FOIL prune from data:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
cjts: ['X_1=1, 'X_2=1],
target: 'Y,
data, R, FP
],
R: [ar(cjts,target=1), ar(rest(cjts, -1),target=1)],
data: build_data(fns, gen, n),
FP: FOILPrune(data, target, cjts),
[data,
build_data_from_list(
["Rule", 'FOILPrune],
makelist([R[i], FP[i]], i, [1,2]))]
);
It returns a pair consisting of the FOIL prunes for the rules with
- the conjuncts cjts, and
- the conjuncts rest(cjts, -1),
respectively.