load("datamining.mac")$

Decision tree induction

Distribution
empirical(lst) computes the empirical distribution of the list lst.
block(
[lst:[2,2,1,3,3,3]],
empirical(lst)
);

A pair is returned, where
- the first element is the list of unique values sorted in ascending order, and
- the second element is their fractional numbers of occurrences.
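For instance, for the list [2,2,1,3,3,3] above, the pair should be ([1,2,3], [1/6,1/3,1/2]). A minimal Python sketch of the same computation (the name empirical mirrors the Maxima function; this is an illustration, not the package's code):

```python
from collections import Counter
from fractions import Fraction

def empirical(lst):
    """Return (sorted unique values, their fractional frequencies)."""
    counts = Counter(lst)
    values = sorted(counts)
    n = len(lst)
    freqs = [Fraction(counts[v], n) for v in values]
    return values, freqs

values, freqs = empirical([2, 2, 1, 3, 3, 3])
# values == [1, 2, 3]; freqs == [1/6, 1/3, 1/2]
```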
Information gain
An impurity measure for decision tree induction is the entropy, computed as entropy(ps) for a distribution ps given as a list of probability masses:
entropy(ps);

The information gain ratios and related information quantities can be computed as follows:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
conds: ['X_1, 'X_2],
target: 'Y,
data, info
],
data: build_data(fns, gen, n),
[
data,
Info(data, target),
build_data_from_list(
['X, 'Info[X], 'Gain[X], 'SplitInfo[X], 'GainRatio[X]],
makelist(
map('simplify,
[X,
InfoX(data, target, X),
Gain(data, target, X),
SplitInfo(data, X),
GainRatio(data, target, X)]
),
X, conds
)
)
]
);

where
- Info(data, target) computes the information content (entropy) of target in data.
- InfoX(data, target, X) computes the conditional information (conditional entropy) of target given X.
- Gain(data, target, X) calculates the information gain of target with X.
- SplitInfo(data, X) calculates the split information (entropy) of X.
- GainRatio(data, target, X) calculates the information gain ratio of target with X.
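These quantities relate by Gain = Info − InfoX and GainRatio = Gain / SplitInfo. Assuming the standard base-2 definitions, the following Python sketch illustrates the arithmetic on a hypothetical toy split (the names mirror the Maxima functions, but this is illustrative, not the package's code):

```python
from math import log2

def entropy(ps):
    """Base-2 entropy of a distribution given as a list of probability masses."""
    return -sum(p * log2(p) for p in ps if p > 0)

def dist(values):
    """Empirical distribution (probability masses) of a list of values."""
    return [values.count(v) / len(values) for v in sorted(set(values))]

# Toy dataset of (X, Y) pairs: split on the condition X and measure Y.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]

info = entropy(dist([y for _, y in data]))          # Info: entropy of Y
groups = {x: [y for x2, y in data if x2 == x] for x, _ in data}
weights = [len(ys) / len(data) for ys in groups.values()]
info_x = sum(w * entropy(dist(ys))                  # InfoX: conditional entropy
             for w, ys in zip(weights, groups.values()))
gain = info - info_x                                # Gain = Info - InfoX
split_info = entropy(weights)                       # SplitInfo: entropy of the split
gain_ratio = gain / split_info                      # GainRatio = Gain / SplitInfo
```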
describe(makelist)$
describe(map)$

Gini impurity
Another impurity measure is the Gini impurity, which is computed as gini(ps):
gini(ps);

The quantities related to the Gini impurity can be computed as follows:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
conds: ['X_1, 'X_2, chi('X_1<=0.5), chi('X_2>0.5)],
target: 'Y,
data
],
data: build_data(fns, gen, n),
[
data, Gini(data, target),
build_data_from_list(
['X, 'Gini[X], 'GiniDrop[X]],
makelist(
[X, GiniX(data, target, X), GiniDrop(data, target, X)],
X, conds
)
)
]
);

where
- Gini(data, target) computes the Gini impurity of target in data.
- GiniX(data, target, X) computes the conditional Gini impurity of target conditioned on X.
- GiniDrop(data, target, X) computes the drop in Gini impurity for a splitting criterion X.
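Assuming gini follows the standard definition 1 − Σ pᵢ², the conditional Gini impurity is the coverage-weighted average of the impurities over the split, and the drop is the difference. A Python sketch on a hypothetical toy split (illustrative, not the package's code):

```python
def gini(ps):
    """Gini impurity 1 - sum(p_i^2) of a distribution."""
    return 1 - sum(p * p for p in ps)

def dist(values):
    """Empirical distribution (probability masses) of a list of values."""
    return [values.count(v) / len(values) for v in set(values)]

# Toy dataset of (X, Y) pairs: split on the condition X.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1)]

g = gini(dist([y for _, y in data]))            # Gini impurity of Y
groups = {x: [y for x2, y in data if x2 == x] for x, _ in data}
g_x = sum(len(ys) / len(data) * gini(dist(ys))  # conditional Gini impurity
          for ys in groups.values())
g_drop = g - g_x                                # GiniDrop = Gini - GiniX
```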
Rule-based classifier

FOIL gain
The following formula computes the FOIL gain
- from a rule covering p_0 positives and n_0 negatives
- to a rule covering p_1 positives and n_1 negatives.
foilgain(p_0,n_0,p_1,n_1);

To compute the FOIL gain from data:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
cjts: ['X_1=1, 'X_2=1],
target: 'Y,
data, R
],
R: [ar(rest(cjts, -1),target=1), ar(cjts,target=1)],
data: build_data(fns, gen, n),
[data,
build_data_from_list(
["Original rule", "New rule", 'FOILGain],
[[R[1], R[2], FOILGain(data, target, cjts)]])]
);

FOILGain(data, target, cjts) returns the FOIL gain from the original rule to the new rule, where
- the original rule has the conjuncts rest(cjts, -1) as its antecedent, and
- the new rule has the conjuncts cjts as its antecedent,
and rest(cjts, -1) is the list of conjuncts in cjts except the last one.
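Assuming foilgain implements the standard FOIL gain formula p₁·(log₂(p₁/(p₁+n₁)) − log₂(p₀/(p₀+n₀))), a Python sketch with hypothetical coverage counts (illustrative, not the package's code):

```python
from math import log2

def foilgain(p0, n0, p1, n1):
    """FOIL gain from a rule covering (p0, n0) to one covering (p1, n1)."""
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Specializing a rule covering 4 positives and 4 negatives into one
# covering 3 positives and 1 negative raises the precision, so the gain
# is positive.
g = foilgain(4, 4, 3, 1)
```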
FOIL prune
FOIL prune can be computed from the number p of positives and the number n of negatives covered by a rule.
foilprune(p,n);

To compute the FOIL prune values from data:
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
n: 6,
gen: lambda([i], [i, random(2), random(2), random(2)]),
cjts: ['X_1=1, 'X_2=1],
target: 'Y,
data, R, FP
],
R: [ar(cjts,target=1), ar(rest(cjts, -1),target=1)],
data: build_data(fns, gen, n),
FP: FOILPrune(data, target, cjts),
[data,
build_data_from_list(
["Rule", 'FOILPrune],
makelist([R[i], FP[i]], i, [1,2]))]
);

It returns a pair consisting of the FOIL prune values for the rules whose antecedents are
- cjts, and
- rest(cjts, -1).
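Assuming foilprune computes the standard pruning metric (p − n)/(p + n), a Python sketch with hypothetical coverage counts (illustrative, not the package's code):

```python
def foilprune(p, n):
    """FOIL prune value (p - n) / (p + n) for a rule covering
    p positives and n negatives."""
    return (p - n) / (p + n)

# Comparing a rule with its last conjunct against the pruned rule:
# here pruning lowers the value (0.5 -> 1/3), so the conjunct is kept.
before = foilprune(3, 1)   # rule with the last conjunct
after = foilprune(4, 2)    # rule with the last conjunct dropped
```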