Classification

load("datamining.mac")$

Decision tree induction¶

Distribution¶

empirical(lst) computes the empirical distribution of the list lst.

block(
    [lst:[2,2,1,3,3,3]],
    empirical(lst)
);

A pair is returned, where

the first element is the list of unique values sorted in ascending order, and
the second element is their fractional number of occurences.

Information gain¶

An impurity measure for decision tree induction is entropy computed as entropy(ps) for some distribution ps as a list of probability masses:

entropy(ps);

The information gain ratios and related information quantities can be computed as follows:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2],
        target: 'Y,
        data, info
    ],
    data: build_data(fns, gen, n),
    [
        data,
        Info(data, target),
        build_data_from_list(
            ['X, 'Info[X], 'Gain[X], 'SplitInfo[X], 'GainRatio[X]],
            makelist(
                map('simplify,
                    [X,
                     InfoX(data, target, X), 
                     Gain(data, target, X), 
                     SplitInfo(data, X), 
                     GainRatio(data, target, X)]
                ), 
                X, conds
            )
        )
    ]
);

where

Info(data, target) computes the information content (entropy) of target in data.
InfoX(data, target, X) computes the information (conditional entropy) given X.
Gain(data, target, X) calculates the information gain of target with X.
SplitInfo(data, X) calculates the split information (entropy) of X.
GainRatio(data, target, X) calculates the information gain ratio of target with X.

describe(makelist)$

describe(map)$

Gini impurity¶

Another impurity measure is the Gini impurity, which is computed as gini(ps):

gini(ps);

The quantity related to the Gini impurity can be computed as follows:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2, chi('X_1<=0.5), chi('X_2>0.5)],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [
        data, Gini(data, target),
        build_data_from_list(
            ['X, 'Gini[X], 'GiniDrop[X]],
            makelist(
                [X, GiniX(data, target, X), GiniDrop(data, target, X)],
                X, conds
            )
        )
    ]
);

where

Gini(data, target) computes the Gini impurity of target in data.
GiniX(data, target, X) computes the conditional Gini impurity of target conditioned on X.
GiniDrop(data, target, X) computes the drop in Gini impurity for a splitting criterion X.

Rule-based classifier¶

FOIL gain¶

The following formula computes the FOIL gain

from a rule covering p_0 positives and n_0 negatives
to a rule covering p_1 positives and n_1 negatives.

foilgain(p_0,n_0,p_1,n_1);

To compute FOIL gain from data:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y
    ],
    R: [ar(rest(cjts, -1),target=1), ar(cjts,target=1)],
    data: build_data(fns, gen, n),
    [data, 
    build_data_from_list(
        ["Original rule", "New rule", 'FOILGain],
        [[R[1], R[2], FOILGain(data, target, cjts)]])]
);

FOILGain(data, target, cjts) returns the FOIL gain from rule $R'$ to rule $R$ where

$R'$ : rest(cjts,-1) $\implies Y=1$
$R$ : cjts $\implies Y=1$

and rest(cjts,-1) is the list of conjuncts in cjts except the last one.

FOIL prune¶

FOIL prune can be computed from the number p of positives and the number n of negatives covered by a rule.

foilprune(p,n);

To compute FOIL prune from data:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y,
        data
    ],
    R: [ar(cjts,target=1), ar(rest(cjts, -1),target=1)],
    data: build_data(fns, gen, n),
    FP: FOILPrune(data, target, cjts),
    [data, 
    build_data_from_list(
        ["Rule", 'FOILPrune],
        makelist([R[i], FP[i]], i, [1,2]))]
);

It returns a pair consisting of the FOIL prunes for the rules

$R$ : cjts $\implies Y=1$
$R'$ : rest(cjts,-1) $\implies Y=1$