Skip to article frontmatterSkip to article content
load("datamining.mac")$

Decision tree induction

Distribution

empirical(lst) computes the empirical distribution of the list lst.

block(
    [lst:[2,2,1,3,3,3]],
    empirical(lst)
);

A pair is returned, where

  • the first element is the list of unique values sorted in ascending order, and
  • the second element is their fractional number of occurences.

Information gain

An impurity measure for decision tree induction is entropy computed as entropy(ps) for some distribution ps as a list of probability masses:

entropy(ps);

The information gain ratios and related information quantities can be computed as follows:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2],
        target: 'Y,
        data, info
    ],
    data: build_data(fns, gen, n),
    [
        data,
        Info(data, target),
        build_data_from_list(
            ['X, 'Info[X], 'Gain[X], 'SplitInfo[X], 'GainRatio[X]],
            makelist(
                map('simplify,
                    [X,
                     InfoX(data, target, X), 
                     Gain(data, target, X), 
                     SplitInfo(data, X), 
                     GainRatio(data, target, X)]
                ), 
                X, conds
            )
        )
    ]
);

where

  • Info(data, target) computes the information content (entropy) of target in data.
  • InfoX(data, target, X) computes the information (conditional entropy) given X.
  • Gain(data, target, X) calculates the information gain of target with X.
  • SplitInfo(data, X) calculates the split information (entropy) of X.
  • GainRatio(data, target, X) calculates the information gain ratio of target with X.
describe(makelist)$
describe(map)$

Gini impurity

Another impurity measure is the Gini impurity, which is computed as gini(ps):

gini(ps);

The quantity related to the Gini impurity can be computed as follows:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        conds: ['X_1, 'X_2, chi('X_1<=0.5), chi('X_2>0.5)],
        target: 'Y,
        data
    ],
    data: build_data(fns, gen, n),
    [
        data, Gini(data, target),
        build_data_from_list(
            ['X, 'Gini[X], 'GiniDrop[X]],
            makelist(
                [X, GiniX(data, target, X), GiniDrop(data, target, X)],
                X, conds
            )
        )
    ]
);

where

  • Gini(data, target) computes the Gini impurity of target in data.
  • GiniX(data, target, X) computes the conditional Gini impurity of target conditioned on X.
  • GiniDrop(data, target, X) computes the drop in Gini impurity for a splitting criterion X.

Rule-based classifier

FOIL gain

The following formula computes the FOIL gain

  • from a rule covering p_0 positives and n_0 negatives
  • to a rule covering p_1 positives and n_1 negatives.
foilgain(p_0,n_0,p_1,n_1);

To compute FOIL gain from data:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y
    ],
    R: [ar(rest(cjts, -1),target=1), ar(cjts,target=1)],
    data: build_data(fns, gen, n),
    [data, 
    build_data_from_list(
        ["Original rule", "New rule", 'FOILGain],
        [[R[1], R[2], FOILGain(data, target, cjts)]])]
);

FOILGain(data, target, cjts) returns the FOIL gain from rule RR' to rule RR where

  • RR': rest(cjts,-1)     Y=1\implies Y=1
  • RR: cjts     Y=1\implies Y=1

and rest(cjts,-1) is the list of conjuncts in cjts except the last one.

FOIL prune

FOIL prune can be computed from the number p of positives and the number n of negatives covered by a rule.

foilprune(p,n);

To compute FOIL prune from data:

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        n: 6,
        gen: lambda([i], [i, random(2), random(2), random(2)]),
        cjts: ['X_1=1, 'X_2=1],
        target: 'Y,
        data
    ],
    R: [ar(cjts,target=1), ar(rest(cjts, -1),target=1)],
    data: build_data(fns, gen, n),
    FP: FOILPrune(data, target, cjts),
    [data, 
    build_data_from_list(
        ["Rule", 'FOILPrune],
        makelist([R[i], FP[i]], i, [1,2]))]
);

It returns a pair consisting of the FOIL prunes for the rules

  • RR: cjts     Y=1\implies Y=1
  • RR': rest(cjts,-1)     Y=1\implies Y=1