
Introduction

This notebook demonstrates the data mining package written in Maxima, which is helpful for

  • computing some mathematical criteria precisely without numerical error/instability, and
  • creating randomized Moodle STACK questions.

The implementations are simplified and may not be scalable to large data sets.

To load the package, run the following cell:

load("datamining.mac")$

To learn Maxima, you may use the describe function or refer to the documentation for more details:

describe(block)$

As an example, the following defines a function that finds the maximum of its arguments together with the position where it first occurs:

maxima([lst]):=
if length(lst)>1 
/* recur on tail maxima (tm) */
then block(
    [tm :apply('maxima,rest(lst))],
    if lst[1]>=tm[2] 
    then maxima(lst[1]) 
    else [tm[1]+1,tm[2]]
)
/* base cases */
else if length(lst)>0 
then [1, lst[1]]
else [0, -inf]$

maxima(1,2,3,2,1);

In the above example, maxima([lst]) is a recursive function that

  • takes a variable number of arguments, which will be stored in lst as a list, and
  • returns a list [i,m] as follows:
    • If lst is non-empty, lst[i]=m is a maximum element of lst and i is the smallest such index.
    • If lst is empty, then [0,-inf] is returned, following the conventions that
      • the maximum element of an empty list [] of numbers is -inf, and
      • Maxima uses 1-based numbering so 0 is the index of an imaginary item before the first item in a list.
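
The same conventions can be cross-checked with a non-recursive variant built from the Maxima built-ins emptyp, lmax and sublist_indices. This is only a sketch: first_max is a hypothetical helper introduced here for illustration and is not part of datamining.mac. Unlike maxima, it takes a single list argument.

```maxima
/* first_max is a hypothetical helper, not part of datamining.mac. */
first_max(lst) :=
    if emptyp(lst)
    then [0, -inf]                 /* same convention as maxima() */
    else block(
        [m: lmax(lst)],            /* largest element of lst */
        /* smallest position at which m occurs */
        [first(sublist_indices(lst, lambda([x], is(x = m)))), m]
    )$

first_max([1, 2, 3, 2, 1]);   /* [3, 3] */
first_max([2, 3, 3]);         /* [2, 3]: ties go to the smallest index */
first_max([]);                /* [0, -inf] */
```

The first result should agree with maxima(1,2,3,2,1) from the cell above.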

Generate data from lists

Data is a matrix of feature values associated with feature names. Data can be created by build_data_from_list(fns, lst) where

  • fns is a list of feature names, and
  • lst is a list of instances, which are lists of feature values corresponding to the feature names.

set_draw_defaults(file_name="images/maxplot", terminal=svg, point_type=square, point_size=2)$

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],           /* feature names */
        lst: [[1, 0, 0, 0], [2, 1, 1, 1]],   /* instances */
        target: 'Y,
        xy: ['X_1, 'X_2],
        data
    ],
    data: build_data_from_list(fns, lst),
    plot_labeled_data(data,xy,target),
    [
        data, 
        feature_names(data), 
        size(data), 
        feature_index(fns, target), 
        get_data(data, 1), 
        feature_values(data, target)
    ]
);

Information about the data can be obtained using the following functions:

  • feature_names(data) returns the feature names of data.
  • size(data) returns the number of instances of data.
  • feature_index(fns, fn) returns the index of a feature named fn in the list fns of feature names.
  • get_data(data, i) returns the i-th instance of data.
  • feature_values(data, fn) returns the list of feature values of the feature fn.
  • plot_labeled_data(data, xy, target) plots the labeled data, where
    • xy specifies the pair of features for the $x$ and $y$ axes, and
    • target is the feature used to color-code the data points.
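
To make the meaning of these accessors concrete, the sketch below mimics a few of them on a plain list of feature names and a plain list of rows, using only core Maxima functions. The names my_feature_index and my_feature_values are hypothetical stand-ins; the package's own data structure and implementation may differ.

```maxima
/* Hypothetical stand-ins on plain lists; not the package's code. */
my_feature_index(fns, fn) :=
    first(sublist_indices(fns, lambda([x], is(x = fn))))$

my_feature_values(fns, rows, fn) :=
    block(
        [j: my_feature_index(fns, fn)],   /* column of the feature fn */
        map(lambda([r], r[j]), rows)      /* pick that column from each row */
    )$

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        rows: [[1, 0, 0, 0], [2, 1, 1, 1]]
    ],
    [
        my_feature_index(fns, 'Y),           /* 4 */
        length(rows),                        /* 2 instances */
        rows[1],                             /* [1, 0, 0, 0] */
        my_feature_values(fns, rows, 'X_1)   /* [0, 1] */
    ]
);
```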

Generate data with rules

Data can also be generated (randomly) according to some specified rules using build_data(fns, gen, n) where

  • fns is a list of feature names,
  • gen is a function that takes a unique index and generates an instance associated with the index, and
  • n is the number of instances to generate.

block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        gen: lambda([i],
            [
                i,
                random(3),
                random(3),
                if 'X_1<1 and 'X_2>0 then 1 else 0
            ]),
        n: 10
    ],
    build_data(fns, gen, n)
);

In the above example,

  • $i$ is the unique index,
  • $X_1$ and $X_2$ are generated uniformly at random from $\{0,1,2\}$, and
  • $Y$ is a deterministic function of $X_1$ and $X_2$, namely,
    $$Y=\begin{cases} 1 & X_1<1, X_2>0\\ 0 & \text{otherwise.} \end{cases}$$
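
Because the rule is deterministic and the feature values come from a small finite set, the label can be tabulated exactly. The sketch below uses only core Maxima. It writes the rule as an ordinary function named rule (a hypothetical name) and assumes the two features are drawn independently, so that all 9 pairs are equally likely.

```maxima
/* rule is a hypothetical name for the labeling rule above. */
rule(x1, x2) := if x1 < 1 and x2 > 0 then 1 else 0$

/* All 9 pairs (x1, x2) with the resulting label as third entry. */
tbl: create_list([x1, x2, rule(x1, x2)], x1, [0, 1, 2], x2, [0, 1, 2]);

/* Exact fraction of pairs labeled 1, as a rational number. */
apply("+", map(third, tbl)) / length(tbl);   /* 2/9 */
```

Only the pairs (0,1) and (0,2) are labeled 1, which gives the exact value 2/9 rather than a floating-point approximation.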

Transform features

New features can be created by transforming existing ones using transform_features(data, nfns, ngen) where

  • data is a data set,
  • nfns is the list of new feature names, and
  • ngen is a function that takes a unique index and returns an instance, which may be expressed in terms of the feature names of data.

block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i], 
            [
                random(3), 
                random(3)
            ]),
        n: 10,
        nfns: ['i, 'X_1, 'X_2, 'Y],
        ngen: lambda([i],
            [
                i,
                'X_1,
                'X_2,
               if 'X_1<1 and 'X_2>0 then 1 else 0 
            ]
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, transform_features(data, nfns, ngen)]
);

In the above example,

  • the features $X_1$ and $X_2$ in data are transformed to create the feature $Y$, and
  • the row index is used to create the feature $i$.
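
One way to picture the transformation of a single row is as a substitution of feature values for feature names followed by a re-evaluation. The sketch below illustrates that idea with the core functions map, subst and ev. The helper my_transform_row is hypothetical, and the package may well do this differently.

```maxima
/* my_transform_row is a hypothetical helper, not the package's code. */
my_transform_row(fns, row, expr) :=
    block(
        [eqs: map("=", fns, row)],   /* e.g. [X_1 = 0, X_2 = 2] */
        ev(subst(eqs, expr))         /* substitute, then evaluate again */
    )$

block(
    [
        fns: ['X_1, 'X_2],
        i: 1,                        /* stands in for the row index */
        expr
    ],
    /* the condition is unknown here, so the conditional stays unevaluated */
    expr: [i, 'X_1, 'X_2, if 'X_1 < 1 and 'X_2 > 0 then 1 else 0],
    [
        my_transform_row(fns, [0, 2], expr),   /* [1, 0, 2, 1] */
        my_transform_row(fns, [2, 1], expr)    /* [1, 2, 1, 0] */
    ]
);
```

The sketch relies on X_1 and X_2 having no assigned values, so that the conditional remains symbolic until the feature values are substituted.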

Subsample data

To subsample data based on specific conditions, use subsample_data(data, cond) where

  • data is the data to subsample, and
  • cond is a function that takes a row index and returns a boolean expression on the feature names.

It returns the data set containing only the instances indexed by i for which cond(i) evaluates to true after the feature names are substituted by the corresponding feature values.

block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i],
            [
                random(3),
                random(3)
            ]),
        n: 10,
        cond: lambda([i],
            'X_1<1 and 'X_2>0
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, subsample_data(data, cond)]
);

In the above example, only instances with $X_1<1$ and $X_2>0$ are returned.
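
The same filtering can be mimicked on a plain list of rows with the built-in sublist. This is a sketch for comparison only: it uses fixed rows of the form [X_1, X_2] instead of random ones, so the result is reproducible, and it addresses the features by position rather than by name.

```maxima
/* Fixed rows [X_1, X_2]; keep those with X_1 < 1 and X_2 > 0. */
block(
    [rows: [[0, 1], [1, 2], [0, 0], [2, 1], [0, 2]]],
    sublist(rows, lambda([r], r[1] < 1 and r[2] > 0))
);   /* [[0, 1], [0, 2]] */
```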

Combine data

Data can be stacked (vertically) using stack_data(data_1, data_2, ...), where each data_i is a data set with the same list of features.

block(
    [
        fns: ['i, 'X_1, 'X_2],
        data_1, data_2, data   /* keep these local to the block */
    ],
    data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),4),
    data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),4),
    data: transform_features(stack_data(data_1, data_2), fns, lambda([i], [i, 'X_1, 'X_2])),
    [data_1, data_2, data]
);

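In the above example, data consists of the instances from data_1 and data_2, and transform_features is used to reassign the index i from the row index so that it remains unique after stacking.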
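
For comparison, stacking and re-indexing can be sketched on plain Maxima matrices with the built-ins addrow, args and makelist. The matrices m_1 and m_2 below are hypothetical stand-ins with columns i, X_1 and X_2; they are not the package's data objects.

```maxima
block(
    [
        m_1: matrix([1, 0, 1], [2, 1, 0]),   /* columns: i, X_1, X_2 */
        m_2: matrix([1, 3, 1], [2, 4, 0]),
        rows
    ],
    rows: args(addrow(m_1, m_2)),            /* stacked rows as a list */
    /* replace the first entry of each row by its new position k */
    apply(matrix, makelist(cons(k, rest(rows[k])), k, 1, length(rows)))
);
/* matrix([1, 0, 1], [2, 1, 0], [3, 3, 1], [4, 4, 0]) */
```

Without the re-indexing step, the indices 1 and 2 would each appear twice in the stacked result.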