Introduction
This notebook demonstrates the data mining package written in Maxima, which is helpful for
- computing some mathematical criteria precisely without numerical error/instability, and
- creating randomized Moodle STACK questions.
The implementations are simplified and may not scale to large data sets.
To load the package, run the following cell:
load("datamining.mac")$
To learn Maxima, you may use the describe function or refer to the documentation for more details:
describe(block)$
As an example, the following defines a function that computes the maximum of its arguments together with the index at which it first occurs:
maxima([lst]):=
    if length(lst)>1
    /* recur on tail maxima (tm) */
    then block(
        [tm :apply('maxima,rest(lst))],
        if lst[1]>=tm[2]
        then maxima(lst[1])
        else [tm[1]+1,tm[2]]
    )
    /* base cases */
    else if length(lst)>0
    then [1, lst[1]]
    else [0, -inf]$
maxima(1,2,3,2,1);
In the above example, maxima([lst]) is a recursive function that
- takes a variable number of arguments, which will be stored in lst as a list, and
- returns a list [i,m] as follows:
  - If lst is non-empty, lst[i]=m is a maximum element of lst and i is the smallest such index.
  - If lst is empty, then [0,-inf] is returned, following the conventions that
    - the maximum element of an empty list [] of numbers is -inf, and
    - Maxima uses 1-based numbering so 0 is the index of an imaginary item before the first item in a list.
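The same pair can also be computed without recursion. The following is a minimal sketch using the built-in functions lmax and sublist_indices. The name maxima_builtin is hypothetical, and the sketch returns Maxima's built-in minf (rather than -inf) for an empty argument list.

```maxima
/* Sketch only: first index and value of the maximum, via built-ins. */
/* maxima_builtin is a hypothetical name, not part of the package.   */
maxima_builtin([lst]) :=
    if emptyp(lst)
    then [0, minf]                      /* empty case, using built-in minf */
    else block(
        [m: lmax(lst)],                 /* largest element */
        /* indices of all elements equal to m; keep the first one */
        [first(sublist_indices(lst, lambda([x], is(equal(x, m))))), m]
    )$
maxima_builtin(1, 2, 3, 2, 1);   /* [3, 3] */
maxima_builtin(3, 1, 3);         /* [1, 3]: ties resolve to the smallest index */
```

Both versions agree on non-empty inputs; the recursive one is mainly useful as an illustration of variadic arguments and block.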
Generate data from lists
Data is a matrix of feature values associated with feature names. Data can be created by build_data_from_list(fns, lst) where
- fns is a list of feature names, and
- lst is a list of instances, which are lists of feature values corresponding to the feature names.
set_draw_defaults(file_name="images/maxplot", terminal=svg, point_type=square, point_size=2)$
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],            /* feature names */
        lst: [[1, 0, 0, 0], [2, 1, 1, 1]],    /* instances */
        target: 'Y,
        xy: ['X_1, 'X_2],
        data
    ],
    data: build_data_from_list(fns, lst),
    plot_labeled_data(data,xy,target),
    [
        data,
        feature_names(data),
        size(data),
        feature_index(fns, target),
        get_data(data, 1),
        feature_values(data, target)
    ]
);
Information about the data can be obtained using other functions:
- feature_names(data) returns the feature names of data.
- size(data) returns the number of instances of data.
- feature_index(fns, fn) returns the index of a feature named fn in the list fns of feature names.
- get_data(data, i) returns the i-th instance of data.
- feature_values(data, fn) returns the list of feature values of the feature fn.
plot_labeled_data(data,xy,target) plots the labeled data, where xy specifies the pair of features for the $x$ and $y$ axes, and target is used to color code the data points.
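To make the roles of feature_index and feature_values concrete, here is a minimal sketch in plain Maxima. It assumes the data is held as an ordinary matrix alongside a list of feature names. The helpers my_feature_index and my_feature_values are hypothetical stand-ins and are not the package's implementation.

```maxima
/* Hypothetical stand-ins, NOT the package's implementation. */
fns: ['i, 'X_1, 'X_2, 'Y]$                 /* feature names */
M: matrix([1, 0, 0, 0], [2, 1, 1, 1])$     /* one row per instance */

/* position of the feature name fn in the list names */
my_feature_index(names, fn) :=
    first(sublist_indices(names, lambda([x], is(x = fn))))$

/* column of m for feature fn, returned as a list */
my_feature_values(m, names, fn) :=
    transpose(m)[my_feature_index(names, fn)]$

my_feature_index(fns, 'Y);        /* 4 */
my_feature_values(M, fns, 'Y);    /* [0, 1] */
```

Looking features up by name rather than by column number keeps the calling code readable when features are added or reordered.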
Generate data with rules
Data can also be generated (randomly) according to some specified rules using build_data(fns, gen, n) where
- fns is a list of feature names,
- gen is a function that takes a unique index and generates an instance associated with the index, and
- n is the number of instances to generate.
block(
    [
        fns: ['i, 'X_1, 'X_2, 'Y],
        gen: lambda([i],
            [
                i,
                random(3),
                random(3),
                if 'X_1<1 and 'X_2>0 then 1 else 0
            ]),
        n: 10
    ],
    build_data(fns, gen, n)
);
In the above example,
- $i$ is the unique index,
- $X_1$ and $X_2$ are generated uniformly at random from $\{0,1,2\}$, and
- $Y$ is a deterministic function of $X_1$ and $X_2$, namely, $Y=1$ if $X_1<1$ and $X_2>0$, and $Y=0$ otherwise.
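Because gen calls Maxima's built-in random, the generated data changes from run to run. If reproducible data is needed, one option is to seed the random state before generating. Below is a minimal sketch using the built-in make_random_state and set_random_state; the seed 2024 and the variable name sample are arbitrary choices for illustration.

```maxima
/* Sketch: fix the seed so that subsequent calls to random are repeatable. */
set_random_state(make_random_state(2024))$

/* ten draws of random(3); each draw is an integer in {0, 1, 2} */
sample: makelist(random(3), k, 1, 10)$

/* check that every draw lies in the expected range */
every(lambda([v], member(v, [0, 1, 2])), sample);   /* true */
```

For a positive integer n, random(n) returns an integer from 0 to n-1, which is why random(3) yields values in $\{0,1,2\}$.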
Transform features
New features can be created by transforming existing ones using transform_features(data, nfns, ngen) where
- data is a data set,
- nfns is the list of new feature names, and
- ngen is a function that takes a unique index and returns an instance.
block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i],
            [
                random(3),
                random(3)
            ]),
        n: 10,
        nfns: ['i, 'X_1, 'X_2, 'Y],
        ngen: lambda([i],
            [
                i,
                'X_1,
                'X_2,
                if 'X_1<1 and 'X_2>0 then 1 else 0
            ]
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, transform_features(data, nfns, ngen)]
);
In the above example,
- the features $X_1$ and $X_2$ in data are transformed to create the feature $Y$, and
- the row index is used to create the feature $i$.
Subsample data
To subsample data based on specific conditions, use subsample_data(data, cond) where
- data is the data to subsample, and
- cond is a function that takes a row index and returns a boolean expression on the feature names.
It returns data, keeping only the instances indexed by i for which cond(i) evaluates to true with the feature names substituted by the corresponding feature values.
block(
    [
        fns: ['X_1, 'X_2],
        gen: lambda([i],
            [
                random(3),
                random(3)
            ]),
        n: 10,
        cond: lambda([i],
            'X_1<1 and 'X_2>0
        ),
        data
    ],
    data: build_data(fns, gen, n),
    [data, subsample_data(data, cond)]
);
In the above example, only instances with $X_1<1$ and $X_2>0$ are returned.
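The substitution step described above can be illustrated in plain Maxima. The following is a minimal sketch, assuming the instances are held as a list of rows; the names rows and keep are hypothetical, and this is not how the package itself is implemented.

```maxima
/* Hypothetical sketch of condition-based selection, NOT the package's code. */
fns: ['X_1, 'X_2]$
rows: [[0, 1], [2, 1], [0, 0], [0, 2]]$
cond: lambda([i], 'X_1 < 1 and 'X_2 > 0)$

/* substitute the feature values of row i into cond(i), then test the result */
keep(i) := is(subst(map("=", fns, rows[i]), cond(i)))$

/* indices of the rows for which the substituted condition holds */
sublist_indices(makelist(i, i, 1, length(rows)), keep);
```

In this data, only the first and fourth rows satisfy $X_1<1$ and $X_2>0$, so those are the rows a condition-based subsample would be expected to keep.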
Combine data
Data can be stacked (vertically) by stack_data(data_1, data_2, ...) where each data_i is a data set with the same list of features.
block(
    [
        fns: ['i, 'X_1, 'X_2],
        data_1, data_2, data    /* keep intermediate results local to the block */
    ],
    data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),4),
    data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),4),
    data: transform_features(stack_data(data_1, data_2), fns, lambda([i], [i, 'X_1, 'X_2])),
    [data_1, data_2, data]
);
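In the above example, data consists of the instances from data_1 and data_2. The call to transform_features regenerates the feature $i$ from the row index, so that the indices remain unique after stacking.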
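Conceptually, stacking amounts to appending the rows of one matrix to another and then renumbering the index column. Here is a minimal sketch with plain matrices and the built-in addrow and genmatrix, assuming (hypothetically) that the first column holds the index. The names A, B, S and R are illustrative, and this is not the package's implementation.

```maxima
/* Hypothetical sketch of vertical stacking, NOT the package's code. */
A: matrix([1, 0, 1], [2, 1, 0])$    /* columns: i, X_1, X_2 */
B: matrix([1, 3, 1], [2, 4, 0])$    /* same columns; indices restart at 1 */

/* append the rows of B below the rows of A */
S: addrow(A, B)$

/* rebuild the first column from the row number; copy the other entries */
R: genmatrix(
    lambda([r, c], if c = 1 then r else S[r, c]),
    length(S),       /* number of rows */
    length(S[1])     /* number of columns */
);
/* matrix([1, 0, 1], [2, 1, 0], [3, 3, 1], [4, 4, 0]) */
```

Without the renumbering step, the stacked matrix S would contain the indices 1 and 2 twice, which is why the example above re-creates the index after stacking.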