Introduction
This notebook demonstrates the data mining package written in Maxima, which is helpful for
- computing some mathematical criteria precisely without numerical error/instability, and
- creating randomized Moodle STACK questions.
The implementations are simplified and may not scale to large data sets.
To load the package, run the following cell:
load("datamining.mac")$
To learn Maxima, you may use the describe function or refer to the documentation for more details:
describe(block)$
As an example, the following defines a function that computes the maximum of its arguments together with the index at which it first occurs:
maxima([lst]):=
    if length(lst)>1
    /* recur on tail maxima (tm) */
    then block(
        [tm: apply('maxima, rest(lst))],
        if lst[1]>=tm[2]
        then maxima(lst[1])
        else [tm[1]+1, tm[2]]
    )
    /* base cases */
    else if length(lst)>0
    then [1, lst[1]]
    else [0, -inf]$
maxima(1,2,3,2,1);
In the above example, maxima([lst]) is a recursive function that
- takes a variable number of arguments, which are stored in lst as a list, and
- returns a list [i,m] as follows:
  - If lst is non-empty, lst[i]=m is a maximum element of lst and i is the smallest such index.
  - If lst is empty, then [0,-inf] is returned, following the conventions that
    - the maximum element of an empty list [] of numbers is -inf, and
    - Maxima uses 1-based numbering, so 0 is the index of an imaginary item before the first item in a list.
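As a cross-check of the specification above, the same [i,m] pair can also be computed without recursion. The following is a minimal sketch that assumes only the built-in functions lmax and sublist_indices; the name maxima_builtin is hypothetical and not part of the package.

```maxima
/* Sketch: non-recursive variant built on Maxima built-ins only. */
/* maxima_builtin is a hypothetical name, not a package function. */
maxima_builtin([lst]) :=
    if emptyp(lst)
    then [0, -inf]                       /* same convention for the empty list */
    else block(
        [m: lmax(lst)],                  /* largest element */
        /* indices of all elements equal to m; the first one is the smallest */
        [first(sublist_indices(lst, lambda([x], is(x = m)))), m]
    )$

maxima_builtin(3, 1, 3);   /* ties resolve to the smallest index: [1, 3] */
maxima_builtin();          /* empty argument list */
```

Both versions should agree on every input; the recursive one shows more of the language, while this one is easier to verify by eye.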
Generate data from lists
Data is a matrix of feature values associated with feature names. Data can be created by build_data_from_list(fns, lst) where
- fns is a list of feature names, and
- lst is a list of instances, which are lists of feature values corresponding to the feature names.
set_draw_defaults(file_name="images/maxplot", terminal=svg, point_type=square, point_size=2)$
block(
[
fns: ['i, 'X_1, 'X_2, 'Y], /* feature names */
lst: [[1, 0, 0, 0], [2, 1, 1, 1]], /* instances */
target: 'Y,
xy: ['X_1, 'X_2],
data
],
data: build_data_from_list(fns, lst),
plot_labeled_data(data,xy,target),
[
data,
feature_names(data),
size(data),
feature_index(fns, target),
get_data(data, 1),
feature_values(data, target)
]
);
Information about the data can be obtained using other functions:
- feature_names(data) returns the feature names of data.
- size(data) returns the number of instances of data.
- feature_index(fns, fn) returns the index of a feature named fn in the list fns of feature names.
- get_data(data, i) returns the i-th instance of data.
- feature_values(data, fn) returns the list of feature values of the feature fn.
plot_labeled_data(data,xy,target) plots the labeled data where
- xy specifies the pair of features for the $x$ and $y$ axes, and
- target is used to color code the data points.
Generate data with rules
Data can also be generated (randomly) according to some specified rules using build_data(fns, gen, n) where
- fns is a list of feature names,
- gen is a function that takes a unique index and generates an instance associated with the index, and
- n is the number of instances to generate.
block(
[
fns: ['i, 'X_1, 'X_2, 'Y],
gen: lambda([i],
[
i,
random(3),
random(3),
if 'X_1<1 and 'X_2>0 then 1 else 0
]),
n: 10
],
build_data(fns, gen, n)
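);
In the above example,
- $i$ is the unique index,
- $X_1$ and $X_2$ are generated uniformly at random from $\{0, 1, 2\}$, and
- $Y$ is a deterministic function of $X_1$ and $X_2$, namely,
$$Y = \begin{cases} 1 & \text{if } X_1 < 1 \text{ and } X_2 > 0, \\ 0 & \text{otherwise.} \end{cases}$$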
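The generation rule in the example above can also be sketched with built-in functions only, which may help when checking what gen is expected to produce. This is not the package's implementation; rule and rows are hypothetical names introduced for illustration.

```maxima
/* Sketch with built-in functions only; not the package's implementation. */
/* rule is a hypothetical helper mirroring the condition used in gen. */
rule(x1, x2) := if x1 < 1 and x2 > 0 then 1 else 0$

rows: makelist(
    block([x1: random(3), x2: random(3)],   /* each value is 0, 1 or 2 */
        [i, x1, x2, rule(x1, x2)]),
    i, 1, 10)$

apply(matrix, rows);   /* one row per instance: i, X_1, X_2, Y */
```

Here the feature values are passed to rule explicitly, whereas in the package example the quoted feature names 'X_1 and 'X_2 stand in for the generated values.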
Transform features
New features can be created by transforming existing ones using transform_features(data, nfns, ngen) where
- data is a data set,
- nfns is the list of new feature names, and
- ngen is a function that takes a unique index and returns an instance.
block(
[
fns: ['X_1, 'X_2],
gen: lambda([i],
[
random(3),
random(3)
]),
n: 10,
nfns: ['i, 'X_1, 'X_2, 'Y],
ngen: lambda([i],
[
i,
'X_1,
'X_2,
if 'X_1<1 and 'X_2>0 then 1 else 0
]
),
data
],
data: build_data(fns, gen, n),
[data, transform_features(data, nfns, ngen)]
);
In the above example,
- the features $X_1$ and $X_2$ in data are transformed to create the feature $Y$, and
- the row index is used to create the feature $i$.
Subsample data
To subsample data based on specific conditions, use subsample_data(data, cond) where
- data is the data to subsample, and
- cond is a function that takes a row index and returns a boolean expression on the feature names.
It returns data restricted to the instances indexed by i for which cond(i) evaluates to true once the feature names are replaced by the corresponding feature values.
block(
[
fns: ['X_1, 'X_2],
gen: lambda([i],
[
random(3),
random(3)
]),
n: 10,
cond: lambda([i],
'X_1<1 and 'X_2>0
),
data
],
data: build_data(fns, gen, n),
[data, subsample_data(data, cond)]
);
In the above example, only instances with $X_1 < 1$ and $X_2 > 0$ are returned.
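The filtering idea can be illustrated on a plain list of instances using the built-in sublist function. This is a minimal sketch under the assumption that each instance is a list [X_1, X_2]; rows and kept are hypothetical names, and the package itself may work differently.

```maxima
/* Sketch with built-in functions only; rows and kept are hypothetical names. */
rows: [[0, 1], [2, 1], [0, 0], [0, 2]]$   /* instances as [X_1, X_2] */

/* keep the instances satisfying X_1 < 1 and X_2 > 0 */
kept: sublist(rows, lambda([r], is(r[1] < 1 and r[2] > 0)));
/* kept is [[0, 1], [0, 2]] */
```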
Combine data
Data can be stacked (vertically) by stack_data(data_1, data_2, ...) where data_1, data_2, ... are data sets with the same list of features.
block(
[
fns: ['i, 'X_1, 'X_2],
data_1, data_2, data
],
data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),4),
data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),4),
data: transform_features(stack_data(data_1, data_2), fns, lambda([i], [i, 'X_1, 'X_2])),
[data_1, data_2, data]
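);
In the above example, data consists of the instances from data_1 and data_2, and transform_features is applied to the stacked data so that the feature i is recreated from the row index.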
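Vertical stacking followed by re-indexing can be sketched on plain matrices with the built-in addrow function. The names m_1, m_2, stacked and reindexed are hypothetical, and the sketch assumes the first column holds the index i; the package may store data differently.

```maxima
/* Sketch with built-in functions only; the package may store data differently. */
m_1: matrix([1, 0, 1], [2, 1, 0])$   /* columns: i, X_1, X_2 */
m_2: matrix([1, 3, 1], [2, 4, 0])$

/* rows of m_1 followed by rows of m_2; the index column repeats 1, 2 */
stacked: addrow(m_1, m_2);

/* replace the first entry of each row by its row number to restore uniqueness */
reindexed: apply(matrix,
    makelist(cons(i, rest(stacked[i])), i, 1, length(stacked)));
```

After stacking, the index column is no longer unique, which is why the example above regenerates i from the row index.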