import os

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

Requirements

Project 2 is a group project where each group should identify a relevant topic that involves all the following elements:

  1. An objective that involves extracting knowledge from a real-world dataset.
  2. A data preprocessing step that prepares the data for mining.
  3. A learning algorithm that extracts knowledge from the data.
  4. An evaluation of the mined knowledge.
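As a toy illustration of how these four elements fit together, here is a self-contained sketch in pure Python. The data, the min-max scaling step, and the nearest-mean rule are all made up for illustration; they are not a requirement for your project:

```python
# 1. Objective: predict a binary label from one feature.
#    A hard-coded list of (feature, label) pairs stands in for a real-world dataset.
raw = [(1.0, 0), (1.2, 0), (0.9, 0), (3.1, 1), (2.9, 1), (3.0, 1)]

# 2. Preprocessing: rescale the feature to the range [0, 1].
xs = [x for x, _ in raw]
lo, hi = min(xs), max(xs)
data = [((x - lo) / (hi - lo), y) for x, y in raw]

# 3. Learning: a nearest-mean classifier (store the mean feature per class).
means = {c: sum(x for x, y in data if y == c) / sum(1 for _, y in data if y == c)
         for c in {y for _, y in data}}

def predict(x):
    return min(means, key=lambda c: abs(x - means[c]))

# 4. Evaluation: accuracy on the (training) data.
accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

A real project would of course use a genuine dataset, a proper train/test split, and a learning algorithm from a library such as WEKA.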

There are several tasks associated with Project 2:

  1. Presentation video: Submit one 15-minute video per group to the Group presentation assignment on Canvas.
  2. Peer Reviews of Group Presentations: Each student will be assigned 3 group presentations to review from the Group presentation assignment on Canvas.
  3. Report: Submit one report per group to the Group report assignment on Canvas.

The project is worth 15 points, which accounts for 15% of the entire course assessment. The assessment is divided into 5 categories, each of which is scored on a scale of 0-3 points:

  • 3: Excellent
  • 2: Satisfactory
  • 1: Unsatisfactory
  • 0: Incomplete

Data Sources

Here are some websites that provide trustworthy real-world datasets:

The following sources of data are widely popular, but their datasets may be synthetic or heavily studied. If you plan to use datasets from these sources, you will need to put extra effort into the following aspects:

  1. Verify the dataset’s authenticity by citing trustworthy and original sources properly.
  2. Clearly differentiate your approach and results from existing works.

There are many other ways to find reliable data sources. For instance, you can use Google Dataset Search to locate datasets, or ask large language models (LLMs) for concrete examples.

%%ai chatgpt -f markdown
I am doing a group project in a data mining course. Explain how I can find
a good real-world dataset and the corresponding data mining objective? Give me
a list of 10 examples. (Do not use any headers in your reply.)

Group Server

Members of the same group can collaboratively work on the same notebook using a group (Jupyter) server that has higher resource limits than the individual user servers:

  • Storage: 100GB
  • Memory: 100GB
  • CPU: 32 cores for default servers without GPU, 8 cores for GPU servers
  • GPU: 48GB for GPU servers

Group servers also run JupyterLab in collaborative mode, which provides real-time collaboration features, allowing multiple users to see each other and work on the same notebook simultaneously. For more details, see Figure 1 and the jupyterlab-collaboration package.

Figure 1: Collaborative mode in JupyterLab.

To access and manage the group server:

  1. Access the Hub Control Panel:

    • Within the JupyterLab interface, click File->Hub Control Panel.
  2. Select the Admin Panel:

    • In the top navigation bar of the Hub Control Panel, select the Admin Panel as shown in Figure 2.
  3. Locate the Group User:

    • Within the Admin Panel, look for the user named group{n}, where {n} corresponds to the group number.
  4. Manage the Group Server:

    • If the group server has not started:
      • Click the action button labeled Spawn Page to select the server options with higher resource limits.
      • If you click the action button labeled Start Server, the server will start with lower resource limits that apply to individual user servers.
    • If the group server is already running:
      • Click the action button labeled Access Server to access the currently running server.
      • If necessary, click the action button labeled Stop Server to terminate the existing server.
Figure 2: Admin panel for managing the group server.

To facilitate file transfer and sharing among members, each member of a project group can access the group home directory from their individual user servers:

  • Accessing from a terminal:
    • Members can access the group home directory by navigating to the mounted path in a terminal app in JupyterLab or VSCode interface.
    • For example, if the group is group0, they can use the following command in the terminal:
      cd /group0
  • Accessing via a soft link
    • To make it easier to access the group home directory from the JupyterLab file explorer, members can create a soft link.
    • This can be done using the ln -s command. For example, if the group is group1, they can create a soft link named group_home in the user home directory:
      ln -s /group1 ~/group_home
      Refresh the file browser to see the group_home folder in the JupyterLab file explorer as shown in Figure 3.
Figure 3: Group home directory mounted in a member's server.
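If you are unfamiliar with soft links, you can try the mechanics safely in a scratch directory first; the paths below are illustrative stand-ins for the real group directory:

```shell
# Create a stand-in for the group home directory and link to it (illustrative paths)
mkdir -p /tmp/demo_group
ln -sf /tmp/demo_group /tmp/group_home    # -f replaces an existing link, if any
readlink /tmp/group_home                  # prints the link target: /tmp/demo_group
```

Deleting a soft link with `rm` removes only the link, not the directory it points to.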

Custom Packages

To ensure the reproducibility of your results, you are required to use programming instead of the WEKA graphical interface to complete the project. Specifically, you can access WEKA’s tools through the python-weka-wrapper3 module, which allows you to use Python instead of Java. You can also install additional packages using the conda install or pip install commands.
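A minimal sketch of accessing WEKA through python-weka-wrapper3 might look like the following. It assumes the package and a Java runtime are installed, and "data.arff" is an illustrative file name, not a provided dataset:

```python
# Sketch: load an ARFF file, train a decision tree, and cross-validate it with WEKA.
# Degrades gracefully when python-weka-wrapper3 is not installed.
try:
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random

    jvm.start()
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("data.arff")
    data.class_is_last()                       # treat the last attribute as the class

    clf = Classifier(classname="weka.classifiers.trees.J48")   # C4.5 decision tree
    evaluation = Evaluation(data)
    evaluation.crossvalidate_model(clf, data, 10, Random(1))   # 10-fold cross-validation
    print(evaluation.summary())
    jvm.stop()
except ImportError:
    print("python-weka-wrapper3 is not installed in this environment")
```

Note that the JVM must be started before any WEKA class is used and stopped once at the end of the session.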

%%ai chatgpt -f text
What are the pros and cons of conda install vs pip install?

The installation might not persist after restarting the Jupyter server because the default environment is not saved permanently. To keep the installation, create a conda environment in your home directory, which will be saved permanently.

For instance, if you would like to use xgboost and python-weka-wrapper3 in the same notebook, run the following to create a conda environment:[1]

myenv=myenvname
cat <<EOF > /tmp/myenv.yaml && mamba env create -n "${myenv}" -f /tmp/myenv.yaml
dependencies:
  - python=3.11
  - pip
  - ipykernel
  - xgboost
  - pip:
    - python-weka-wrapper3
EOF

where myenvname can be any valid environment name.

Afterwards, you can create a kernel using the command:[2]

conda activate ${myenv}
python -m ipykernel install \
    --user \
    --name "${myenv}" --display-name "${myenv}"

Reload the browser window for the kernel to take effect.
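To check that the kernelspec was registered, you can list the installed kernels; the new environment name should appear in the output (the fallback message below is only for machines without jupyter on the PATH):

```shell
# List registered kernelspecs; falls back to a message if jupyter is unavailable
command -v jupyter >/dev/null 2>&1 && jupyter kernelspec list || echo "jupyter not on PATH"
```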

%%ai chatgpt -f text
How can I create a conda environment that inherits all the packages from the base
environment? Will this take a long time and create duplicate files?

Footnotes
  1. See the documentation for more details on managing conda environments.

  2. See the documentation for more details on creating kernels for conda environments.