import os

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

Requirements

Project 2 is a group project where each group should identify a relevant topic that involves all the following elements:

  1. An objective that involves extracting knowledge from a real-world dataset.
  2. A data preprocessing step that prepares the data for mining.
  3. A learning algorithm that extracts knowledge from the data.
  4. An evaluation of the mined knowledge.
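As a toy illustration of how these four elements fit together, here is a self-contained sketch in pure Python. The data, the min-max scaling step, and the nearest-mean rule are all made up for illustration; they are not a requirement for your project:

```python
# 1. Objective: predict a binary label from one feature.
#    A hard-coded list of (feature, label) pairs stands in for a real-world dataset.
raw = [(1.0, 0), (1.2, 0), (0.9, 0), (3.1, 1), (2.9, 1), (3.0, 1)]

# 2. Preprocessing: rescale the feature to the range [0, 1].
xs = [x for x, _ in raw]
lo, hi = min(xs), max(xs)
data = [((x - lo) / (hi - lo), y) for x, y in raw]

# 3. Learning: a nearest-mean classifier (store the mean feature per class).
means = {c: sum(x for x, y in data if y == c) / sum(1 for _, y in data if y == c)
         for c in {y for _, y in data}}

def predict(x):
    return min(means, key=lambda c: abs(x - means[c]))

# 4. Evaluation: accuracy on the (training) data.
accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

A real project would of course use a genuine dataset, a proper train/test split, and a learning algorithm from a library such as WEKA.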

There are several tasks associated with Project 2:

  1. Presentation video: Submit one 15-minute video per group to the Group presentation assignment on Canvas.
  2. Peer Reviews of Group Presentations: Each student will be assigned 3 group presentations to review from the Group presentation assignment on Canvas.
  3. Report: Submit one report per group to the Group report assignment on Canvas.

The project is worth 15 points, which accounts for 15% of the entire course assessment. The assessment is divided into 5 categories, each of which is scored on a scale of 0-3 points:

  • 3: Excellent
  • 2: Satisfactory
  • 1: Unsatisfactory
  • 0: Incomplete

Data Sources

Here are some websites that provide trustworthy real-world datasets:

The following sources of data are widely popular, but their datasets may be synthetic or heavily studied. If you plan to use datasets from these sources, you will need to put extra effort into the following aspects:

  1. Verify the dataset’s authenticity by citing trustworthy and original sources properly.
  2. Clearly differentiate your approach and results from existing works.

There are many other ways to find reliable data sources. For instance, you can use Google Dataset Search to locate datasets, or ask large language models (LLMs) for concrete examples.

%%ai chatgpt -f markdown
I am doing a group project in a data mining course. Explain how I can find
a good real-world dataset and the corresponding data mining objective? Give me
a list of 10 examples. (Do not use any headers in your reply.)

Group Server

Members of the same group can collaboratively work on the same notebook using a group (Jupyter) server that has higher resource limits than the individual user servers:

  • Storage: 100GB
  • Memory: 100GB
  • CPU: 32 cores for default servers without GPU, 8 cores for GPU servers
  • GPU: 48GB for GPU servers

Group servers also run JupyterLab in collaborative mode, which provides real-time collaboration features, allowing multiple users to see each other and work on the same notebook simultaneously. For more details, see Figure 1 and the jupyterlab-collaboration package.

Figure 1: Collaborative mode in JupyterLab.

To access and manage the group server:

  1. Access the Hub Control Panel:

    • Within the JupyterLab interface, click File->Hub Control Panel.
  2. Select the Admin Panel:

    • In the top navigation bar of the Hub Control Panel, select the Admin Panel as shown in Figure 2.
  3. Locate the Group User:

    • Within the Admin Panel, look for the user named group{n}, where {n} corresponds to the group number.
  4. Manage the Group Server:

    • If the group server has not started:
      • Click the action button labeled Spawn Page to select the server options with higher resource limits.
      • If you click the action button labeled Start Server, the server will start with lower resource limits that apply to individual user servers.
    • If the group server is already running:
      • Click the action button labeled Access Server to access the currently running server.
      • If necessary, click the action button labeled Stop Server to terminate the existing server.
Figure 2: Admin panel for managing the group server.

To facilitate file transfer and sharing among members, each member of a project group can access the group home directory from their individual user servers:

  • Accessing from a terminal:
    • Members can access the group home directory by navigating to the mounted path in a terminal app in JupyterLab or VSCode interface.
    • For example, if the group is group0, they can use the following command in the terminal:
      cd /group0
  • Accessing via a soft link
    • To make it easier to access the group home directory from the JupyterLab file explorer, members can create a soft link.
    • This can be done using the ln -s command. For example, if the group is group1, they can create a soft link named group_home in the user home directory:
      ln -s /group1 ~/group_home
      Refresh the file browser to see the group_home folder in the JupyterLab file explorer as shown in Figure 3.
Figure 3: Group home directory mounted in a member's server.
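If you are unfamiliar with soft links, you can try the mechanics safely in a scratch directory first; the paths below are illustrative stand-ins for the real group directory:

```shell
# Create a stand-in for the group home directory and link to it (illustrative paths)
mkdir -p /tmp/demo_group
ln -sf /tmp/demo_group /tmp/group_home    # -f replaces an existing link, if any
readlink /tmp/group_home                  # prints the link target: /tmp/demo_group
```

Deleting a soft link with `rm` removes only the link, not the directory it points to.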

Custom Packages

To ensure the reproducibility of your results, you are required to use programming instead of the WEKA graphical interface to complete the project. Specifically, you can access WEKA’s tools through the python-weka-wrapper3 module, which allows you to use Python instead of Java. You can also install additional packages using the conda install or pip install commands.
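A minimal sketch of accessing WEKA through python-weka-wrapper3 might look like the following. It assumes the package and a Java runtime are installed, and "data.arff" is an illustrative file name, not a provided dataset:

```python
# Sketch: load an ARFF file, train a decision tree, and cross-validate it with WEKA.
# Degrades gracefully when python-weka-wrapper3 is not installed.
try:
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random

    jvm.start()
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("data.arff")
    data.class_is_last()                       # treat the last attribute as the class

    clf = Classifier(classname="weka.classifiers.trees.J48")   # C4.5 decision tree
    evaluation = Evaluation(data)
    evaluation.crossvalidate_model(clf, data, 10, Random(1))   # 10-fold cross-validation
    print(evaluation.summary())
    jvm.stop()
except ImportError:
    print("python-weka-wrapper3 is not installed in this environment")
```

Note that the JVM must be started before any WEKA class is used and stopped once at the end of the session.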

%%ai chatgpt -f text
What are the pros and cons of conda install vs pip install?

The installation might not persist after restarting the Jupyter server because the default environment is not saved permanently. To keep the installation, create a conda environment in your home directory, which will be saved permanently.

For instance, if you would like to use xgboost and python-weka-wrapper3 in the same notebook, run the following to create a conda environment:[1]

myenv=myenvname
cat <<EOF > /tmp/myenv.yaml && mamba env create -n "${myenv}" -f /tmp/myenv.yaml
dependencies:
  - python=3.11
  - pip
  - ipykernel
  - xgboost
  - pip:
    - python-weka-wrapper3
EOF

where myenvname can be any valid environment name.

Afterwards, you can create a kernel using the command:[2]

conda activate ${myenv}
python -m ipykernel install \
    --user \
    --name "${myenv}" --display-name "${myenv}"

Reload the browser window for the kernel to take effect.
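To check that the kernelspec was registered, you can list the installed kernels; the new environment name should appear in the output (the fallback message below is only for machines without jupyter on the PATH):

```shell
# List registered kernelspecs; falls back to a message if jupyter is unavailable
command -v jupyter >/dev/null 2>&1 && jupyter kernelspec list || echo "jupyter not on PATH"
```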

%%ai chatgpt -f text
How can I create a conda environment that inherits all the packages from the base
environment? Will this take a long time and create duplicate files?

Footnotes
  1. See the documentation for more details on managing conda environments.

  2. See the documentation for more details on creating kernels for conda environments.