Pandas - CS1302

from __init__ import install_dependencies

await install_dependencies()

%reload_ext divewidgets

In this lab, we will analyze COVID-19 data using a powerful package called pandas.
The package name comes from panel data and Python for data analysis.

Loading CSV Files with Pandas

DATA.GOV.HK provides an API to retrieve historical data on COVID-19 cases in Hong Kong.

The following uses the urlencode function to create the URL of a CSV file containing the probable and confirmed cases of COVID-19 reported up to Aug 1st, 2020.

from urllib.parse import urlencode

url_data_gov_hk_get = "https://api.data.gov.hk/v1/historical-archive/get-file"
url_covid_csv = "http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv"
time = "20200801-1204"
url_covid = url_data_gov_hk_get + "?" + urlencode({"url": url_covid_csv, "time": time})

print(url_covid)
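
To see what urlencode does to each value, the following minimal sketch encodes a small dictionary. The example.com address is made up for illustration and is not one of the lab's data sources.

```python
from urllib.parse import urlencode

# Hypothetical parameters, chosen only to show how special characters are escaped.
params = {"url": "http://example.com/a.csv", "time": "20200801-1204"}
query = urlencode(params)

# ':' and '/' are percent-encoded, and the key=value pairs are joined by '&'.
print(query)
```

The resulting query string can be appended after "?" to a base URL, as done for url_covid above.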

def simple_encode(string):
    """Returns the string with : and / encoded to %3A and %2F respectively."""
    # YOUR CODE HERE
    raise NotImplementedError

# tests
assert simple_encode("http://www.chp.gov.hk/files/misc/enhanced_sur_covid_19_eng.csv") == "http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv"

Just as the built-in function open opens a file for reading, pandas provides a function read_csv that loads a CSV file into a DataFrame. The CSV file can even reside on the web:

import pandas as pd

df_covid = pd.read_csv(url_covid)

print(type(df_covid))
df_covid
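
read_csv is not limited to URLs. As a rough sketch, the following reads CSV text held in memory through io.StringIO, which can be handy for trying things out without a network connection. The CSV content is made up and is not taken from the COVID-19 dataset.

```python
import io

import pandas as pd

# Made-up CSV content for illustration only.
csv_text = "Case no.,Gender,Age\n1,M,39\n2,F,56\n3,M,27\n"

# read_csv accepts a file-like object as well as a path or URL.
df_toy = pd.read_csv(io.StringIO(csv_text))

print(type(df_toy))
print(df_toy.shape)  # (number of rows, number of columns)
```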

url_building_csv = "http://www.chp.gov.hk/files/misc/building_list_eng.csv"
time = "20200801-1203"
url_building = (
    url_data_gov_hk_get + "?" + urlencode({"url": url_building_csv, "time": time})
)
# YOUR CODE HERE
raise NotImplementedError
df_building

# tests
assert all(df_building.columns == [
    "District",
    "Building name",
    "Last date of residence of the case(s)",
    "Related probable/confirmed cases"])  # check column names

Selecting and Removing Columns

We can obtain the column labels of a DataFrame using its columns attribute.

df_covid.columns
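
The tests in this lab compare columns with a list of labels and wrap the result in all. A small sketch on made-up data suggests why: comparing the column labels with a list of the same length is done element by element.

```python
import pandas as pd

# Made-up data for illustration only.
df_toy = pd.DataFrame({"Gender": ["M", "F"], "Age": [39, 56]})

labels = df_toy.columns  # an Index object holding the column labels
matches = labels == ["Gender", "Age"]  # element-wise comparison, one boolean per label

print(list(labels))
print(all(matches))  # True only if every label matches
```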

Using the indexing operator [], a column of a DataFrame can be returned as a Series object, which is essentially a one-dimensional array with a name and labelled entries.
We can further use the method value_counts to return the counts of the different values as another Series object.

# return the number of male and female cases
series_gender_counts = df_covid["Gender"].value_counts()

print(type(series_gender_counts))
series_gender_counts
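
The same steps can be tried on a small made-up DataFrame, where the counts are easy to verify by eye. This is only a sketch; the values are not from the COVID-19 dataset.

```python
import pandas as pd

# Made-up data for illustration only.
df_toy = pd.DataFrame({"Gender": ["M", "F", "M", "M"], "Age": [39, 56, 27, 61]})

gender = df_toy["Gender"]  # a Series holding one column
counts = gender.value_counts()  # a Series indexed by the distinct values

print(counts["M"])  # look up a count by its label
print(counts["F"])
```

Because the result is indexed by the distinct values, a single count can be looked up by its label, and several at once by passing a list of labels.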

# YOUR CODE HERE
raise NotImplementedError
series_district_counts

# tests
assert all(series_district_counts[["Wong Tai Sin", "Kwun Tong"]] == [313, 212])

In df_covid, it appears that the column Name of hospital admitted contains no information. We can confirm this by

returning the column as a Series with df_covid["Name of hospital admitted"], and
printing an array of unique column values using the method unique.

df_covid["Name of hospital admitted"].unique()
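
As a complementary check, and assuming the uninformative entries are stored as missing values, isna combined with all tests whether a column is entirely empty. The sketch below uses a made-up DataFrame with a hypothetical column named Hospital.

```python
import pandas as pd

# Made-up data: the hypothetical "Hospital" column has no recorded values.
df_toy = pd.DataFrame({"Gender": ["M", "F", "M"], "Hospital": [None, None, None]})

print(df_toy["Hospital"].unique())  # distinct values, missing ones included

# True if every entry in the column is missing.
is_empty = bool(df_toy["Hospital"].isna().all())
print(is_empty)
```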

# YOUR CODE HERE
raise NotImplementedError
df_covid

# tests
assert all(df_covid.columns == [
    "Case no.",
    "Report date",
    "Date of onset",
    "Gender",
    "Age",
    "Hospitalised/Discharged/Deceased",
    "HK/Non-HK resident",
    "Case classification*",
    "Confirmed/probable"])  # check column names

Selecting Rows of DataFrame

We can select the confirmed male cases using the attribute loc and the indexing operator [].

df_confirmed_male = df_covid.loc[
    (df_covid["Confirmed/probable"] == "Confirmed") & (df_covid["Gender"] == "M")
]
print(type(df_covid.loc))
df_confirmed_male
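
The expression inside loc[...] is a boolean Series with one entry per row. A minimal sketch on made-up data makes the intermediate mask visible; the variable names mask and df_selected are chosen here for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df_toy = pd.DataFrame(
    {
        "Gender": ["M", "F", "M", "F"],
        "Confirmed/probable": ["Confirmed", "Confirmed", "Probable", "Confirmed"],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds more tightly than ==.
mask = (df_toy["Confirmed/probable"] == "Confirmed") & (df_toy["Gender"] == "M")

df_selected = df_toy.loc[mask]  # keep only the rows where mask is True
print(mask.tolist())
print(len(df_selected))
```

Naming the mask separately can make longer conditions easier to read and to reuse.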

# YOUR CODE HERE
raise NotImplementedError
df_confirmed_local

# tests
assert set(df_confirmed_local["Case classification*"].unique()) == {
    "Epidemiologically linked with local case",
    "Local case"}

def case_counts(district):
    # YOUR CODE HERE
    raise NotImplementedError

# tests
assert case_counts("Kwai Tsing") == 109