This program uses COVID-19 data published by the European Centre for Disease Prevention and Control (ECDC) to analyze trends in cases and deaths. Although the analysis is not novel, it gives insight into my current Python programming and data processing skills.

author: Nathanael Bowley (github @nathanbowley98)
course: NESC3505
source: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
retrieved: 10/30/2020
version 2: retrieved 12/18/2020
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import matplotlib.dates as mdates
df = pd.read_csv("COVID-19.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61900 entries, 0 to 61899
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   dateRep                  61900 non-null  object
 1   day                      61900 non-null  int64
 2   month                    61900 non-null  int64
 3   year                     61900 non-null  int64
 4   cases                    61900 non-null  int64
 5   deaths                   61900 non-null  int64
 6   countriesAndTerritories  61900 non-null  object
 7   geoId                    61625 non-null  object
 8   countryterritoryCode     61777 non-null  object
 9   popData2019              61777 non-null  float64
 10  continentExp             61900 non-null  object
 11  Cumulative_number_for_14_days_of_COVID-19_cases_per_100000  59021 non-null  float64
dtypes: float64(2), int64(5), object(5)
memory usage: 5.7+ MB
df.head()
index | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2019 | continentExp | Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12/14/2020 | 14 | 12 | 2020 | 746 | 6 | Afghanistan | AF | AFG | 38041757.0 | Asia | 9.013779 |
1 | 12/13/2020 | 13 | 12 | 2020 | 298 | 9 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.052776 |
2 | 12/12/2020 | 12 | 12 | 2020 | 113 | 11 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.868768 |
3 | 12/11/2020 | 11 | 12 | 2020 | 63 | 10 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.134266 |
4 | 12/10/2020 | 10 | 12 | 2020 | 202 | 16 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.968658 |
Do we have any null or na values? How many?
df.isna().sum()
dateRep                                                          0
day                                                              0
month                                                            0
year                                                             0
cases                                                            0
deaths                                                           0
countriesAndTerritories                                          0
geoId                                                          275
countryterritoryCode                                           123
popData2019                                                    123
continentExp                                                     0
Cumulative_number_for_14_days_of_COVID-19_cases_per_100000    2879
dtype: int64
We need to remove the null / NA values. We can do this with df.dropna(), which by default drops every row that has a missing value in any column.
# how and thresh cannot both be passed in recent pandas versions, so only how is kept
df = df.dropna(axis=0, how='any')
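Dropping on every column also discards rows whose only gap is in a column the analysis never uses, such as the 14-day cumulative figure. The sketch below, on a made-up toy frame, compares the default behaviour with a `subset` restricted to the columns that matter; the choice of `cases` and `deaths` as the required columns is an assumption for illustration, not something the analysis above does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the COVID-19 data (all values are made up).
toy = pd.DataFrame({
    "cases": [10, 20, 30, 40],
    "deaths": [1, 2, 3, 4],
    "geoId": ["AA", None, "CC", "DD"],
    "cumulative_14d": [np.nan, 5.0, 6.0, np.nan],
})

# Default behaviour: drop a row if ANY column has a missing value.
strict = toy.dropna()

# Only require the columns assumed to matter for the analysis.
lenient = toy.dropna(subset=["cases", "deaths"])

print(len(toy), len(strict), len(lenient))
```

Whether the lenient version is appropriate depends on which columns later steps rely on.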
Rechecking the dataframe head and the number of null / NA values:
df.head()
index | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2019 | continentExp | Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12/14/2020 | 14 | 12 | 2020 | 746 | 6 | Afghanistan | AF | AFG | 38041757.0 | Asia | 9.013779 |
1 | 12/13/2020 | 13 | 12 | 2020 | 298 | 9 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.052776 |
2 | 12/12/2020 | 12 | 12 | 2020 | 113 | 11 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.868768 |
3 | 12/11/2020 | 11 | 12 | 2020 | 63 | 10 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.134266 |
4 | 12/10/2020 | 10 | 12 | 2020 | 202 | 16 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.968658 |
df.isna().sum()
dateRep                                                       0
day                                                           0
month                                                         0
year                                                          0
cases                                                         0
deaths                                                        0
countriesAndTerritories                                       0
geoId                                                         0
countryterritoryCode                                          0
popData2019                                                   0
continentExp                                                  0
Cumulative_number_for_14_days_of_COVID-19_cases_per_100000    0
dtype: int64
I want to see the cases:deaths ratio for each country on each day, to check whether certain countries had, or still have, a higher death ratio trend:
df.loc[:, "case_death_ratio"] = df.cases / df.deaths
df.head()
index | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2019 | continentExp | Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 | case_death_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12/14/2020 | 14 | 12 | 2020 | 746 | 6 | Afghanistan | AF | AFG | 38041757.0 | Asia | 9.013779 | 124.333333 |
1 | 12/13/2020 | 13 | 12 | 2020 | 298 | 9 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.052776 | 33.111111 |
2 | 12/12/2020 | 12 | 12 | 2020 | 113 | 11 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.868768 | 10.272727 |
3 | 12/11/2020 | 11 | 12 | 2020 | 63 | 10 | Afghanistan | AF | AFG | 38041757.0 | Asia | 7.134266 | 6.300000 |
4 | 12/10/2020 | 10 | 12 | 2020 | 202 | 16 | Afghanistan | AF | AFG | 38041757.0 | Asia | 6.968658 | 12.625000 |
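The daily ratio is undefined on days with no reported deaths: pandas returns inf when cases are positive and NaN when both counts are zero. A minimal sketch of one way to handle this, assuming NaN is the preferred marker for "no ratio available", is shown below; the helper name `case_death_ratio` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def case_death_ratio(cases: pd.Series, deaths: pd.Series) -> pd.Series:
    """Return cases / deaths, with NaN wherever deaths is zero."""
    # Replacing 0 with NaN turns n/0 (inf) into NaN and keeps 0/0 as NaN.
    return cases / deaths.replace(0, np.nan)

# Toy data: a normal day, a day with cases but no deaths, and an empty day.
toy = pd.DataFrame({"cases": [746, 5, 0], "deaths": [6, 0, 0]})
toy["case_death_ratio"] = case_death_ratio(toy["cases"], toy["deaths"])
print(toy)
```

Keeping NaN rather than inf means that later aggregations such as `mean()` skip those days by default instead of returning inf.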
Get all the areas in the data:
areas = df.countriesAndTerritories.unique()
Recursive function to print all the countries. Note that recursion is not faster than a for loop in Python; it is used here to walk the list one index at a time.
def recursive_countries(list, n):
    """Print every entry of `list` from index n onward, one per line."""
    # base case to start
    if n == 0:
        print("\nThese are all the countries you can choose from:")
    # base case to escape
    if n == len(list):
        print("No more countries to show!")
    # recursive case
    else:
        print("•", list[n])
        return recursive_countries(list, n + 1)
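For comparison, the same listing can be produced with a plain loop, which adds no stack frame per country and so is not bounded by Python's recursion limit (1000 by default in CPython). This is only a sketch of an alternative, not a replacement for the function above; the name `format_countries` is hypothetical, and it returns the text instead of printing it so the result is easy to check.

```python
def format_countries(names):
    """Build the same listing as recursive_countries, using a plain loop."""
    # The leading empty string reproduces the blank line before the header.
    lines = ["", "These are all the countries you can choose from:"]
    for name in names:
        lines.append("• " + str(name))
    lines.append("No more countries to show!")
    return "\n".join(lines)

# Illustrative input; in the notebook this would be the `areas` array.
print(format_countries(["Afghanistan", "Albania", "Canada"]))
```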
country = ""
while True:
    print("Enter in a country name: (Type areas for list of areas)")
    user_in = input()

    if user_in == 'areas':
        recursive_countries(list=areas, n=0)

    elif user_in in areas:
        # user input is acceptable
        country = user_in
        print("You have chosen", country + ", nice choice! Lets see the data.")

        # .copy() so the date conversion below works on an independent frame
        country_df = df[df.countriesAndTerritories == country].copy()

        # fix the order of the data (the file lists the newest rows first)
        country_df = country_df.iloc[::-1]

        print("This is the head of the data for " + country)
        print(country_df.head())

        # parse the dates so matplotlib treats the x axis as real dates
        country_df["dateRep"] = pd.to_datetime(country_df["dateRep"], format="%m/%d/%Y")

        fig, ax = plt.subplots(nrows=2, ncols=1)
        fig.suptitle("Cases and Deaths for Data Date Range for " + country)

        ax[0].plot(country_df.dateRep, country_df.cases)
        ax[0].set_xlabel("Date since data collection in " + country + " (interval: 20 days)")
        ax[0].set_ylabel("Cases in " + country)

        ax[1].plot(country_df.dateRep, country_df.deaths)
        ax[1].set_xlabel("Date since data collection in " + country + " (interval: 20 days)")
        ax[1].set_ylabel("Deaths in " + country)

        for axis in ax:
            # each axis gets its own locator instance: one tick every 20 days
            axis.xaxis.set_major_locator(mdates.DayLocator(interval=20))
            axis.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
            axis.tick_params(axis="x", labelrotation=25)

        fig.tight_layout()
        plt.show(block=True)

    else:
        # anything that is neither "areas" nor a known country, including empty input
        print("Error: Incorrect country spelling. Type areas for list of countries and try again.")
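Reversing the rows with `iloc[::-1]` gives chronological order only because the file happens to list the newest rows first. A sketch of a more defensive approach, assuming the `dateRep` strings are in month/day/year form as in the tables above, is to parse the dates and sort on them. The helper name `prepare_country_frame` and the toy rows are hypothetical.

```python
import pandas as pd

def prepare_country_frame(df, country):
    """Return one country's rows with parsed dates, oldest first."""
    # .copy() so the column assignment does not touch the original frame.
    subset = df[df["countriesAndTerritories"] == country].copy()
    subset["dateRep"] = pd.to_datetime(subset["dateRep"], format="%m/%d/%Y")
    # Sorting on the parsed dates works regardless of the file's row order.
    return subset.sort_values("dateRep").reset_index(drop=True)

# Toy rows in mixed order (values are made up).
toy = pd.DataFrame({
    "dateRep": ["12/14/2020", "1/13/2020", "12/13/2020", "12/14/2020"],
    "cases": [746, 0, 298, 100],
    "countriesAndTerritories": ["Afghanistan", "Canada", "Afghanistan", "Canada"],
})

canada = prepare_country_frame(toy, "Canada")
print(canada)
```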
Enter in a country name: (Type areas for list of areas)
Canada
You have chosen Canada, nice choice! Lets see the data.
This is the head of the data for Canada
         dateRep  day  month  year  cases  deaths countriesAndTerritories  \
10652  1/13/2020   13      1  2020      0       0                  Canada
10651  1/14/2020   14      1  2020      0       0                  Canada
10650  1/15/2020   15      1  2020      0       0                  Canada
10649  1/16/2020   16      1  2020      0       0                  Canada
10648  1/17/2020   17      1  2020      0       0                  Canada

      geoId countryterritoryCode  popData2019 continentExp  \
10652    CA                  CAN   37411038.0      America
10651    CA                  CAN   37411038.0      America
10650    CA                  CAN   37411038.0      America
10649    CA                  CAN   37411038.0      America
10648    CA                  CAN   37411038.0      America

       Cumulative_number_for_14_days_of_COVID-19_cases_per_100000  \
10652                                                0.0
10651                                                0.0
10650                                                0.0
10649                                                0.0
10648                                                0.0

       case_death_ratio
10652               NaN
10651               NaN
10650               NaN
10649               NaN
10648               NaN
Enter in a country name: (Type areas for list of areas)
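The prompt only accepts an exact, case-sensitive match, so a small typo sends the user back to the full list. One possible refinement, sketched here with the standard library's `difflib.get_close_matches`, is to suggest the nearest known names. The helper `suggest_countries`, its `limit` parameter, the 0.6 cutoff, and the toy name list are all assumptions for illustration.

```python
import difflib

def suggest_countries(user_in, names, limit=3):
    """Return up to `limit` known names that closely resemble the input."""
    # Compare in lower case so "canada" and "Canada" are treated alike.
    lowered = {str(name).lower(): str(name) for name in names}
    matches = difflib.get_close_matches(
        user_in.lower(), list(lowered), n=limit, cutoff=0.6
    )
    # Map the lower-case matches back to their original spelling.
    return [lowered[m] for m in matches]

# Illustrative list; in the notebook this would be the `areas` array.
known = ["Canada", "Cameroon", "Chad", "United_States_of_America"]
print(suggest_countries("canda", known))
```

The suggestions could be printed alongside the existing error message, leaving the choice to the user.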