Data retrieved from: https://www.kaggle.com/steveahn/memory-test-on-drugged-islanders-data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('archive_2/Islander_data.csv')
When we display the DataFrame, we immediately see that it exposes the participants' names, which is almost always unacceptable. Let's fix that.
df.head()
| | first_name | last_name | age | Happy_Sad_group | Dosage | Drug | Mem_Score_Before | Mem_Score_After | Diff |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bastian | Carrasco | 25 | H | 1 | A | 63.5 | 61.2 | -2.3 |
| 1 | Evan | Carrasco | 52 | S | 1 | A | 41.6 | 40.7 | -0.9 |
| 2 | Florencia | Carrasco | 29 | H | 1 | A | 59.7 | 55.1 | -4.6 |
| 3 | Holly | Carrasco | 50 | S | 1 | A | 51.7 | 51.2 | -0.5 |
| 4 | Justin | Carrasco | 52 | H | 1 | A | 47.0 | 47.1 | 0.1 |
del df['first_name']
del df['last_name']
df.insert(0, 'Participant', [f"participant_{i}" for i in range(0, df.shape[0])])
df.head()
| | Participant | age | Happy_Sad_group | Dosage | Drug | Mem_Score_Before | Mem_Score_After | Diff |
|---|---|---|---|---|---|---|---|---|
| 0 | participant_0 | 25 | H | 1 | A | 63.5 | 61.2 | -2.3 |
| 1 | participant_1 | 52 | S | 1 | A | 41.6 | 40.7 | -0.9 |
| 2 | participant_2 | 29 | H | 1 | A | 59.7 | 55.1 | -4.6 |
| 3 | participant_3 | 50 | S | 1 | A | 51.7 | 51.2 | -0.5 |
| 4 | participant_4 | 52 | H | 1 | A | 47.0 | 47.1 | 0.1 |
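The anonymization step above can also be written so that the original frame is left untouched. The following is a minimal sketch, assuming a small made-up frame (`toy`, with invented names and ages) in place of the real CSV; `DataFrame.drop` returns a new frame instead of deleting columns in place.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real CSV (all values are made up).
toy = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy"],
    "last_name": ["Lee", "Ray", "Fox"],
    "age": [25, 52, 29],
})

# Drop both identifying columns in one call; `toy` itself is not modified.
anon = toy.drop(columns=["first_name", "last_name"])

# Add a sequential, non-identifying label as the first column.
anon.insert(0, "Participant", [f"participant_{i}" for i in range(len(anon))])

print(anon)
```

Keeping the raw frame around is only appropriate while it stays in memory; anything written to disk or shared should come from the anonymized copy.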
Let's now group the DataFrame by Happy/Sad condition, drug, and dosage, and look at the mean of each numeric column within every group.
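# numeric_only=True skips the string Participant column, which recent pandas versions refuse to average
df = df.groupby(by=['Happy_Sad_group', 'Drug', 'Dosage']).mean(numeric_only=True)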
df
| Happy_Sad_group | Drug | Dosage | age | Mem_Score_Before | Mem_Score_After | Diff |
|---|---|---|---|---|---|---|
| H | A | 1 | 36.000000 | 60.563636 | 59.954545 | -0.609091 |
| H | A | 2 | 36.181818 | 55.872727 | 57.990909 | 2.118182 |
| H | A | 3 | 41.363636 | 54.881818 | 78.463636 | 23.581818 |
| H | S | 1 | 35.636364 | 58.336364 | 59.045455 | 0.709091 |
| H | S | 2 | 38.181818 | 54.945455 | 56.000000 | 1.054545 |
| H | S | 3 | 39.636364 | 61.663636 | 59.963636 | -1.700000 |
| H | T | 1 | 36.909091 | 52.418182 | 53.709091 | 1.290909 |
| H | T | 2 | 48.272727 | 59.554545 | 61.136364 | 1.581818 |
| H | T | 3 | 39.727273 | 62.354545 | 58.927273 | -3.427273 |
| S | A | 1 | 43.833333 | 56.133333 | 57.275000 | 1.141667 |
| S | A | 2 | 42.090909 | 67.100000 | 76.745455 | 9.645455 |
| S | A | 3 | 40.818182 | 54.909091 | 76.609091 | 21.700000 |
| S | S | 1 | 41.636364 | 54.700000 | 58.763636 | 4.063636 |
| S | S | 2 | 36.272727 | 63.445455 | 60.881818 | -2.563636 |
| S | S | 3 | 33.181818 | 57.818182 | 55.227273 | -2.590909 |
| S | T | 1 | 38.454545 | 59.072727 | 55.300000 | -3.772727 |
| S | T | 2 | 38.636364 | 50.181818 | 50.900000 | 0.718182 |
| S | T | 3 | 44.800000 | 59.800000 | 59.950000 | 0.150000 |
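If the minimum and maximum of each group are of interest alongside the mean, `agg` can compute several statistics in one pass. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the Islander data), restricted to the Diff column for readability.

```python
import pandas as pd

# Hypothetical rows that mimic the grouping columns of the dataset.
toy = pd.DataFrame({
    "Happy_Sad_group": ["H", "H", "H", "S", "S", "S"],
    "Drug": ["A", "A", "A", "A", "A", "A"],
    "Dosage": [1, 1, 2, 1, 1, 2],
    "Diff": [-2.0, 4.0, 1.0, 0.5, 1.5, 3.0],
})

# One row per (group, drug, dosage); one column per requested statistic.
summary = (
    toy.groupby(["Happy_Sad_group", "Drug", "Dosage"])["Diff"]
       .agg(["min", "max", "mean"])
)

print(summary)
```

Storing the result under a new name such as `summary`, rather than overwriting the original frame, keeps the per-participant rows available for later steps.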
pandas exposes NumPy-style attributes on DataFrames and columns, so we can easily check the shape of the new DataFrame with shape and the type of data stored in the Diff column with dtype.
print('shape:', df.shape, 'data in Diff category:', df['Diff'].dtype)
shape: (18, 4) data in Diff category: float64
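The shape reports only four columns because the three grouping keys moved into the row index and are no longer counted as columns. A small sketch of the same effect, assuming a made-up two-key frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with two grouping keys and one numeric column.
toy = pd.DataFrame({
    "Drug": ["A", "A", "S", "S"],
    "Dosage": [1, 2, 1, 2],
    "Diff": [0.5, 1.5, -1.0, 2.0],
})
grouped = toy.groupby(["Drug", "Dosage"]).mean()

print(grouped.shape)          # (rows, value columns); index levels are excluded
print(grouped.ndim)           # a DataFrame is always two-dimensional
print(grouped["Diff"].dtype)  # element type of a single column
print(grouped.index.nlevels)  # number of grouping keys held in the index
```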
Lastly, it would be really nice to plot this data! Using seaborn, we set the theme to whitegrid and then call sns.boxplot(). The x and y arguments name the DataFrame columns to plot, hue splits each x category into one box per Drug, and data tells seaborn which DataFrame to read from. Note that df holds the group means at this point, so each box summarizes the three dosage-level means for that condition and drug rather than individual participants. Calling plt.show() displays the graph. This shows my experience with Seaborn categorical plots.
sns.set_theme(style='whitegrid')
sns.boxplot(x='Happy_Sad_group', y='Diff', hue='Drug', data=df.reset_index())
plt.show()
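The boxplot call is given df.reset_index() rather than df because seaborn looks up x and hue among the columns, and after a groupby the grouping keys sit in the index. A minimal sketch of that conversion, assuming a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the same grouping keys used for x and hue.
toy = pd.DataFrame({
    "Happy_Sad_group": ["H", "H", "S", "S"],
    "Drug": ["A", "T", "A", "T"],
    "Diff": [1.0, 2.0, 3.0, 4.0],
})
grouped = toy.groupby(["Happy_Sad_group", "Drug"]).mean()

# After grouping, only the value column is a regular column.
print(list(grouped.columns))

# reset_index() turns the index levels back into ordinary columns.
flat = grouped.reset_index()
print(list(flat.columns))
```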