Portfolio

In this example I demonstrate cleaning data to protect participant anonymity, computing grouped summary statistics with pandas, and plotting the results with seaborn.

Data retrieved from: https://www.kaggle.com/steveahn/memory-test-on-drugged-islanders-data

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

df = pd.read_csv('archive_2/Islander_data.csv')

Displaying the first rows of the DataFrame immediately shows that the data exposes participants' names, which is almost always unacceptable. Let's fix that.

In [3]:

df.head()

Out[3]:

	first_name	last_name	age	Happy_Sad_group	Dosage	Drug	Mem_Score_Before	Mem_Score_After	Diff
0	Bastian	Carrasco	25	H	1	A	63.5	61.2	-2.3
1	Evan	Carrasco	52	S	1	A	41.6	40.7	-0.9
2	Florencia	Carrasco	29	H	1	A	59.7	55.1	-4.6
3	Holly	Carrasco	50	S	1	A	51.7	51.2	-0.5
4	Justin	Carrasco	52	H	1	A	47.0	47.1	0.1

In [4]:

# Drop the identifying name columns
df = df.drop(columns=['first_name', 'last_name'])
# Add a sequential pseudonymous ID as the first column
df.insert(0, 'Participant', [f"participant_{i}" for i in range(len(df))])
df.head()

Out[4]:

	Participant	age	Happy_Sad_group	Dosage	Drug	Mem_Score_Before	Mem_Score_After	Diff
0	participant_0	25	H	1	A	63.5	61.2	-2.3
1	participant_1	52	S	1	A	41.6	40.7	-0.9
2	participant_2	29	H	1	A	59.7	55.1	-4.6
3	participant_3	50	S	1	A	51.7	51.2	-0.5
4	participant_4	52	H	1	A	47.0	47.1	0.1
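
The same anonymization step can be sketched on a tiny made-up table. The names and values below are hypothetical and only mirror the column layout above; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the islander data
raw = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Cy'],
    'last_name': ['Lee', 'Ray', 'Fox'],
    'age': [25, 52, 29],
    'Diff': [-2.3, -0.9, -4.6],
})

# Drop the identifying columns, then add a sequential pseudonymous ID up front
anon = raw.drop(columns=['first_name', 'last_name'])
anon.insert(0, 'Participant', [f"participant_{i}" for i in range(len(anon))])

print(anon.columns.tolist())
print(anon['Participant'].tolist())
```

Using drop() returns a new DataFrame and leaves raw untouched, which makes it easy to check that no name column survived before the original is discarded.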

Let's now group the DataFrame by Happy/Sad condition, drug, and dosage, and look at the mean of each numeric column within each group.

In [5]:

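# numeric_only=True skips the string 'Participant' column, which cannot be averaged
df = df.groupby(by=['Happy_Sad_group', 'Drug', 'Dosage']).mean(numeric_only=True)
df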

Out[5]:

			age	Mem_Score_Before	Mem_Score_After	Diff
Happy_Sad_group	Drug	Dosage
H	A	1	36.000000	60.563636	59.954545	-0.609091
		2	36.181818	55.872727	57.990909	2.118182
		3	41.363636	54.881818	78.463636	23.581818
	S	1	35.636364	58.336364	59.045455	0.709091
		2	38.181818	54.945455	56.000000	1.054545
		3	39.636364	61.663636	59.963636	-1.700000
	T	1	36.909091	52.418182	53.709091	1.290909
		2	48.272727	59.554545	61.136364	1.581818
		3	39.727273	62.354545	58.927273	-3.427273
S	A	1	43.833333	56.133333	57.275000	1.141667
		2	42.090909	67.100000	76.745455	9.645455
		3	40.818182	54.909091	76.609091	21.700000
	S	1	41.636364	54.700000	58.763636	4.063636
		2	36.272727	63.445455	60.881818	-2.563636
		3	33.181818	57.818182	55.227273	-2.590909
	T	1	38.454545	59.072727	55.300000	-3.772727
		2	38.636364	50.181818	50.900000	0.718182
		3	44.800000	59.800000	59.950000	0.150000
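
If the minimum and maximum of each group are also of interest, agg() can compute several statistics in one pass. This is a minimal sketch on made-up values: the numbers are hypothetical and only the column names follow the data above.

```python
import pandas as pd

# Hypothetical per-participant rows; only the column names match the dataset
scores = pd.DataFrame({
    'Happy_Sad_group': ['H', 'H', 'H', 'S', 'S', 'S'],
    'Drug': ['A', 'A', 'A', 'A', 'A', 'A'],
    'Dosage': [1, 1, 3, 1, 3, 3],
    'Diff': [-2.0, 1.0, 20.0, 0.5, 18.0, 22.0],
})

# One row per (group, drug, dosage); one column per requested statistic
summary = (
    scores.groupby(['Happy_Sad_group', 'Drug', 'Dosage'])['Diff']
    .agg(['min', 'max', 'mean'])
)
print(summary)
```

Selecting the Diff column before calling agg() keeps the result flat, with min, max, and mean as plain column names.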

The DataFrame's shape attribute gives the dimensions of the grouped result, and the dtype of the Diff column (a NumPy dtype) tells us what type of data it stores.

In [6]:

print('shape:', df.shape, 'data in Diff category:', df['Diff'].dtype)

shape: (18, 4) data in Diff category: float64
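
The same inspection can be sketched on a small made-up grouped frame. It also shows that the grouping keys end up in the index rather than in the columns, which is why shape counts only the value columns. All values here are hypothetical.

```python
import pandas as pd

# Hypothetical data with two grouping keys and two numeric value columns
toy = pd.DataFrame({
    'Happy_Sad_group': ['H', 'H', 'S', 'S'],
    'Drug': ['A', 'T', 'A', 'T'],
    'age': [30, 40, 50, 60],
    'Diff': [1.5, -0.5, 2.0, 0.0],
})
grouped = toy.groupby(['Happy_Sad_group', 'Drug']).mean(numeric_only=True)

print(grouped.shape)               # (number of groups, number of value columns)
print(grouped['Diff'].dtype)       # NumPy dtype of the Diff column
print(list(grouped.index.names))   # grouping keys are now index levels
```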

Lastly, it would be really nice to plot this data! Using seaborn, we first set the theme to whitegrid. Next we call sns.boxplot(), passing the columns we want as x and y and setting hue to Drug so that each Happy/Sad group is split into one box per drug. Because the grouping columns became the index in the previous step, we call reset_index() to turn them back into ordinary columns before passing the DataFrame as data. Each box therefore summarizes the three per-dosage mean Diff values for one drug within one group. Finally, calling plt.show() (from matplotlib.pyplot) displays the figure. This shows my experience with and mastery of seaborn categorical plots.

In [7]:

sns.set_theme(style='whitegrid')
sns.boxplot(x='Happy_Sad_group', y='Diff', hue='Drug', data=df.reset_index())
plt.show()

[Figure: box plot of Diff by Happy_Sad_group, split by Drug]
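
One alternative is to draw the boxes from individual rows instead of group means, so that each box reflects the spread across participants. The following is a minimal sketch under that assumption, using made-up per-participant values. The DataFrame name rows, the numbers, and the plot title are hypothetical, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical per-participant rows; only the column names match the dataset
rows = pd.DataFrame({
    'Happy_Sad_group': ['H', 'H', 'H', 'H', 'S', 'S', 'S', 'S'],
    'Drug': ['A', 'A', 'T', 'T', 'A', 'A', 'T', 'T'],
    'Diff': [2.0, 25.0, -1.0, 1.5, 3.0, 19.0, -4.0, 0.5],
})

sns.set_theme(style='whitegrid')
# x picks the categories, y the measured value, hue splits each category by drug
ax = sns.boxplot(data=rows, x='Happy_Sad_group', y='Diff', hue='Drug')
ax.set_title('Diff by group and drug (made-up data)')
```

sns.boxplot() returns the matplotlib Axes it drew on, so titles and labels can be adjusted afterwards; in a notebook, plt.show() then renders the figure.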