Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it:
ID Age BMI Risk Factor
PT 6 48 19.3 4
PT 8 43 20.9 NaN
PT 2 39 18.1 3
PT 9 41 19.5 NaN
Some of my columns contain NaN values which I do not want to include into the z-score calculations so I intend to use a solution offered to this question: how to zscore normalize pandas column with nans?
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
I'm interested in applying this solution to all of my columns except the ID column to produce a new dataframe which I can save as an Excel file using
df2.to_excel("Z-Scores.xlsx")
So basically; how can I compute z-scores for each column (ignoring NaN values) and push everything into a new dataframe?
SIDENOTE: there is a concept in pandas called "indexing" which intimidates me because I do not understand it well. If indexing is a crucial part of solving this problem, please dumb down your explanation of indexing.
Using Scipy's zscore function:
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
| | A | B | C |
|---:|----:|----:|----:|
| 0 | 163 | 163 | 159 |
| 1 | 120 | 153 | 181 |
| 2 | 130 | 199 | 108 |
| 3 | 108 | 188 | 157 |
| 4 | 109 | 171 | 119 |
from scipy.stats import zscore
df.apply(zscore)
| | A | B | C |
|---:|----------:|----------:|----------:|
| 0 | 1.83447 | -0.708023 | 0.523362 |
| 1 | -0.297482 | -1.30804 | 1.3342 |
| 2 | 0.198321 | 1.45205 | -1.35632 |
| 3 | -0.892446 | 0.792025 | 0.449649 |
| 4 | -0.842866 | -0.228007 | -0.950897 |
If not all the columns of your data frame are numeric, then you can apply the Z-score function only to the numeric columns using the select_dtypes function:
# Note that `select_dtypes` returns a data frame. We are selecting only the columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].apply(zscore)
| | A | B | C |
|---:|----------:|----------:|----------:|
| 0 | 1.83447 | -0.708023 | 0.523362 |
| 1 | -0.297482 | -1.30804 | 1.3342 |
| 2 | 0.198321 | 1.45205 | -1.35632 |
| 3 | -0.892446 | 0.792025 | 0.449649 |
| 4 | -0.842866 | -0.228007 | -0.950897 |
Build a list from the columns and remove the column you don't want to calculate the Z score for:
In [66]:
cols = list(df.columns)
cols.remove('ID')
df[cols]
Out[66]:
Age BMI Risk Factor
0 6 48 19.3 4
1 8 43 20.9 NaN
2 2 39 18.1 3
3 9 41 19.5 NaN
In [68]:
# now iterate over the remaining columns and create a new zscore column
for col in cols:
col_zscore = col + '_zscore'
df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
Out[68]:
ID Age BMI Risk Factor Age_zscore BMI_zscore Risk_zscore \
0 PT 6 48 19.3 4 -0.093250 1.569614 -0.150946
1 PT 8 43 20.9 NaN 0.652753 0.074744 1.459148
2 PT 2 39 18.1 3 -1.585258 -1.121153 -1.358517
3 PT 9 41 19.5 NaN 1.025755 -0.523205 0.050315
Factor_zscore
0 1
1 NaN
2 -1
3 NaN
If you want to calculate the zscore for all of the columns, you can just use the following:
df_zscore = (df - df.mean())/df.std()
–
Here's other way of getting Zscore using custom function:
In [6]: import pandas as pd; import numpy as np
In [7]: np.random.seed(0) # Fixes the random seed
In [8]: df = pd.DataFrame(np.random.randn(5,3), columns=["randomA", "randomB","randomC"])
In [9]: df # watch output of dataframe
Out[9]:
randomA randomB randomC
0 1.764052 0.400157 0.978738
1 2.240893 1.867558 -0.977278
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
## Create custom function to compute Zscore
In [10]: def z_score(df):
....: df.columns = [x + "_zscore" for x in df.columns.tolist()]
....: return ((df - df.mean())/df.std(ddof=0))
....:
## make sure you filter or select columns of interest before passing dataframe to function
In [11]: z_score(df) # compute Zscore
Out[11]:
randomA_zscore randomB_zscore randomC_zscore
0 0.798350 -0.106335 0.731041
1 1.505002 1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464 1.292230
4 -0.688061 -0.494655 0.099824
Result reproduced using scipy.stats zscore
In [12]: from scipy.stats import zscore
In [13]: df.apply(zscore) # (Credit: Manuel)
Out[13]:
randomA randomB randomC
0 0.798350 -0.106335 0.731041
1 1.505002 1.939828 -1.577295
2 -0.407899 -0.875374 -0.545799
3 -1.207392 -0.463464 1.292230
4 -0.688061 -0.494655 0.099824
for Z score, we can stick to documentation instead of using 'apply' function
from scipy.stats import zscore
df_zscore = zscore(cols as array, axis=1)
The almost one-liner solution:
df2 = (df.ix[:,1:] - df.ix[:,1:].mean()) / df.ix[:,1:].std()
df2['ID'] = df['ID']
Another way is to call StandardScaler() from scikit-learn. Simply instantiate StandardScaler and call fit_transform using the relevant columns as input. The result is a numpy array which you can assign back to the dataframe as new columns (or work on the array itself etc.).
from sklearn.preprocessing import StandardScaler
cols = ['col1', 'col2']
new_cols = [f"{c}_zscore" for c in cols]
sc = StandardScaler()
df[new_cols] = sc.fit_transform(df[cols])
When we are dealing with time-series, calculating z-scores (or anomalies - not the same thing, but you can adapt this code easily) is a bit more complicated. For example, you have 10 years of temperature data measured weekly. To calculate z-scores for the whole time-series, you have to know the means and standard deviations for each day of the year. So, let's get started:
Assume you have a pandas DataFrame. First of all, you need a DateTime index. If you don't have it yet, but luckily you do have a column with dates, just make it as your index. Pandas will try to guess the date format. The goal here is to have DateTimeIndex. You can check it out by trying:
type(df.index)
If you don't have one, let's make it.
df.index = pd.DatetimeIndex(df[datecolumn])
df = df.drop(datecolumn,axis=1)
Next step is to calculate mean and standard deviation for each group of days. For this, we use the groupby method.
mean = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanmean)
std = pd.groupby(df,by=[df.index.dayofyear]).aggregate(np.nanstd)
Finally, we loop through all the dates, performing the calculation (value - mean)/stddev; however, as mentioned, for time-series this is not so straightforward.
df2 = df.copy() #keep a copy for future comparisons
for y in np.unique(df.index.year):
for d in np.unique(df.index.dayofyear):
df2[(df.index.year==y) & (df.index.dayofyear==d)] = (df[(df.index.year==y) & (df.index.dayofyear==d)]- mean.ix[d])/std.ix[d]
df2.index.name = 'date' #this is just to look nicer
df2 #this is your z-score dataset.
The logic inside the for loops is: for a given year we have to match each dayofyear to its mean and stdev. We run this for all the years in your time-series.
To calculate a z-score for an entire column quickly, do as follows:
from scipy.stats import zscore
import pandas as pd
df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
df['num_1_zscore'] = zscore(df['num_1'])
display(df)
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.