python - Dask how to pivot DataFrame

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I am using the code below but get an error after pivoting the DataFrame:

dataframe:

    name        day        value    time
0   MAC000002   2012-12-16  0.147   09:30:00
1   MAC000002   2012-12-16  0.110   10:00:00
2   MAC000002   2012-12-16  0.736   10:30:00
3   MAC000002   2012-12-16  0.404   11:00:00
4   MAC000003   2012-12-16  0.845   00:30:00
Read in data, and pivot
ddf = dd.read_csv('data.csv')
#I added this but didnt fix the error below
ddf.index.name = 'index'
#dask requires string as category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] =ddf['name'].cat.as_known()
#pivot the table
df = ddf.pivot_table(columns='name', values='value', index='index')
df.head()
#KeyError: 'index'
Expected result (with or without index) - pivot rows to columns without any value modification:
MAC000002  MAC000003  ...
0.147      0.845
0.110      ...
0.736      ...
0.404      ...
Any idea why I am getting a KeyError 'index' and how I can overcome this?
                @Cezary.Sz KeyError                                  Traceback (most recent call last) <ipython-input-31-c42a15b2c7cf> in <module>() ----> 1 df.head()
– proximacentauri
                Oct 15, 2018 at 1:00
                Instead of naming the index try ddf.reset_index() so that you have a column named 'index'.
– Soren
                May 4, 2021 at 1:31
According to the docs for pivot_table, value of index kwarg should refer to an existing column, so instead of setting name to the index, a column should be created with the desired index value:
# ddf.index.name = 'index'
ddf['index'] = ddf.index
Note that this assumes that the index is what you are really pivoting by.
Below is a reproducible snippet:
data = """
   | name       | day       | value |   time
0  | MAC000002  | 2012-12-16|  0.147|   09:30:00
1  | MAC000002  | 2012-12-16|  0.110|   10:00:00
2  | MAC000002  | 2012-12-16|  0.736|   10:30:00
3  | MAC000002  | 2012-12-16|  0.404|   11:00:00
4  | MAC000003  | 2012-12-16|  0.845|   00:30:00
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(data), sep='|')
df.columns = [c.strip() for c in df.columns]
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
ddf['index'] = ddf.index
#dask requires string as category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] =ddf['name'].cat.as_known()
ddf.pivot_table(columns='name', values='value', index='index').compute()
# name    MAC000002     MAC000003  
# index                            
# 0             0.147           NaN
# 1             0.110           NaN
# 2             0.736           NaN
# 3             0.404           NaN
# 4               NaN         0.845
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.