I have created different bins for each column and grouped the DataFrame based on these.

import pandas as pd
import numpy as np
np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'value'])
# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)
df['a_bins'] = pd.cut(df['a'], bins=bins)
df['b_bins'] = pd.cut(df['b'], bins=bins)
df['c_bins'] = pd.cut(df['c'], bins=bins)

The output of df.groupby(['a_bins','b_bins','c_bins']).size() shows that there are 2,685,619 groups.
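That figure is simply the full Cartesian product of the bin categories: np.arange(-3, 4, 0.05) has 140 edges, so each column has 139 bins, and 139³ = 2,685,619 possible combinations, almost all of which are empty for only 100 rows. A quick sanity check (a minimal sketch reusing the code above):

n_bins = len(bins) - 1   # 139 intervals per column
n_bins ** 3              # 2685619 possible (a_bins, b_bins, c_bins) groups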

Calculate statistics of each group

Then, the statistics of each group are calculated like this:

%%timeit
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']})
>>> 16.9 s ± 637 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected output

  • Is it possible to speed this up?
  • The quicker method should also support finding the value by inputs of a, b, and c values, like this:
    df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']}).loc[(-1.72, 0.32, 1.18)]
    >>> -0.252436
Kindly create a sample dataframe with the expected output, so we are sure our results match and we're on the right track. – sammywemmy Dec 22, 2021 at 21:40

@sammywemmy Thanks for the suggestion. np.random.seed() makes sure we have the same DataFrame. I have updated the expected outputs now. – zxdawn Dec 22, 2021 at 21:56
    

    For this data, I'd suggest you pivot the data and take the mean. Usually this is faster, since you operate on the entire dataframe at once instead of going through each group:

    (df
     .pivot(None, ['a_bins', 'b_bins', 'c_bins'], 'value')
     .mean()
     .sort_index()  # ignore this if you are not fuzzy on order
    )

    a_bins         b_bins         c_bins
    (-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
                   (0.75, 0.8]    (-0.3, -0.25]    0.480411
    (-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
                   (0.75, 0.8]    (-0.25, -0.2]   -1.184411
    (-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                        ...
    (1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
    (1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
    (1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
    (2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
    (2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432

    This matches the output from the groupby:

    (df
     .groupby(['a_bins','b_bins','c_bins'])
     .agg({'value':['mean']})
     .dropna()
     .squeeze()
    )

    a_bins         b_bins         c_bins
    (-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
                   (0.75, 0.8]    (-0.3, -0.25]    0.480411
    (-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
                   (0.75, 0.8]    (-0.25, -0.2]   -1.184411
    (-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                        ...
    (1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
    (1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
    (1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
    (2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
    (2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432
    Name: (value, mean), Length: 100, dtype: float64

    The pivot option gives a speed of 3.72ms on my PC, while I had to terminate the groupby option, as it was taking too long (my PC is quite old :))

    Again, the reason this is faster is that the mean operates on the entire dataframe at once, rather than going through each group in the groupby.
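    To make that concrete, here is a small sketch of what the pivoted frame looks like for this seed: the column index is the set of observed (a_bins, b_bins, c_bins) combinations, so .mean() is a single vectorised column-wise reduction:

    wide = df.pivot(None, ['a_bins', 'b_bins', 'c_bins'], 'value')
    wide.shape   # (100, 100) here: 100 rows by 100 observed bin combinations
    wide.mean()  # one pass over the whole frame, no per-group Python overhead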

    As to your other question, you can index it easily:

    bin_mean = (df
                .pivot(None, ['a_bins', 'b_bins', 'c_bins'], 'value')
                .mean()
                .sort_index()  # ignore this if you are not fuzzy on order
                )

    bin_mean.loc[(-1.72, 0.32, 1.18)]
    -0.25243603652138985

    The main problem, though, is that for categorical group keys pandas returns a row for every possible combination of categories, even the empty ones (which is wasteful and inefficient); pass observed=True and you should notice a dramatic improvement:

    (df.groupby(['a_bins','b_bins','c_bins'], observed=True)
       .agg({'value':['mean']})
    )
                                                  value
    a_bins        b_bins        c_bins                 
    (-2.15, -2.1] (0.25, 0.3]   (-1.3, -1.25]  0.929100
                  (0.75, 0.8]   (-0.3, -0.25]  0.480411
    (-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35]   -1.684900
                  (0.75, 0.8]   (-0.25, -0.2] -1.184411
    (-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
    ...                                             ...
    (1.7, 1.75]   (-0.75, -0.7] (1.05, 1.1]   -0.229518
    (1.85, 1.9]   (-0.4, -0.35] (1.8, 1.85]    0.003017
    (1.9, 1.95]   (-1.45, -1.4] (0.1, 0.15]    0.949361
    (2.05, 2.1]   (-0.35, -0.3] (-0.65, -0.6]  0.763184
    (2.25, 2.3]   (-0.95, -0.9] (0.1, 0.15]    2.539432
    

    Speed is about 7.39ms on my PC, roughly twice the pivot option's time, but still way faster than the original groupby, because only the category combinations that actually exist in the dataframe are computed and returned.
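    A quick way to see where the saving comes from (a sketch reusing the question's df):

    g_default  = df.groupby(['a_bins', 'b_bins', 'c_bins']).agg({'value': ['mean']})
    g_observed = df.groupby(['a_bins', 'b_bins', 'c_bins'], observed=True).agg({'value': ['mean']})
    len(g_default)    # 2685619 rows: every possible category combination
    len(g_observed)   # 100 rows: only the combinations that actually occur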

    Nice example of pivot! But when I increase the data to 100000 rows, it raises Unable to allocate 65.2 GiB for an array with shape (100000, 87554) and data type float64. @Nikita Almakov's method still works well. – zxdawn Dec 23, 2021 at 9:01

    Hmmmm…. That is a lot of memory. Same issue when you run the groupby with observed=True? @NikitaAlmakov's library is awesome. – sammywemmy Dec 23, 2021 at 9:59

    Ha, I only tested pivot. observed=True works well and it is faster than @NikitaAlmakov's method! convtools: 777 ms ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), observed=True: 32.1 ms ± 592 µs per loop (mean ± std. dev. of 7 runs, 10 loops each). – zxdawn Dec 23, 2021 at 10:07

    Note that for data of length 1000, they're similar: observed=True: 7.06 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) and convtools: 7.32 ms ± 667 µs per loop (mean ± std. dev. of 7 runs, 100 loops each). It's hard to pick which one is the answer, hah. Maybe you can make a comparison figure of both methods for different data lengths? Then I have no doubt about accepting your answer. – zxdawn Dec 23, 2021 at 10:11

    @XinZhang this was a pandas-related question; sammywemmy answered it, while I just shared an alternative option, which may be helpful if some stream processing is needed and the input data doesn't fit into memory. I'd vote for accepting sammy's one :) – westandskif Dec 23, 2021 at 10:14

    An alternative, straightforward solution based on convtools, which can process an input stream of data and doesn't require the input to fit into memory:

    import numpy as np
    import pandas as pd
    from convtools import conversion as c
    def c_bin(left, right, bin_size):
        # map a scalar to its (left_edge, right_edge) bin, or None if out of range
        return c.if_(
            c.or_(c.this < left, c.this > right),
            None,
            ((c.this - left) // bin_size).pipe(
                (c.this * bin_size + left, (c.this + 1) * bin_size + left)
            ),
        )

    to_binned = c_bin(-3, 4, 0.05)
    # turn a (left, right) tuple into a pd.Interval, keeping None as None
    to_interval = c.if_(c.this, c.apply_func(pd.Interval, c.this, {}), None)

    a_bins = c.item(0).pipe(to_binned)
    b_bins = c.item(1).pipe(to_binned)
    c_bins = c.item(2).pipe(to_binned)

    # group rows by the three binned values and average the fourth column
    converter = (
        c.group_by(a_bins, b_bins, c_bins)
        .aggregate(
            {
                "a_bins": a_bins.pipe(to_interval),
                "b_bins": b_bins.pipe(to_interval),
                "c_bins": c_bins.pipe(to_interval),
                "value_mean": c.ReduceFuncs.Average(c.item(3)),
            }
        )
        .gen_converter()
    )
    np.random.seed(100)
    data = np.random.randn(100, 4)
    df = pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"])
    df.loc[(-1.72, 0.32, 1.18)]
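    Since the generated converter simply iterates over whatever it is given, the same aggregation can consume a lazy stream of rows instead of an in-memory array, which is where the "doesn't need to fit into memory" claim comes from. A minimal sketch; read_rows here is a hypothetical generator standing in for a CSV reader, DB cursor, etc.:

    def read_rows():
        # hypothetical streaming source yielding (a, b, c, value) rows
        np.random.seed(100)
        for row in np.random.randn(100, 4):
            yield row

    converter(read_rows())  # same grouped result, built row by row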
    

    Timings:

    In [44]: %timeit converter(data)
    438 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    # passing back to pandas, timing the end-to-end thing:
    In [43]: %timeit pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"]).loc[(-1.72, 0.32, 1.18)]
    2.37 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
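    As an aside, convtools works by generating plain Python code for the aggregation. If I recall its API correctly (this is an assumption; check the convtools docs), passing debug=True to gen_converter prints the generated function, e.g.:

    debug_converter = (
        c.group_by(a_bins, b_bins, c_bins)
        .aggregate({"value_mean": c.ReduceFuncs.Average(c.item(3))})
        .gen_converter(debug=True)  # assumption: prints the generated source code
    )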
    

    JFYI: Shortened output of converter(data):

    {'a_bins': Interval(-0.44999999999999973, -0.3999999999999999, closed='right'),
     'b_bins': Interval(0.7000000000000002, 0.75, closed='right'),
     'c_bins': Interval(-0.19999999999999973, -0.1499999999999999, closed='right'),
     'value_mean': -0.08605564337254189},
    {'a_bins': Interval(-0.34999999999999964, -0.2999999999999998, closed='right'),
     'b_bins': Interval(-0.1499999999999999, -0.09999999999999964, closed='right'),
     'c_bins': Interval(0.050000000000000266, 0.10000000000000009, closed='right'),
     'value_mean': 0.18971879197958597},
    {'a_bins': Interval(-2.05, -2.0, closed='right'),
     'b_bins': Interval(0.75, 0.8000000000000003, closed='right'),
     'c_bins': Interval(-0.25, -0.19999999999999973, closed='right'),
     'value_mean': -1.1844114274105708}]

    Thanks for the amazing tool. It's really fast! Is it possible to meet the second requirement of Expected output? BTW, could you explain why this is much faster than @sammywemmy's method? – zxdawn Dec 23, 2021 at 8:55

    @XinZhang Hope this will be a decent complement to polars/pandas in your toolkit! :) I've updated the above to meet the 2nd requirement (missed this part, sorry for this). Regarding the speed, it's nowhere near as fast as polars/pandas, which operate at a lower level and use vectorization. However, convtools is built to generate simple, fast, ad hoc raw Python code by dynamic code generation, which sometimes helps :) It may also improve code reuse a lot! – westandskif Dec 23, 2021 at 9:41

    This is a good use case for scipy.stats.binned_statistic_dd. The snippet below computes the mean statistic only, but many other statistics are supported (see the linked docs):

    import numpy as np
    import pandas as pd
    np.random.seed(100)
    df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])
    # for simplicity, I use the same bin here
    bins = np.arange(-3, 4, 0.05)
    df["a_bins"] = pd.cut(df["a"], bins=bins)
    df["b_bins"] = pd.cut(df["b"], bins=bins)
    df["c_bins"] = pd.cut(df["c"], bins=bins)
    # this takes about 35 seconds
    result_pandas = df.groupby(["a_bins", "b_bins", "c_bins"]).agg({"value": ["mean"]})
    from scipy.stats import binned_statistic_dd
    # this takes about 20 ms
    result_scipy = binned_statistic_dd(
        df[["a", "b", "c"]].to_numpy(), df["value"], bins=(bins, bins, bins)
    )
    # this is a verbose way to get a dataframe representation
    # for many purposes this probably will not be needed
    # takes about 5 seconds
    temp_list = []
    for na, a in enumerate(result_scipy[1][0][:-1]):
        for nb, b in enumerate(result_scipy[1][1][:-1]):
            for nc, c in enumerate(result_scipy[1][2][:-1]):
                value = result_scipy[0][na, nb, nc]
                temp_list.append([a, b, c, value])
    result_scipy_as_df = pd.DataFrame(temp_list, columns=list("abcx"))
    # check that the result is the same
    result_scipy_as_df["x"].describe() == result_pandas["value"]["mean"].describe()
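    For other statistics, only the statistic keyword changes; "count" and "std" below are among the options documented for binned_statistic_dd (a small sketch):

    result_counts = binned_statistic_dd(
        df[["a", "b", "c"]].to_numpy(), df["value"],
        statistic="count", bins=(bins, bins, bins),
    )
    result_std = binned_statistic_dd(
        df[["a", "b", "c"]].to_numpy(), df["value"],
        statistic="std", bins=(bins, bins, bins),
    )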
    

    If you are interested in speeding this up further, this answer might be useful.

    An important caveat is that binned_statistic_dd uses bins that are closed on the left and open on the right, e.g. [0, 1), except for the last one (refer to the Notes in the linked docs), so for consistent bin identifiers one would have to use right=False in pd.cut.
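    A minimal illustration of the difference (a sketch reusing the question's df and bins):

    pd.cut(df["a"], bins=bins).head(1)               # default: right-closed intervals like (x, y]
    pd.cut(df["a"], bins=bins, right=False).head(1)  # left-closed intervals like [x, y), matching binned_statistic_dd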

    Here's a look-up example. Note that the bin edges passed to np.digitize are shifted by one (the first edge is dropped) so that the returned indices are consistent with the pandas result:

    aloc, bloc, cloc = -2.12, 0.23, -1.25
    print(result_pandas.loc[(aloc, bloc, cloc)])
    print(result_scipy.statistic[
        np.digitize(aloc, result_scipy.bin_edges[0][1:]),
        np.digitize(bloc, result_scipy.bin_edges[1][1:]),
        np.digitize(cloc, result_scipy.bin_edges[2][1:]),
    ])
    Oh, I realize that @sammywemmy's method drops NaN values. If users need the NaN values, then your answer is quite useful! Thanks a lot ;) – zxdawn Dec 28, 2021 at 9:18

    Note that np.digitize() should be given right=True; otherwise the maximum value falls outside the index range. – zxdawn Dec 28, 2021 at 20:45
    

    Because your bins are the same for all 3 columns, use the codes from the cat accessor:

    %timeit df.groupby([df['a_bins'].cat.codes, df['b_bins'].cat.codes, df['c_bins'].cat.codes])['value'].mean()
    1.82 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
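    The result above is keyed by integer category codes rather than intervals. A sketch of how one could look a value up by raw a, b, c inputs: get_indexer on the shared IntervalIndex maps each scalar to its bin code (the helper names here are illustrative, and the value is the one from the question's expected output):

    means = df.groupby([df['a_bins'].cat.codes,
                        df['b_bins'].cat.codes,
                        df['c_bins'].cat.codes])['value'].mean()
    cats = df['a_bins'].cat.categories   # same IntervalIndex for all three columns here
    key = tuple(cats.get_indexer([v])[0] for v in (-1.72, 0.32, 1.18))
    means.loc[key]                       # ≈ -0.252436 for the question's seed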
    They are not the same in my real case. The example above uses the same bins only for simplicity, as mentioned in the code comment. – zxdawn Dec 22, 2021 at 17:21
            
