Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

How do I add a color column to the following dataframe so that color='green' if Set == 'Z' , and color='red' otherwise?

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,
  • then use

    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    conditions = [
        (df['Set'] == 'Z') & (df['Type'] == 'A'),
        (df['Set'] == 'Z') & (df['Type'] == 'B'),
        (df['Type'] == 'B')]
    choices = ['yellow', 'blue', 'purple']
    df['color'] = np.select(conditions, choices, default='black')
    print(df)
    

    which yields

      Set Type   color
    0   Z    A  yellow
    1   Z    B    blue
    2   X    B  purple
    3   Y    C   black
    

    List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

    Example list comprehension:

    df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
    

    %timeit tests:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    %timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
    %timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
    %timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
    1000 loops, best of 3: 239 µs per loop
    1000 loops, best of 3: 523 µs per loop
    1000 loops, best of 3: 263 µs per loop
                    Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).
    – blacksite
                    Apr 20, 2017 at 16:45
                    Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]
    – Matti
                    Jan 1, 2019 at 6:42
                    Add iterrows to the dataframe, then you can access multiple columns via row:  ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in in df.iterrows()]
    – cheekybastard
                    Jan 14, 2019 at 1:38
                    Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])
    – Paul Rougieux
                    Sep 17, 2019 at 15:28
                    @cheekybastard Or don't, since .iterrows() is notoriously sluggish and the DataFrame shouldn't be modified while iterating.
    – AMC
                    Feb 10, 2020 at 1:51
    

    The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

    Simple example using just the "Set" column:

    def set_color(row):
        if row["Set"] == "Z":
            return "red"
        else:
            return "green"
    df = df.assign(color=df.apply(set_color, axis=1))
    print(df)
    
      Set Type  color
    0   Z    A    red
    1   Z    B    red
    2   X    B  green
    3   Y    C  green
    

    Example with more colours and more columns taken into account:

    def set_color(row):
        if row["Set"] == "Z":
            return "red"
        elif row["Type"] == "C":
            return "blue"
        else:
            return "green"
    df = df.assign(color=df.apply(set_color, axis=1))
    print(df)
    
      Set Type  color
    0   Z    A    red
    1   Z    B    red
    2   X    B  green
    3   Y    C   blue
    

    Edit (21/06/2019): Using plydata

    It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

    from plydata import define, if_else
    

    Simple if_else:

    df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))
    print(df)
    
      Set Type  color
    0   Z    A    red
    1   Z    B    red
    2   X    B  green
    3   Y    C  green
    

    Nested if_else:

    df = define(df, color=if_else(
        'Set=="Z"',
        '"red"',
        if_else('Type=="C"', '"green"', '"blue"')))
    print(df)                            
    
      Set Type  color
    0   Z    A    red
    1   Z    B    red
    2   X    B   blue
    3   Y    C  green
    

    Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

    def map_values(row, values_dict):
        return values_dict[row]
    values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
    df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})
    df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))
    

    What's it look like:

    Out[2]: INDICATOR VALUE NEW_VALUE 0 A 10 1 1 B 9 2 2 C 8 3 3 D 7 4

    This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

    And of course you could always do this:

    df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)
    

    But that approach is more than three times as slow as the apply approach from above, on my machine.

    And you could also do this, using dict.get:

    df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
                    Update: On 100,000,000 rows, 52 string values, .apply() takes 47 seconds, versus only 5.91 seconds for .map().
    – AMC
                    Feb 10, 2020 at 2:18
    

    You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).

    Code Summary:

    df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
    df['Color'] = "red"
    df.loc[(df['Set']=="Z"), 'Color'] = "green"
    #practice!
    df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
    

    Explanation:

    df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
    # df so far: 
      Type Set  
    0    A   Z 
    1    B   Z 
    2    B   X 
    3    C   Y
    

    add a 'color' column and set all values to "red"

    df['Color'] = "red"
    

    Apply your single condition:

    df.loc[(df['Set']=="Z"), 'Color'] = "green"
    # df: 
      Type Set  Color
    0    A   Z  green
    1    B   Z  green
    2    B   X    red
    3    C   Y    red
    

    or multiple conditions if you want:

    df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
    

    You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas

    You can use pandas methods where and mask:

    df['color'] = 'green'
    df['color'] = df['color'].where(df['Set']=='Z', other='red')
    # Replace values where the condition is False
    
    df['color'] = 'red'
    df['color'] = df['color'].mask(df['Set']=='Z', other='green')
    # Replace values where the condition is True
    

    Alternatively, you can use the method transform with a lambda function:

    df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')
    

    Output:

      Type Set  color
    1    A   Z  green
    2    B   Z  green
    3    B   X    red
    4    C   Y    red
    

    Performance comparison from @chai:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
    %timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
    %timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
    %timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
    %timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
    397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    976 ms ± 241 ms per loop
    673 ms ± 139 ms per loop
    796 ms ± 182 ms per loop
    

    if you have only 2 choices, use np.where()

    df = pd.DataFrame({'A':range(3)})
    df['B'] = np.where(df.A>2, 'yes', 'no')
    

    if you have over 2 choices, maybe apply() could work input

    arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})
    

    and arr is

        A   B   C   D
    0   a   0   3   6
    1   b   1   4   7
    2   c   2   5   8
    

    if you want the column E tobe if arr.A =='a' then arr.B elif arr.A=='b' then arr.C elif arr.A == 'c' then arr.D else something_else

    arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)
    

    and finally the arr is

        A   B   C   D   E
    0   a   0   3   6   0
    1   b   1   4   7   4
    2   c   2   5   8   8
    

    One liner with .apply() method is following:

    df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
    

    After that, df data frame looks like this:

    >>> print(df)
      Type Set  color
    0    A   Z  green
    1    B   Z  green
    2    B   X    red
    3    C   Y    red
    

    The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:

    For a single condition:

    df.case_when(
        df.col1 == "Z",  # condition
        "green",         # value if True
        "red",           # value if False
        column_name = "color"
      Type Set  color
    1    A   Z  green
    2    B   Z  green
    3    B   X    red
    4    C   Y    red
    

    For multiple conditions:

    df.case_when(
        df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result
        df.Set.eq('Z') & df.Type.eq('B'), 'blue',   # condition, result
        df.Type.eq('B'), 'purple',                  # condition, result
        'black',              # default if none of the conditions evaluate to True
        column_name = 'color'  
      Type  Set   color
    1    A   Z  yellow
    2    B   Z    blue
    3    B   X  purple
    4    C   Y   black
    

    More examples can be found here

    Here is an easy one-liner you can use when you have one or several conditions:

    df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")
    

    Easy and good to go!

    See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html

    If you're working with massive data, a memoized approach would be best:

    # First create a dictionary of manually stored values
    color_dict = {'Z':'red'}
    # Second, build a dictionary of "other" values
    color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
    # Next, merge the two
    color_dict.update(color_dict_other)
    # Finally, map it to your column
    df['color'] = df['Set'].map(color_dict)
    

    This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

    E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.

    Alright, so with only 2 distinct values to map, 100,000,000 rows, it takes 6.67 seconds to run without "memoization", and 9.86 seconds with. – AMC Feb 10, 2020 at 2:31 100,000,000 rows, 52 distinct values, where 1 of those maps to the first output value, and the other 51 all correspond to the other: 7.99 seconds without memoization, 11.1 seconds with. – AMC Feb 10, 2020 at 2:48 Are your values in random order? Or are they back to back? High speed of pandas could be due to caching @AMC – Yaakov Bressler Feb 10, 2020 at 4:14 Are your values in random order? Or are they back to back? Values are random, selected using random.choices(). – AMC Feb 10, 2020 at 22:56

    A Less verbose approach using np.select:

    a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
    df = pd.DataFrame(a,columns=['Type','Set'])
    conditions = [
        df['Set'] == 'Z'
    outputs = [
        'Green'
                 # conditions Z is Green, Red Otherwise.
    res = np.select(conditions, outputs, 'Red')
    array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
    df.insert(2, 'new_column',res)    
        Type    Set new_column
    0   A   Z   Green
    1   B   Z   Green
    2   B   X   Red
    3   C   Y   Red
    df.to_numpy()    
    array([['A', 'Z', 'Green'],
           ['B', 'Z', 'Green'],
           ['B', 'X', 'Red'],
           ['C', 'Y', 'Red']], dtype=object)
    %%timeit conditions = [df['Set'] == 'Z'] 
    outputs = ['Green'] 
    np.select(conditions, outputs, 'Red')
    134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
    %%timeit conditions = [df2['Set'] == 'Z'] 
    outputs = ['Green'] 
    np.select(conditions, outputs, 'Red')
    188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    				Python Pandas replace values in one column based on conditional in multiple other columns
                    See more linked questions