I have this transform, which works with df.select():

df = df.select(
    F.coalesce(
        F.sum(
            F.when(
                (F.col("ORDERS_VIA_ARTICLE") > 0) &
                (F.col("ORDER_SUCCESS_URL") != "%16237890%") &
                (F.col("ORDER_SUCCESS_URL") != "%30427132%") &
                (F.col("ORDER_SUCCESS_URL") != "%242518801%") |
                (F.col("ORDER_SUCCESS_URL").isNull()),
                F.col("ORDERS_VIA_ARTICLE")
            ).otherwise(F.lit(0))
        ),
        F.lit(0)
    ).alias("report_sum_orders_via_article")
)
return df
Now, I wanted to use the same logic with df.withColumn() instead of df.select().
I tried this (removed the coalesce for now):
def dev_prev_month(clean_joined_traffic_data):
    df = clean_joined_traffic_data
    df = df.withColumn(
        "report_sum_orders_via_article",
        F.sum(
            F.when(
                (F.col("ORDERS_VIA_ARTICLE") > 0) &
                (F.col("ORDER_SUCCESS_URL") != "%16237890%") &
                (F.col("ORDER_SUCCESS_URL") != "%30427132%") &
                (F.col("ORDER_SUCCESS_URL") != "%242518801%") |
                (F.col("ORDER_SUCCESS_URL").isNull()),
                F.col("ORDERS_VIA_ARTICLE")
            ).otherwise(F.lit(0))
        )
    )
    return df
However, here I get the following error:
pyspark.sql.utils.AnalysisException: grouping expressions sequence is empty, and '`!ri.foundry.main.transaction.123-123:ri.foundry.main.transaction.xxxx:master`.ORDERS' is not an aggregate function.
What am I missing?
As the error states, you cannot use an aggregate function without a groupBy. With select, the grouping is implicit; you can think of it as groupBy().agg(F.sum(your_condition)). So it may look like this:
df.groupBy().agg(
    F.sum(
        F.when(
            (F.col("ORDERS_VIA_ARTICLE") > 0)
            & (F.col("ORDER_SUCCESS_URL") != "%16237890%")
            & (F.col("ORDER_SUCCESS_URL") != "%30427132%")
            & (F.col("ORDER_SUCCESS_URL") != "%242518801%")
            | (F.col("ORDER_SUCCESS_URL").isNull()),
            F.col("ORDERS_VIA_ARTICLE"),
        ).otherwise(F.lit(0))
    ).alias("report_sum_orders_via_article")
)
Edit: if you want to use withColumn, you can use a window function:
from pyspark.sql import Window

# An empty partitionBy() creates one window spanning all rows.
windowSpec = Window.partitionBy()

df.withColumn(
    "report_sum_orders_via_article",
    F.sum(
        F.when(
            (F.col("ORDERS_VIA_ARTICLE") > 0)
            & (F.col("ORDER_SUCCESS_URL") != "%16237890%")
            & (F.col("ORDER_SUCCESS_URL") != "%30427132%")
            & (F.col("ORDER_SUCCESS_URL") != "%242518801%")
            | (F.col("ORDER_SUCCESS_URL").isNull()),
            F.col("ORDERS_VIA_ARTICLE"),
        ).otherwise(F.lit(0))
    ).over(windowSpec),
)
With groupBy, you need to join the result back to the original df. That can still be a valid option and may even perform better, so if you run into any problems, my advice is to try both approaches.
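As a quick sanity check of what the when/otherwise expression actually sums, here is a plain-Python sketch of the same conditional sum over a few hypothetical rows. The sample data, the EXCLUDED set, and the keep helper are illustrative only; this mirrors the boolean shape of the condition (Python's and binds tighter than or, just as & binds tighter than | in the Spark expression), not Spark's three-valued NULL handling:

```python
# Hypothetical sample rows mirroring the two columns used in the condition.
rows = [
    {"ORDERS_VIA_ARTICLE": 3, "ORDER_SUCCESS_URL": "/checkout/done"},
    {"ORDERS_VIA_ARTICLE": 5, "ORDER_SUCCESS_URL": "%16237890%"},     # excluded literal
    {"ORDERS_VIA_ARTICLE": 2, "ORDER_SUCCESS_URL": None},             # NULL -> counted
    {"ORDERS_VIA_ARTICLE": 0, "ORDER_SUCCESS_URL": "/checkout/done"}, # no orders
]

# The three literal values the != comparisons exclude.
EXCLUDED = {"%16237890%", "%30427132%", "%242518801%"}

def keep(row):
    url = row["ORDER_SUCCESS_URL"]
    # (orders > 0 AND url not one of the excluded literals) OR url IS NULL.
    return (row["ORDERS_VIA_ARTICLE"] > 0 and url not in EXCLUDED) or url is None

# when(cond, col).otherwise(0) summed over all rows.
total = sum(r["ORDERS_VIA_ARTICLE"] if keep(r) else 0 for r in rows)
print(total)  # 3 + 2 = 5
```

Tracing a few rows by hand like this makes it easier to confirm that the operator precedence in the Spark condition groups the & chain before the | (F.col("ORDER_SUCCESS_URL").isNull()) branch.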
You cannot pass the result of groupBy and agg directly to withColumn, because withColumn expects a Column, not a DataFrame.