相关文章推荐
飘逸的四季豆  ·  记一次Razor ...·  5 月前    · 
笑点低的西红柿  ·  Android webView ...·  1 年前    · 
性感的煎鸡蛋  ·  js base64 转 ...·  1 年前    · 
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams (F.col("ORDER_SUCCESS_URL") != "%16237890%") & (F.col("ORDER_SUCCESS_URL") != "%30427132%") & (F.col("ORDER_SUCCESS_URL") != "%242518801%") | (F.col("ORDER_SUCCESS_URL").isNull()) F.col("ORDERS_VIA_ARTICLE") ).otherwise(F.lit(0)) F.lit(0) ).alias("report_sum_orders_via_article") return df

Now, I wanted to use the same logic with df.withColumn() instead of df.select().

I tried this (removed the coalesce for now):

def dev_prev_month(clean_joined_traffic_data):
    df = clean_joined_traffic_data
    df = df.withColumn(
        "report_sum_orders_via_article",_sum(
                F.when(
                    (F.col("ORDERS_VIA_ARTICLE") > 0) &
                        (F.col("ORDER_SUCCESS_URL") != "%16237890%") &
                        (F.col("ORDER_SUCCESS_URL") != "%30427132%") &
                        (F.col("ORDER_SUCCESS_URL") != "%242518801%") |
                        (F.col("ORDER_SUCCESS_URL").isNull())
                    F.col("ORDERS_VIA_ARTICLE")
                ).otherwise(F.lit(0)))
    return df

However, here I get an error that:

pyspark.sql.utils.AnalysisException: grouping expressions sequence is empty, and '`!ri.foundry.main.transaction.123-123:ri.foundry.main.transaction.xxxx:master`.ORDERS' is not an aggregate function.

what am i missing out on?

As stated in error you cannot use aggreagte function without groupBy. In case of select it is used implicitly, you can think about it as groupBy().agg(sum(your_condition)) so it may look like this

df.groupBy().agg(
    F.sum(
        F.when(
            (F.col("ORDERS_VIA_ARTICLE") > 0)
                (F.col("ORDER_SUCCESS_URL") != "%16237890%")
                & (F.col("ORDER_SUCCESS_URL") != "%30427132%")
                & (F.col("ORDER_SUCCESS_URL") != "%242518801%")
                | (F.col("ORDER_SUCCESS_URL").isNull())
            F.col("ORDERS_VIA_ARTICLE"),
        ).otherwise(F.lit(0))
    ).alias("report_sum_orders_via_article")

Edit: if you want to use it as withColumn you may use window function

windowSpec  = Window.partitionBy()
df.withColumn("report_sum_orders_via_article", F.sum(
        F.when(
            (F.col("ORDERS_VIA_ARTICLE") > 0)
                (F.col("ORDER_SUCCESS_URL") != "%16237890%")
                & (F.col("ORDER_SUCCESS_URL") != "%30427132%")
                & (F.col("ORDER_SUCCESS_URL") != "%242518801%")
                | (F.col("ORDER_SUCCESS_URL").isNull())
            F.col("ORDERS_VIA_ARTICLE"),
        ).otherwise(F.lit(0))
    ).over(windowSpec))

With groupBy you need to join your results back to original df which of course still may be an option and may have better performance so in case of any problem my advise is to check both options

You cannot use groupBy and agg directly with "withColumn" function as it expects col and not dataframe

you're using alias here. I want to use the withColumn syntax. could you modify it accordingly? – x89 Feb 23 at 11:45

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.