相关文章推荐
高大的佛珠  ·  How to set a spark ...·  1 年前    · 
豁达的黄豆  ·  当HPC碰撞AI ...·  1 年前    · 
卖萌的自行车  ·  如何通过MVC ...·  1 年前    · 
Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams t1.month, (CASE WHEN t2.id is null then false else true end) as id_present, (CASE WHEN ((NOT (t1.id IN (SELECT DISTINCT id FROM all a WHERE a.dates = current_date - Interval '2' day))) ((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL '12' MONTH))))) THEN 'Action1' WHEN ((NOT (t1.id IN (SELECT DISTINCT id FROM all a WHERE a.dates = current_date - Interval '2' day))) AND (t1.count > 0) AND (t1.click_date < (current_date - INTERVAL '12' MONTH))) THEN 'Action2' ELSE 'Action3' END) actions FROM dataset1 t1 LEFT JOIN current t2 ON t1.id = t2.id

Now the error I am facing is Exception in thread "main" org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few commands:

This query is working fine in basic SQL but not in spark SQL and I understand that it is because of me using NOT IN in CASE WHEN . What is the alternative to this query in order to avoid this error?

Please update your query to explicit show which table these columns belong to: 'dates, count, click_date, current_date' Trung Duong Mar 16 at 13:32

In Spark,

  • If the number of distinct ids in all is small, you can use collect_set() function to create the distinct id array, cross join dataset1 and then array_contains() function to check whether id in distinct id array.
  • Here is the query:

    SELECT
         t1.id,
         year,
         month,
         (CASE WHEN t2.id is null then false else true end) as id_present,
         (CASE WHEN ((NOT array_contains(dia.distinct_id_array,t1.id))) 
                     ((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL  '12' MONTH)))))
                THEN 'Action1' 
                WHEN ((NOT array_contains(dia.distinct_id_array,t1.id)))
                        AND (count > 0) 
                        AND (click_date < (current_date - INTERVAL  '12' MONTH))) 
                THEN 'Action2' 
                ELSE 'Action3' END
        ) as actions
    FROM dataset1 t1 
    LEFT JOIN current t2 
    ON t1.id = t2.id
    CROSS JOIN (
    (SELECT collect_set(id) as distinct_id_array
     FROM all
     WHERE dates = current_date - Interval '2' day
    ) dia
    
  • If the number of distinct id in all is big, you need to use left join with dataset1 and group by to generate a is_in_all flag for each id in t1.
  • Here is the query:

    WITH id_flag (
    SELECT
        dataset1.id,
        COUNT(all.id) > 0 as is_in_all
        dataset1
    LEFT JOIN
        dataset1.id = all.id
    GROUP BY dataset1.id
    SELECT
         t1.id,
         year,
         month,
         (CASE WHEN t2.id is null then false else true end) as id_present,
         (CASE WHEN ((NOT is_in_all)) 
                     ((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL  '12' MONTH)))))
                THEN 'Action1' 
                WHEN ((NOT is_in_all))
                        AND (count > 0) 
                        AND (click_date < (current_date - INTERVAL  '12' MONTH))) 
                THEN 'Action2' 
                ELSE 'Action3' END
        ) as actions
    FROM dataset1 t1 
    LEFT JOIN current t2 
    ON t1.id = t2.id
    JOIN id_flag
       t1.id = id_flag.id
                    Thank you for your response. How do you define small and big here? If we do collect_set,  this set would contain around 200K ids.
    – Mikasa
                    Mar 17 at 3:51
                    Hi @Mikasa, this depends on your dev env. 10K should be a safe threshold to distinguish small and big in most env.
    – Junjie
                    Mar 17 at 9:02
    

    You could try to use this query. It's a modified version of your query, I've moved sub-queries in your CASE WHEN to JOIN clause, and replace NOT IN with NOT EXISTS expression.

    SELECT
         t1.id,
         t1.year,
         t1.day,
         t1.month,
         (CASE WHEN t2.id IS NULL THEN false ELSE true END) AS id_present,
         (CASE WHEN (t3.id IS NOT NULL  
                     ((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL  '12' MONTH)))))
                THEN 'Action1' 
                WHEN (t3.id IS NOT NULL
                      AND (t1.count > 0) 
                      AND (t1.click_date < (current_date - INTERVAL  '12' MONTH))) 
                THEN 'Action2' 
                ELSE 'Action3' END) actions
    FROM dataset1 t1 
         LEFT JOIN current t2 ON t1.id = t2.id
         LEFT JOIN 
           (SELECT id
            FROM   dataset1 t
            WHERE  NOT EXISTS 
                  (SELECT 1
                   FROM all a
                   WHERE t.id = a.id AND a.dates = current_date - Interval '2' day)
           ) t3 ON t1.id = t3.id
                    Yes, since we only need to check ID in table dataset1 which is not existed in table all, so it only needs to SELECT 1
    – Trung Duong
                    Mar 17 at 4:41
            

    Thanks for contributing an answer to Stack Overflow!

    • Please be sure to answer the question. Provide details and share your research!

    But avoid

    • Asking for help, clarification, or responding to other answers.
    • Making statements based on opinion; back them up with references or personal experience.

    To learn more, see our tips on writing great answers.