Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
t1.month,
(CASE WHEN t2.id is null then false else true end) as id_present,
(CASE WHEN ((NOT (t1.id IN (SELECT DISTINCT id
FROM all a
WHERE a.dates = current_date - Interval '2' day)))
((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL '12' MONTH)))))
THEN 'Action1'
WHEN ((NOT (t1.id IN (SELECT DISTINCT id
FROM all a
WHERE a.dates = current_date - Interval '2' day)))
AND (t1.count > 0)
AND (t1.click_date < (current_date - INTERVAL '12' MONTH)))
THEN 'Action2'
ELSE 'Action3' END) actions
FROM dataset1 t1 LEFT JOIN current t2 ON t1.id = t2.id
Now the error I am facing is
Exception in thread "main" org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few commands:
This query is working fine in basic SQL but not in spark SQL and I understand that it is because of me using
NOT IN
in
CASE WHEN
. What is the alternative to this query in order to avoid this error?
–
In Spark,
If the number of distinct ids in
all
is small, you can use
collect_set()
function to create the distinct id array,
cross join
dataset1 and then
array_contains()
function to check whether id in distinct id array.
Here is the query:
SELECT
t1.id,
year,
month,
(CASE WHEN t2.id is null then false else true end) as id_present,
(CASE WHEN ((NOT array_contains(dia.distinct_id_array,t1.id)))
((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL '12' MONTH)))))
THEN 'Action1'
WHEN ((NOT array_contains(dia.distinct_id_array,t1.id)))
AND (count > 0)
AND (click_date < (current_date - INTERVAL '12' MONTH)))
THEN 'Action2'
ELSE 'Action3' END
) as actions
FROM dataset1 t1
LEFT JOIN current t2
ON t1.id = t2.id
CROSS JOIN (
(SELECT collect_set(id) as distinct_id_array
FROM all
WHERE dates = current_date - Interval '2' day
) dia
If the number of distinct id in all
is big, you need to use left join with dataset1
and group by to generate a is_in_all flag for each id in t1.
Here is the query:
WITH id_flag (
SELECT
dataset1.id,
COUNT(all.id) > 0 as is_in_all
dataset1
LEFT JOIN
dataset1.id = all.id
GROUP BY dataset1.id
SELECT
t1.id,
year,
month,
(CASE WHEN t2.id is null then false else true end) as id_present,
(CASE WHEN ((NOT is_in_all))
((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL '12' MONTH)))))
THEN 'Action1'
WHEN ((NOT is_in_all))
AND (count > 0)
AND (click_date < (current_date - INTERVAL '12' MONTH)))
THEN 'Action2'
ELSE 'Action3' END
) as actions
FROM dataset1 t1
LEFT JOIN current t2
ON t1.id = t2.id
JOIN id_flag
t1.id = id_flag.id
–
–
You could try to use this query. It's a modified version of your query, I've moved sub-queries in your CASE WHEN
to JOIN
clause, and replace NOT IN
with NOT EXISTS
expression.
SELECT
t1.id,
t1.year,
t1.day,
t1.month,
(CASE WHEN t2.id IS NULL THEN false ELSE true END) AS id_present,
(CASE WHEN (t3.id IS NOT NULL
((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL '12' MONTH)))))
THEN 'Action1'
WHEN (t3.id IS NOT NULL
AND (t1.count > 0)
AND (t1.click_date < (current_date - INTERVAL '12' MONTH)))
THEN 'Action2'
ELSE 'Action3' END) actions
FROM dataset1 t1
LEFT JOIN current t2 ON t1.id = t2.id
LEFT JOIN
(SELECT id
FROM dataset1 t
WHERE NOT EXISTS
(SELECT 1
FROM all a
WHERE t.id = a.id AND a.dates = current_date - Interval '2' day)
) t3 ON t1.id = t3.id
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.