I need to implement a function similar to Redshift's listagg() within group (order by x_column), but it has to be in Spark SQL, using the functions listed here:
https://spark.apache.org/docs/2.4.0/api/sql/
There is this similar question, but the answer is not in SQL.
My query in Redshift SQL would be:
select KEY,
       listagg(CODE, '-') within group (order by DATE) as CODE
from demo_table
group by KEY
At this point the order by clause is not essential; aggregating all the values per group would be enough. I have tried concat_ws and it does not work as expected.
A PySpark-only solution won't work for me; it has to be Spark SQL.
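For illustration, here is a minimal sketch of the behaviour I need, with hypothetical data (the values and the date format are made up for the example):

    -- demo_table (hypothetical input)
    KEY   CODE   DATE
    1     a      1/1/2020
    1     b      2/1/2020
    2     c      1/1/2020

    -- expected output
    KEY   CODE
    1     a-b
    2     c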
select KEY,
       array_join(             -- concatenate the array into one string
         collect_list(CODE),   -- aggregate that collects CODE values into an array
         '-'                   -- delimiter
       ) as CODE
from demo_table
group by KEY
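Note that collect_list on its own makes no ordering guarantee. If the order by DATE ends up mattering, a minimal sketch of one way to get it, assuming Spark 2.4+ (for the higher-order functions array_sort and transform) and assuming a 'M/d/yyyy' string date format, is to collect (date, code) structs, sort them, and keep only the code:

select KEY,
       array_join(
         transform(
           -- sort by parsed date (first struct field); 'M/d/yyyy' is an assumed format
           array_sort(collect_list(named_struct('D', to_date(DATE, 'M/d/yyyy'), 'CODE', CODE))),
           x -> x.CODE          -- keep only CODE after sorting
         ),
         '-'
       ) as CODE
from demo_table
group by KEY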
The query below works and includes the order by as well. The windowed collect_list builds a growing list of codes per key (the default window frame runs up to the current row), and since each joined string is a prefix of the next, max(code) keeps the complete one. Note that the date format must be 'M/d/yyyy'; in Java date patterns, uppercase 'M' is month while lowercase 'm' means minutes. Please check.
spark.sql("""select key, max(code) as code
from (select key,
             -- running list of codes per key, ordered by parsed date
             array_join(collect_list(code) over (partition by key order by to_date(date, 'M/d/yyyy')), '-') as code
      from view) t
group by key""").show(100)