Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
Error being faced in PySpark:
pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"
Data strucutre:
-- result_set: struct (nullable = true)
| |-- currency: string (nullable = true)
| |-- dates: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- date: string (nullable = true)
| | | |-- trackers: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- countries: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- country: string (nullable = true)
| | | | | | | |-- os_names: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- kpi_values: array (nullable = true)
| | | | | | | | | | |-- element: double (containsNull = true)
| | | | | | | | | |-- os_name: string (nullable = true)
| | | | | |-- token: string (nullable = true)
| |-- name: string (nullable = true)
| |-- token: string (nullable = true)
I am trying to create a view to show currency, date, and token:
df.select('result_set.currency', 'result_set.dates.date', 'result_set.dates.trackers.token').show()
Sample of data:
"result_set": {
"token": "abcdef",
"name": "Facebook",
"currency": "EUR",
"dates": [
"date": "2020-03-11",
"trackers": [
"token": "12345",
"countries": [
"country": "am",
"os_names": [
"os_name": "android",
"kpi_values": [
I am trying to create a view based on a few of these levels within the json data.
Update:
Duplicating token
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
selectExpr("currency","date","token", "explode(countries)").\
select("currency","date","token", "col.*").\
selectExpr("currency","date","token", "country", "explode(os_names)").\
select("currency","date","token", "country", "col.*").\
selectExpr("currency","date","token", "country", "os_name", "explode(kpi_values)").\
show(20)
After doing some explodes, now the token repeats 8 times.
As you are accessing array of structs
we need to give which element from array we need to access i.e 0,1,2..
etc.
if we need to select all elements of array then we need to use explode()
.
Example:
df.printSchema()
#root
# |-- result_set: struct (nullable = true)
# | |-- currency: string (nullable = true)
# | |-- dates: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- date: string (nullable = true)
# | | | |-- trackers: array (nullable = true)
# | | | | |-- element: struct (containsNull = true)
# | | | | | |-- countries: array (nullable = true)
# | | | | | | |-- element: struct (containsNull = true)
# | | | | | | | |-- country: string (nullable = true)
# | | | | | | | |-- os_names: array (nullable = true)
# | | | | | | | | |-- element: struct (containsNull = true)
# | | | | | | | | | |-- kpi_values: array (nullable = true)
# | | | | | | | | | | |-- element: long (containsNull = true)
# | | | | | | | | | |-- os_name: string (nullable = true)
# | | | | | |-- token: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- token: string (nullable = true)
#accessing token,date from array
df.selectExpr('result_set.dates.trackers[0].token','result_set.currency', 'result_set.dates.date').show()
#+--------------------------------------------------+--------+------------+
#|result_set.dates.trackers AS trackers#194[0].token|currency| date|
#+--------------------------------------------------+--------+------------+
#| [12345]| EUR|[2020-03-11]|
#+--------------------------------------------------+--------+------------+
#accessing first elements from dates, trackers array and extracting date,token values
df.selectExpr('result_set.dates[0].trackers[0].token as token','result_set.currency', 'result_set.dates[0].date as date').show()
#+-----+--------+----------+
#|token|currency| date|
#+-----+--------+----------+
#|12345| EUR|2020-03-11|
#+-----+--------+----------+
#if you need to select all elements of array then we need to explode the array and select the data
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
select("currency","date","token").\
show()
#+--------+----------+-----+
#|currency| date|token|
#+--------+----------+-----+
#| EUR|2020-03-11|12345|
#+--------+----------+-----+
–
–
–
–
–
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.