python - cannot resolve column due to data type mismatch PySpark

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
Error being faced in PySpark:
pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"
Data strucutre:
-- result_set: struct (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- dates: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- trackers: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- countries: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- country: string (nullable = true)
 |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
 |    |    |    |    |    |-- token: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- token: string (nullable = true)
I am trying to create a view to show currency, date, and token:
df.select('result_set.currency', 'result_set.dates.date', 'result_set.dates.trackers.token').show()
Sample of data:
"result_set": {
        "token": "abcdef",
        "name": "Facebook",
        "currency": "EUR",
        "dates": [
                "date": "2020-03-11",
                "trackers": [
                        "token": "12345",
                        "countries": [
                                "country": "am",
                                "os_names": [
                                        "os_name": "android",
                                        "kpi_values": [
I am trying to create a view based on a few of these levels within the json data.
Update:
Duplicating token
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
selectExpr("currency","date","token", "explode(countries)").\
select("currency","date","token", "col.*").\
selectExpr("currency","date","token", "country", "explode(os_names)").\
select("currency","date","token", "country", "col.*").\
selectExpr("currency","date","token", "country", "os_name", "explode(kpi_values)").\
show(20)
After doing some explodes, now the token repeats 8 times.
As you are accessing array of structs we need to give which element from array we need to access i.e 0,1,2..etc.
if we need to select all elements of array then we need to use explode().
Example:
df.printSchema()
#root
# |-- result_set: struct (nullable = true)
# |    |-- currency: string (nullable = true)
# |    |-- dates: array (nullable = true)
# |    |    |-- element: struct (containsNull = true)
# |    |    |    |-- date: string (nullable = true)
# |    |    |    |-- trackers: array (nullable = true)
# |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |-- countries: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- country: string (nullable = true)
# |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
# |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
# |    |    |    |    |    |    |    |    |    |    |-- element: long (containsNull = true)
# |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
# |    |    |    |    |    |-- token: string (nullable = true)
# |    |-- name: string (nullable = true)
# |    |-- token: string (nullable = true)
#accessing token,date from array
df.selectExpr('result_set.dates.trackers[0].token','result_set.currency', 'result_set.dates.date').show()
#+--------------------------------------------------+--------+------------+
#|result_set.dates.trackers AS trackers#194[0].token|currency|        date|
#+--------------------------------------------------+--------+------------+
#|                                           [12345]|     EUR|[2020-03-11]|
#+--------------------------------------------------+--------+------------+
#accessing first elements from dates, trackers array and extracting date,token values
df.selectExpr('result_set.dates[0].trackers[0].token as token','result_set.currency', 'result_set.dates[0].date as date').show()
#+-----+--------+----------+
#|token|currency|      date|
#+-----+--------+----------+
#|12345|     EUR|2020-03-11|
#+-----+--------+----------+
#if you need to select all elements of array then we need to explode the array and select the data
df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
select("currency","date","token").\
show()
#+--------+----------+-----+
#|currency|      date|token|
#+--------+----------+-----+
#|     EUR|2020-03-11|12345|
#+--------+----------+-----+
                @dataviews, Try this link spark.apache.org/docs/2.4.4/api/sql/index.html documented all the spark functions!
– notNull
                Mar 12, 2020 at 2:08
                how could i access all of the nested arrays? is there a way to use a loop for the arrays that I'm indexing? @Shu
– dataviews
                Mar 12, 2020 at 2:32
                @dataviews, in built spark doesn't have that function yet!, refer to these links for dynamic flattening... stackoverflow.com/questions/37471346/… stackoverflow.com/questions/34271398/… stackoverflow.com/questions/46178325/…
– notNull
                Mar 12, 2020 at 2:51
                this is one of the data structures I'm trying to unpack: [[[zz, [[[1347.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 7.0E-4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], android]]]],
– dataviews
                Mar 12, 2020 at 3:05
                @dataviews, if it is an array type then try with explode (or) post as a new question, so that it will be visible to  all members of community to get helped quickly!
– notNull
                Mar 12, 2020 at 3:24
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.