
I am getting this error in PySpark:

pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"

Data structure:

root
 |-- result_set: struct (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- dates: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- trackers: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- countries: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- country: string (nullable = true)
 |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
 |    |    |    |    |    |-- token: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- token: string (nullable = true)

I am trying to create a view to show currency, date, and token:

df.select('result_set.currency', 'result_set.dates.date', 'result_set.dates.trackers.token').show()

Sample of data:

"result_set": {
        "token": "abcdef",
        "name": "Facebook",
        "currency": "EUR",
        "dates": [{
                "date": "2020-03-11",
                "trackers": [{
                        "token": "12345",
                        "countries": [{
                                "country": "am",
                                "os_names": [{
                                        "os_name": "android",
                                        "kpi_values": [
I am trying to create a view based on a few of these levels within the json data.

Update:

The token is being duplicated:

df.selectExpr('result_set.currency','explode(result_set.dates)').\
select("*","col.*").\
selectExpr("explode(trackers)","*").\
selectExpr("currency","date","explode(trackers)").\
select("currency","date","col.*").\
selectExpr("currency","date","token", "explode(countries)").\
select("currency","date","token", "col.*").\
selectExpr("currency","date","token", "country", "explode(os_names)").\
select("currency","date","token", "country", "col.*").\
selectExpr("currency","date","token", "country", "os_name", "explode(kpi_values)").\
show(20)

After these explodes, the token now repeats 8 times.
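The duplication comes from exploding `trackers` twice: `selectExpr("explode(trackers)","*")` produces one row per tracker while also carrying the original `trackers` array forward via `*`, and the next `explode(trackers)` expands that carried-along array again, giving a cross product. A plain-Python analogy (list comprehensions standing in for `explode`; the tracker values are hypothetical) illustrates the effect:

```python
# Each explode is like a flat cross-product: one output row per array element.
trackers = ["t1", "t2"]  # hypothetical array with two trackers

# First explode keeps the original array alongside the exploded element ("col", "*"):
step1 = [(t, trackers) for t in trackers]

# Second explode expands the carried-along array again -> cross product:
step2 = [(t, u) for (t, arr) in step1 for u in arr]

print(len(step1))  # 2 rows after the first explode
print(len(step2))  # 4 rows: every token now appears twice
```

With each redundant explode the factor multiplies, which is how a token ends up repeated 8 times: drop the extra `explode(trackers)` step (or don't re-select the original array with `*`) and you get one row per leaf element.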

Since you are accessing an array of structs, you need to specify which element of the array to access, i.e. 0, 1, 2, etc.

  • If you need to select all elements of the array, use explode().
  • Example:

    df.printSchema()
    #root
    # |-- result_set: struct (nullable = true)
    # |    |-- currency: string (nullable = true)
    # |    |-- dates: array (nullable = true)
    # |    |    |-- element: struct (containsNull = true)
    # |    |    |    |-- date: string (nullable = true)
    # |    |    |    |-- trackers: array (nullable = true)
    # |    |    |    |    |-- element: struct (containsNull = true)
    # |    |    |    |    |    |-- countries: array (nullable = true)
    # |    |    |    |    |    |    |-- element: struct (containsNull = true)
    # |    |    |    |    |    |    |    |-- country: string (nullable = true)
    # |    |    |    |    |    |    |    |-- os_names: array (nullable = true)
    # |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
    # |    |    |    |    |    |    |    |    |    |-- kpi_values: array (nullable = true)
    # |    |    |    |    |    |    |    |    |    |    |-- element: long (containsNull = true)
    # |    |    |    |    |    |    |    |    |    |-- os_name: string (nullable = true)
    # |    |    |    |    |    |-- token: string (nullable = true)
    # |    |-- name: string (nullable = true)
    # |    |-- token: string (nullable = true)
    #accessing token,date from array
    df.selectExpr('result_set.dates.trackers[0].token','result_set.currency', 'result_set.dates.date').show()
    #+--------------------------------------------------+--------+------------+
    #|result_set.dates.trackers AS trackers#194[0].token|currency|        date|
    #+--------------------------------------------------+--------+------------+
    #|                                           [12345]|     EUR|[2020-03-11]|
    #+--------------------------------------------------+--------+------------+
    #accessing first elements from dates, trackers array and extracting date,token values
    df.selectExpr('result_set.dates[0].trackers[0].token as token','result_set.currency', 'result_set.dates[0].date as date').show()
    #+-----+--------+----------+
    #|token|currency|      date|
    #+-----+--------+----------+
    #|12345|     EUR|2020-03-11|
    #+-----+--------+----------+
    #if you need to select all elements of the array, explode the array and then select the data
    df.selectExpr('result_set.currency','explode(result_set.dates)').\
    select("*","col.*").\
    selectExpr("explode(trackers)","*").\
    selectExpr("currency","date","explode(trackers)").\
    select("currency","date","col.*").\
    select("currency","date","token").\
    show()
    #+--------+----------+-----+
    #|currency|      date|token|
    #+--------+----------+-----+
    #|     EUR|2020-03-11|12345|
    #+--------+----------+-----+
    – notNull (Mar 12, 2020 at 2:08): @dataviews, try this link: spark.apache.org/docs/2.4.4/api/sql/index.html, which documents all the Spark SQL functions!
    – dataviews (Mar 12, 2020 at 2:32): How could I access all of the nested arrays? Is there a way to use a loop for the arrays that I'm indexing? @Shu
    – notNull (Mar 12, 2020 at 2:51): @dataviews, built-in Spark doesn't have that function yet! Refer to these links for dynamic flattening: stackoverflow.com/questions/37471346/… stackoverflow.com/questions/34271398/… stackoverflow.com/questions/46178325/…
    – dataviews (Mar 12, 2020 at 3:05): This is one of the data structures I'm trying to unpack: [[[zz, [[[1347.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 7.0E-4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], android]]]],
    – notNull (Mar 12, 2020 at 3:24): @dataviews, if it is an array type then try explode, or post it as a new question so that it is visible to all members of the community and you can get help quickly!
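On the "can I use a loop for the arrays I'm indexing?" question above: the expression strings passed to selectExpr are just strings, so one option is to generate the indexed paths programmatically. A sketch (the helper name and the array lengths are hypothetical; in practice you would inspect df.schema or the data to know how many elements each array holds):

```python
def indexed_paths(base, arrays, leaf):
    """Yield dotted paths like 'result_set.dates[0].trackers[1].token' for
    every index combination. `arrays` is a list of (array_name, length)
    pairs in nesting order."""
    def walk(prefix, remaining):
        if not remaining:
            yield prefix + "." + leaf
            return
        (name, length), rest = remaining[0], remaining[1:]
        for i in range(length):
            yield from walk(f"{prefix}.{name}[{i}]", rest)
    yield from walk(base, arrays)

# One date, two trackers (hypothetical lengths):
exprs = list(indexed_paths("result_set", [("dates", 1), ("trackers", 2)], "token"))
print(exprs)
# ['result_set.dates[0].trackers[0].token', 'result_set.dates[0].trackers[1].token']
```

The resulting strings can then be fed straight to `df.selectExpr(*exprs)`. This sidesteps explode entirely, at the cost of needing to know (or bound) the array lengths up front.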
            
