tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using the following code:

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
  • Why do the schema elements get sorted? I want the elements in the same order as they appear in the JSON.

  • The data type integer has been converted into StringType after the schema has been derived from the JSON. How do I retain the data type?

  • Why do the schema elements get sorted? I want the elements in the same order as they appear in the JSON.

    Because the order of fields is not guaranteed. While it is not explicitly stated, it becomes obvious when you take a look at the examples provided in the JSON reader docstring. If you need a specific ordering you can provide the schema manually:

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("TICKET", StringType(), True),
        StructField("TRANFERRED", StringType(), True),
        StructField("ACCOUNT", StringType(), True),
    ])
    df2 = sqlContext.read.json("tbschema.json", schema)
    df2.printSchema()
     |-- TICKET: string (nullable = true)
     |-- TRANFERRED: string (nullable = true)
     |-- ACCOUNT: string (nullable = true)
      

    The data type integer has been converted into StringType after the schema has been derived from the JSON. How do I retain the data type?

    The data type of the JSON field TICKET is string, hence the JSON reader returns a string. It is a JSON reader, not some kind of schema reader.
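
    As a quick illustration (a hypothetical in-memory example using the shell's SparkContext sc, not the author's file), the reader infers types from the JSON values themselves: a numeric literal becomes a numeric column, while the quoted word "integer" is just a string:

    # Hypothetical in-memory JSON documents, only to show that inference follows the values
    df_num = sqlContext.read.json(sc.parallelize(['{"TICKET": 123}']))
    df_num.printSchema()
    #  |-- TICKET: long (nullable = true)

    df_str = sqlContext.read.json(sc.parallelize(['{"TICKET": "integer"}']))
    df_str.printSchema()
    #  |-- TICKET: string (nullable = true)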

    Generally speaking, you should consider a proper format that comes with schema support out of the box, for example Parquet, Avro or Protocol Buffers. But if you really want to play with JSON you can define a poor man's "schema" parser like this:

    from collections import OrderedDict
    import json

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    with open("./tbschema.json") as fr:
        ds = fr.read()

    # Parse the schema description, preserving the order of the fields
    items = (json
      .JSONDecoder(object_pairs_hook=OrderedDict)
      .decode(ds)[0].items())

    # Map the type names used in the file to Spark types; extend as needed
    mapping = {"string": StringType, "integer": IntegerType}

    schema = StructType([
        StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])
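
    The derived schema can then be passed to the reader when loading the actual data, for example (tbdata.json is a hypothetical data file that follows this schema):

    # Hypothetical data file; the parsed schema fixes both the field order and the types
    df = sqlContext.read.json("tbdata.json", schema)
    df.printSchema()
    #  |-- TICKET: integer (nullable = true)
    #  |-- TRANFERRED: string (nullable = true)
    #  |-- ACCOUNT: string (nullable = true)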
    

    The problem with JSON is that there is really no guarantee regarding field ordering whatsoever, not to mention the handling of missing fields, inconsistent types and so on. So whether a solution like the one above is acceptable really depends on how much you trust your data.

    Alternatively, you can use the built-in schema import/export utilities.
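
    For example, a minimal sketch of that approach (the file name is arbitrary): a StructType serializes itself to JSON with its json() method and can be rebuilt later with StructType.fromJson:

    import json
    from pyspark.sql.types import StructType

    # Export the schema to a JSON file (arbitrary file name)
    with open("tbschema-spark.json", "w") as fw:
        fw.write(schema.json())

    # ...and import it back later
    with open("tbschema-spark.json") as fr:
        schema2 = StructType.fromJson(json.load(fr))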

    Interesting solution. Any particular reason the () after IntegerType() was moved to the creation of the StructField, and not into your map? That would allow you to do a DecimalType(x, y) .. thanks though, really helpful! – m1nkeh Nov 13, 2019 at 11:39
