I have a JSON file that looks like this:

{"id":{"0":0,"1":1,"2":2,"3":3}, "name":{"0":"name0","1":"name1","2":"name2","3":"name3"}}

When I read it using PySpark like:

names = spark.read.json('data/names.json')

All the values end up in a single row, like this:

|            id|                name|
+--------------+--------------------+
|{0, 1, 2, 3...|{name1, name2, name3...

How can I read it so that the values are on multiple rows?

A quick hack is to read the JSON with pandas, pandas_df = pandas.read_json('data/names.json'), and then load it into Spark with spark_df = spark.createDataFrame(pandas_df). For a more comprehensive analysis of the problem, check this.
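To see why the pandas route works: the file is column-oriented, so the outer keys are column names and the inner keys ("0", "1", ...) are row labels. Here is a minimal sketch of that pivot in plain Python, assuming the exact file content from the question. The helper name columns_to_rows is made up for illustration and is not part of pandas or Spark.

```python
import json

# Column-oriented JSON from the question: outer keys are column names,
# inner keys are row labels.
raw = '{"id":{"0":0,"1":1,"2":2,"3":3}, "name":{"0":"name0","1":"name1","2":"name2","3":"name3"}}'

def columns_to_rows(text):
    """Pivot {column: {row_key: value}} into a list of row dicts (hypothetical helper)."""
    columns = json.loads(text)
    # Collect every row key seen in any column, preserving first-seen order.
    row_keys = []
    for mapping in columns.values():
        for key in mapping:
            if key not in row_keys:
                row_keys.append(key)
    # A key missing from a column yields None, like a null cell.
    return [{col: mapping.get(key) for col, mapping in columns.items()}
            for key in row_keys]

rows = columns_to_rows(raw)
for row in rows:
    print(row)
```

A list of dicts in this shape can be passed to spark.createDataFrame(rows) if you would rather not depend on pandas. Like the pandas hack, this loads the whole file on the driver, so it only makes sense for small files.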

This is an alternative, more native Spark solution.

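First, read both fields as maps, then use explode_outer to turn the id map into one row per entry, and finally look up the corresponding name value by the same key.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, MapType, StringType, IntegerType

# Read each top-level field as a map of row key -> value
schema = StructType([
  StructField('id', MapType(StringType(), IntegerType())),
  StructField('name', MapType(StringType(), StringType()))
])
df = spark.read.json('data/names.json', schema=schema)
# One row per id entry; look up name by the map key (id_k), not the id value
df = (df.select(F.explode_outer('id').alias('id_k', 'id_v'), 'name')
    .withColumn('name', F.col('name')[F.col('id_k')]))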
