parse array of dictionaries from JSON with Spark

Desired result, as a Spark DataFrame:
+----------+-----+------+-----------+
|       day|spend|clicks|impressions|
+----------+-----+------+-----------+
|2018-06-11| 84.0|   428|      14778|
|2018-06-12| 10.0|    18|       1778|
+----------+-----+------+-----------+
In regular Python I can just do this:
response = requests.get(url).json()
df = pd.DataFrame(response['data'])
But the solution must work in AWS Glue, and Pandas is unwelcome there. A solid day's searching has been fruitless. Some highlights:
Many suggest parallelizing it first, then turning the RDD into a dataframe:
response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()
But that results in:
  TypeError: Can not infer schema for type: 
Others say this should bear fruit:
response = requests.get(url)
df = sqlContext.createDataFrame([json.loads(line) for line in response.iter_lines()])
But it results in this dataframe, which resists all attempts to parse:
 |-- data: array (nullable = true) 
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- page: long (nullable = true)
 |-- total_pages: long (nullable = true)
+--------------------+----+-----------+
|                data|page|total_pages|
+--------------------+----+-----------+
|[Map(impressions ...|   1|          1|
+--------------------+----+-----------+
It looks like you have almost solved it. In the snippet
response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()
response is a nested JSON object (a Python dict), so parallelizing it does not give Spark one element per record, and toDF() cannot infer a row schema from what it receives. As far as I can tell, you don't need a schema for the entire response object; what you are after is just its data field.
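This is easy to check in plain Python, without Spark. Iterating over a dict yields only its keys, and, as far as I know, sc.parallelize materializes its argument by iterating over it, so the RDD would end up holding three strings rather than rows. The payload below is a hypothetical stand-in shaped like the schema in the question, not the real API response:

```python
# Hypothetical payload, shaped like the response described in the question.
response = {
    "data": [
        {"day": "2018-06-11", "spend": 84.0, "clicks": 428, "impressions": 14778},
        {"day": "2018-06-12", "spend": 10.0, "clicks": 18, "impressions": 1778},
    ],
    "page": 1,
    "total_pages": 1,
}

# Iterating the top-level dict gives its keys (plain strings), not records.
top_level_items = list(response)

# Iterating response["data"] gives one dict per row, which is what toDF() can use.
records = list(response["data"])

print(top_level_items)
print(records[0])
```

An RDD of bare strings has no fields to build columns from, which would explain the "Can not infer schema" error.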
So the snippet below loads the data in the expected format:
response = requests.get(url).json()
rdd = sc.parallelize(response['data'])
df = rdd.toDF()
The same issue affects your second approach.
Also, when building the DataFrame with createDataFrame, there is no need to call json.loads on each line: response.json() has already decoded the body into Python dicts.
The snippet below should work:
response = requests.get(url).json()
df = sqlContext.createDataFrame([line for line in response['data']])
First suggestion needed adjusting to rdd.toDF(sampleRatio=0.5), but that's due to aspects of the source data that I left out of the question to keep things simple, so thanks! The second gives TypeError: 'Response' object has no attribute '__getitem__' but whatever.
– user1315792
Aug 22, 2018 at 16:02
Second solution had missing .json(). I just edited the answer. Thanks for pointing it out!
– code
Aug 22, 2018 at 18:40