+----------+-----+------+-----------+
|       day|spend|clicks|impressions|
+----------+-----+------+-----------+
|2018-06-11| 84.0|   428|      14778|
|2018-06-12| 10.0|    18|       1778|
+----------+-----+------+-----------+
In regular Python I can just do this:
response = requests.get(url).json()
df = pd.DataFrame(response['data'])
But the solution must work in AWS Glue, and Pandas is unwelcome there. A solid day's searching has been fruitless. Some highlights:
Many suggest parallelizing it first, then turning the RDD into a dataframe:
response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()
But that results in:
TypeError: Can not infer schema for type:
Others say this should bear fruit:
response = requests.get(url)
df = sqlContext.createDataFrame([json.loads(line) for line in response.iter_lines()])
But it results in this dataframe, which resists all attempts to parse:
 |-- data: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- page: long (nullable = true)
 |-- total_pages: long (nullable = true)
+--------------------+----+-----------+
|                data|page|total_pages|
+--------------------+----+-----------+
|[Map(impressions ...|   1|          1|
+--------------------+----+-----------+
It looks like you have almost solved it. In the snippet
response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()
the response is a nested JSON object, not a list of records. Parallelizing it most likely distributes only its top-level keys, which are plain strings, so Spark cannot infer a schema from them. As far as I understand, you don't need the schema of the entire response object anyway. What you are looking for is just the data field of the response object.
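To see why the first attempt fails, it helps to look at what Python yields when it iterates over the response. The payload below is a made-up stand-in shaped like the response in the question (its values are copied from the expected table), and the explanation is an assumption, since the error message in the question is cut off: an RDD built from the whole dict would most likely hold its keys rather than its records.

```python
# Hypothetical payload, shaped like the response described in the question.
sample_response = {
    "data": [
        {"day": "2018-06-11", "spend": "84.0", "clicks": "428", "impressions": "14778"},
        {"day": "2018-06-12", "spend": "10.0", "clicks": "18", "impressions": "1778"},
    ],
    "page": 1,
    "total_pages": 1,
}

# Iterating over a dict yields its keys, so these are the elements an RDD
# built from the whole response would most likely contain.
top_level_items = list(sample_response)

# The records themselves live under the 'data' key: a list of dicts,
# one per row of the expected table.
records = sample_response["data"]

print(top_level_items)
print(records[0])
```

Each element of records already looks like one row, which is the shape Spark needs.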
So the snippet below loads the data in the expected format:
response = requests.get(url).json()
rdd = sc.parallelize(response['data'])
df = rdd.toDF()
The same issue exists in your second approach. Also, when creating the DataFrame with createDataFrame, you don't need to call json.loads on each line again, because requests.get(url).json() already returns parsed dict objects.
The snippet below should work:
response = requests.get(url).json()
df = sqlContext.createDataFrame([line for line in response['data']])
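If the column types matter, one option is to cast each record in plain Python and pass an explicit schema, so Spark does not have to infer anything from dictionaries. This is only a sketch under a few assumptions: the sample payload is made up to mirror the table in the question, the helper names `to_rows` and `to_dataframe` are hypothetical, the API is assumed to return every value as a string (as the map of string to string in the question's schema suggests), and the `spark` argument stands for whatever session or sqlContext the Glue job provides.

```python
import datetime

# Hypothetical payload, shaped like the response described in the question.
sample_response = {
    "data": [
        {"day": "2018-06-11", "spend": "84.0", "clicks": "428", "impressions": "14778"},
        {"day": "2018-06-12", "spend": "10.0", "clicks": "18", "impressions": "1778"},
    ],
    "page": 1,
    "total_pages": 1,
}


def to_rows(response):
    """Cast each record under 'data' to a typed tuple in a fixed column order."""
    return [
        (
            datetime.datetime.strptime(rec["day"], "%Y-%m-%d").date(),
            float(rec["spend"]),
            int(rec["clicks"]),
            int(rec["impressions"]),
        )
        for rec in response["data"]
    ]


def to_dataframe(spark, response):
    """Build a DataFrame from the typed rows with an explicit schema.

    `spark` is assumed to be a SparkSession or SQLContext; both expose
    createDataFrame(data, schema).
    """
    # Imported here so that to_rows can be tried without Spark installed.
    from pyspark.sql.types import (
        DateType, DoubleType, IntegerType, StructField, StructType,
    )

    schema = StructType([
        StructField("day", DateType(), True),
        StructField("spend", DoubleType(), True),
        StructField("clicks", IntegerType(), True),
        StructField("impressions", IntegerType(), True),
    ])
    return spark.createDataFrame(to_rows(response), schema)


# Only the pure-Python step runs here; no Spark is needed for it.
rows = to_rows(sample_response)
```

Inside the job, something like to_dataframe(sqlContext, response) would then replace the inferred-schema call. The trade-off is that the column list is hard-coded, so a new field in the API response has to be added by hand.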