+----------+-----+------+-----------+
|       day|spend|clicks|impressions|
+----------+-----+------+-----------+
|2018-06-11| 84.0|   428|      14778|
|2018-06-12| 10.0|    18|       1778|
+----------+-----+------+-----------+

In regular python I can just do this:

response = requests.get(url).json()
df = pd.DataFrame(response['data'])

But the solution must work in AWS Glue, and Pandas is unwelcome there. A solid day's searching has been fruitless. Some highlights:

Many suggest parallelizing it first, then turning the RDD into a dataframe:

response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()

But that results in:

TypeError: Can not infer schema for type:

Others say this should bear fruit:

response = requests.get(url)
df = sqlContext.createDataFrame([json.loads(line) for line in response.iter_lines()])

But it results in this dataframe, which resists all attempts to parse:

 |-- data: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- page: long (nullable = true)
 |-- total_pages: long (nullable = true)

+--------------------+----+-----------+
|                data|page|total_pages|
+--------------------+----+-----------+
|[Map(impressions ...|   1|          1|
+--------------------+----+-----------+

It looks like you have almost solved it. In the snippet

response = requests.get(url).json()
rdd = sc.parallelize(response)
df = rdd.toDF()

The response is a nested JSON object, so Spark cannot infer a schema from it directly. Most likely, parallelizing the top-level dict distributes only its keys (plain strings) rather than rows. As I understand it, you don't need the schema of the entire response object, only its data field.

The snippet below loads the data in the expected format:

response = requests.get(url).json()
rdd = sc.parallelize(response['data'])
df = rdd.toDF()
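
To see why the first attempt fails, here is a minimal pure-Python sketch. It assumes that sc.parallelize materializes its argument by iterating over it, which is my understanding of PySpark but is not confirmed by the question. The sample_response payload and the elements_seen helper are hypothetical; the payload is only shaped like the schema printed in the question.

```python
# Hypothetical payload, shaped like the schema shown in the question
# (a "data" list of string maps plus paging fields).
sample_response = {
    "data": [
        {"day": "2018-06-11", "spend": "84.0", "clicks": "428", "impressions": "14778"},
        {"day": "2018-06-12", "spend": "10.0", "clicks": "18", "impressions": "1778"},
    ],
    "page": 1,
    "total_pages": 1,
}


def elements_seen(collection):
    """Return the items a consumer gets by iterating over `collection`.

    Illustrative helper only (not a Spark API): it mimics turning the
    argument into a list of elements before distributing it.
    """
    return list(collection)


# Iterating a dict yields its keys, so each "row" would be a bare string.
top_level = elements_seen(sample_response)

# Iterating the "data" list yields one dict per record, which is row-shaped.
records = elements_seen(sample_response["data"])

print(top_level)
print(records[0])
```

Under that assumption, the elements Spark receives from the whole response are strings with no fields to infer, while response['data'] supplies one dict per row.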

The same issue affects your second approach. Also, when you build the DataFrame with createDataFrame, you don't need to call json.loads on each line, because response.json() already returns parsed dict objects. The snippet below should work:

response = requests.get(url).json()
df = sqlContext.createDataFrame([line for line in response['data']])
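
If schema inference still proves fragile (for example when some fields are missing or arrive as strings), one option is to skip inference and pass an explicit schema. The following is only a sketch under assumptions: the column names and types are taken from the table in the question, the API is assumed to return values as strings (as the printed map schema suggests), and to_typed_rows, build_dataframe, and COLUMNS are hypothetical names. The session argument can be the sqlContext used above or a SparkSession.

```python
# Column order and Python-side casts, assumed from the table in the question.
COLUMNS = [("day", str), ("spend", float), ("clicks", int), ("impressions", int)]


def to_typed_rows(records):
    """Turn a list of dicts into tuples in COLUMNS order, casting each value.

    Missing or None values stay None, so they map to nullable columns.
    """
    rows = []
    for record in records:
        row = []
        for name, cast in COLUMNS:
            value = record.get(name)
            row.append(None if value is None else cast(value))
        rows.append(tuple(row))
    return rows


def build_dataframe(session, records):
    """Create a DataFrame with an explicit schema instead of inferring one.

    pyspark is imported here so that to_typed_rows can be used and tested
    without a Spark installation.
    """
    from pyspark.sql.types import (
        DoubleType, LongType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("day", StringType(), True),
        StructField("spend", DoubleType(), True),
        StructField("clicks", LongType(), True),
        StructField("impressions", LongType(), True),
    ])
    return session.createDataFrame(to_typed_rows(records), schema)


# Example input using the first row of the table from the question,
# plus a second record with missing fields.
example = [
    {"day": "2018-06-11", "spend": "84.0", "clicks": "428", "impressions": "14778"},
    {"day": "2018-06-12", "spend": "10.0", "clicks": None},
]
typed_rows = to_typed_rows(example)
```

With a live context this would be called as build_dataframe(sqlContext, response['data']). The explicit schema trades a little boilerplate for predictable column types.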
First suggestion needed adjusting to rdd.toDF(sampleRatio=0.5), but that's due to aspects of the source data that I left out of the question to keep things simple, so thanks! The second gives TypeError: 'Response' object has no attribute '__getitem__' but whatever.
– user1315792, Aug 22, 2018 at 16:02

Second solution had missing .json(). I just edited the answer. Thanks for pointing it out!
– code, Aug 22, 2018 at 18:40
        
