I am new to PySpark and I have an AskReddit JSON file which I got from this link. I am trying to create an RDD, on which I then hope to perform operations such as map and flatMap. I was advised to get the JSON in jsonlines format, but despite using pip to install jsonlines, I am unable to import the package in the PySpark notebook. Below is what I have tried for reading in the JSON.
In [10]: import json
data = json.load(open("AskReddit.json", "r"))
jsonrdd = sc.parallelize(data)
jsonrdd.take(5)
Out[11]: [u'kind', u'data']
I also tried the following, which gives me the whole contents of the JSON file after doing jsonrdd.take(1).
In [6]: jsonrdd = sc.wholeTextFiles("*.json")
jsonrdd.take(1)
However, I would like to get each JSON object as one line in the RDD. How would I go about this?
You can use SparkSQL's read.json to read the file, like:
jdf = spark.read.json("path/to/AskReddit.json")
and perform all kinds of SQL-type operations on it, and even RDD-type operations. But the JSON structure is really deeply nested, with no fixed columns; the nested fields have to be pulled out with something like explode.
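A quick way to see how nested it is (just a sketch, assuming jdf was loaded as above) is to print the schema Spark inferred:

# inspect the nested schema inferred from the Reddit JSON
jdf.printSchema()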
You are better off using read.json and working with the DataFrame, like:
from pyspark.sql.functions import explode

# flatten a few top-level fields and explode the nested children array
jdf\
.withColumn('after', jdf.data.after)\
.withColumn('before', jdf.data.before)\
.withColumn('modhash', jdf.data.modhash)\
.withColumn('NestedKind', explode(jdf.data.children.kind))\
.withColumn('subreddit', explode(jdf.data.children.data.subreddit))\
.withColumn('clicked', explode(jdf.data.children.data.clicked))\
.show()
+--------------------+-------+---------+------+--------------------+----------+---------+-------+
|                data|   kind|    after|before|             modhash|NestedKind|subreddit|clicked|
+--------------------+-------+---------+------+--------------------+----------+---------+-------+
|[t3_66qv3r,null,W...|Listing|t3_66qv3r|  null|3r7ao0m7qiadae13d...|        t3|AskReddit|  false|
|[t3_66qv3r,null,W...|Listing|t3_66qv3r|  null|3r7ao0m7qiadae13d...|        t3|AskReddit|  false|
+--------------------+-------+---------+------+--------------------+----------+---------+-------+
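If the goal is still an RDD with one JSON object per element, one option (just a sketch, assuming the same jdf as above and that the posts sit under data.children) is to explode the children and serialize each resulting row back to a JSON string with toJSON():

from pyspark.sql.functions import explode

# one row per Reddit post, then one JSON string per RDD element
posts = jdf.select(explode(jdf.data.children.data).alias('post'))
posts_rdd = posts.toJSON()   # RDD of JSON strings
posts_rdd.take(1)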
Use the SparkContext library to read the file as text and then map each line to JSON.

import json
from pyspark import SparkContext

sc = SparkContext("local", "task")
# filename is the path to a newline-delimited JSON file
rddfile = sc.textFile(filename).map(lambda x: json.loads(x))
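For example, once each line has been parsed into a dict, you can chain map or flatMap on the result (a sketch, assuming the file is newline-delimited JSON whose top-level objects have the kind/data keys shown in the question):

# pull one field out of every parsed JSON object
kinds = rddfile.map(lambda obj: obj["kind"])
kinds.take(5)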
Assuming you are using Spark 2.0+, you can do the following:
df = spark.read.json(filename).rdd
Check out the documentation for pyspark.sql.DataFrameReader.json for more details. Note that this method expects JSON Lines format, i.e. newline-delimited JSON, which I believe is what you mention you have.
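If the file is actually a single pretty-printed JSON document rather than JSON Lines, and you are on Spark 2.2 or later, the multiLine option of read.json can handle it (a sketch with an assumed file path):

# read a JSON file whose records span multiple lines
df = spark.read.json("AskReddit.json", multiLine=True)
rdd = df.rdd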