
I am new to PySpark. Can anyone help me read JSON data using PySpark? Here is what we have done:

(1) main.py

import os.path
from pyspark.sql import SparkSession

def fileNameInput(filename, spark):
    try:
        if os.path.isfile(filename):
            loadFileIntoHdfs(filename, spark)
        else:
            print("File does not exist")
    except OSError:
        print("Error while finding file")

def loadFileIntoHdfs(fileName, spark):
    df = spark.read.json(fileName)
    df.show()

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    file_name = input("Enter file location : ")
    fileNameInput(file_name, spark)

When I run the above code, it throws this error message:

 File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column

Thanks in advance

{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] } – Prashant Patel Mar 22, 2018 at 11:45

Your JSON works in my pyspark. I can get a similar error when the record text goes across multiple lines. Please ensure that each record fits in one line. Alternatively, tell it to support multi-line records:

spark.read.json(filename, multiLine=True)
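(Context for why this helps: by default Spark reads JSON as JSON Lines, i.e. one complete record per line, parsed independently. A minimal sketch of the difference using only Python's standard library, no Spark needed:)

```python
import json

# JSON Lines: each line is a complete, self-contained JSON record.
# This is what spark.read.json expects by default.
json_lines = (
    '{"firstName": "John", "lastName": "Doe"}\n'
    '{"firstName": "Anna", "lastName": "Smith"}'
)
records = [json.loads(line) for line in json_lines.splitlines()]
print([r["firstName"] for r in records])  # ['John', 'Anna']

# A pretty-printed document spans several lines, so parsing it
# line by line fails -- analogous to Spark producing _corrupt_record.
pretty = json.dumps({"firstName": "John"}, indent=2)
try:
    [json.loads(line) for line in pretty.splitlines()]
except json.JSONDecodeError:
    print("per-line parse failed; Spark needs multiLine=True here")
```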

What works:

{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

That outputs:

spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()

root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)

When I try some input like this:

"employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

Then I get the corrupt record in schema:

|-- _corrupt_record: string (nullable = true)

But when read with the multiLine option, the latter input works too.

you are my hero, spent ages trying to figure this out - strangely using .option('multiline', 'true') didn't work for me! – Umar.H Dec 17, 2019 at 17:39
