避免在Spark中解析json子字段

0 人关注

我有一个带有复杂模式的json文件(见下文),我正在用Spark读取。我发现有些字段在源数据中是重复的,因此Spark在读取过程中抛出了一个错误(如预期)。重复的名字是在 storageidlist 字段下。我想做的是将 storageidlist 字段作为一个未解析的字符串加载到一个字符串类型的列中,之后再手动解析。这在Spark中可能吗?

|-- errorcode: string (nullable = true ) |-- errormessage: string (nullable = true ) |-- ip: string (nullable = true ) |-- label: string (nullable = true ) |-- status: string (nullable = true ) |-- storageidlist: array (nullable = true ) | |-- element: struct (containsNull = true ) | | |-- errorcode: string (nullable = true ) | | |-- errormessage: string (nullable = true ) | | |-- fedirectorList: array (nullable = true ) | | | |-- element: struct (containsNull = true ) | | | | |-- directorId: string (nullable = true ) | | | | |-- errorcode: string (nullable = true ) | | | | |-- errordesc: string (nullable = true ) | | | | |-- metrics: string (nullable = true ) | | | | |-- portMetricDataList: array (nullable = true ) | | | | | |-- element: array (containsNull = true ) | | | | | | |-- element: struct (containsNull = true ) | | | | | | | |-- data: array (nullable = true ) | | | | | | | | |-- element: struct (containsNull = true ) | | | | | | | | | |-- ts: string (nullable = true ) | | | | | | | | | |-- value: string (nullable = true ) | | | | | | | |-- errorcode: string (nullable = true ) | | | | | | | |-- errordesc: string (nullable = true ) | | | | | | | |-- metricid: string (nullable = true ) | | | | | | | |-- portid: string (nullable = true ) | | | | | | | |-- status: string (nullable = true ) | | | | |-- status: string (nullable = true ) | | |-- metrics: string (nullable = true ) | | |-- status: string (nullable = true ) | | |-- storageGroupList: string (nullable = true ) | | |-- storageid: string (nullable = true ) |-- sublabel: string (nullable = true ) |-- ts: string (nullable = true )
json
apache-spark
schema
tothsa
tothsa
发布于 2021-11-17
1 个回答
Neethu Lalitha
Neethu Lalitha
发布于 2021-11-17
0 人赞同

其中一个选择是为这个JSON对象创建一个Java类。 这样,你就可以读取输入的JSON,而Spark不会在读取过程中抛出一个错误。只要你定义的模式与输入模式相匹配,就允许重复。

    spark.read()
            .schema(Encoders.bean(YourPOJO.class).schema())
            .option("encoding", "UTF-8")
            .option("mode", "FAILFAST")