Spark SQL的Parquet那些事儿开发者社区

Spark SQL的Parquet那些事儿

// Encoders for most common types are automatically provided by importing spark.implicits._import spark.implicits._
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
// DataFrames can be saved as Parquet files, maintaining the schema informationpeopleDF.write.parquet("people.parquet")
// Read in the parquet file created above// Parquet files are self-describing so the schema is preserved// The result of loading a Parquet file is also a DataFrameval parquetFileDF = spark.read.parquet("people.parquet")
// Parquet files can also be used to create a temporary view and then used in SQL statementsparquetFileDF.createOrReplaceTempView("parquetFile")val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")namesDF.map(attributes => "Name: " + attributes(0)).show()// +------------+// |       value|// +------------+// |Name: Justin|// +------------+

path└── to   └── table       ├── gender=male       │   ├── ...       │   │       │   ├── country=US       │   │   └── data.parquet       │   ├── country=CN       │   │   └── data.parquet       │   └── ...       └── gender=female           ├── ...           │           ├── country=US           │   └── data.parquet           ├── country=CN           │   └── data.parquet           └── ...

root|-- name: string (nullable = true)|-- age: long (nullable = true)|-- gender: string (nullable = true)|-- country: string (nullable = true)

通过数据源option设置mergeSchema为true。在全局sql配置中设置spark.sql.parquet.mergeSchema 为true.// This is used to implicitly convert an RDD to a DataFrame.import spark.implicits._
// Create a simple DataFrame, store into a partition directoryval squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")squaresDF.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,// adding a new column and dropping an existing columnval cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")cubesDF.write.parquet("data/test_table/key=2")