


Property Name: Meaning
partitionColumn, lowerBound, upperBound: These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
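import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Reads a table over JDBC; the read is split into partitions only when partitionColumn is given.
public static Dataset<Row> sparkLoad(SparkSession spark, String url, String fullTable,
        String partitionColumn, long lowerBound, long upperBound, int numPartitions) {
    DataFrameReader reader = spark.read().format("jdbc").option("url", url)
            .option("dbtable", fullTable).option("user", "postgres")
            .option("password", "webgis327");
    if (partitionColumn != null) {
        // These four options must be set together for a parallel read.
        reader = reader.option("partitionColumn", partitionColumn)
                .option("lowerBound", lowerBound)
                .option("upperBound", upperBound)
                .option("numPartitions", numPartitions);
    }
    return reader.load();
}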


Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned.
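To make the "stride, not filter" point concrete, here is a minimal sketch of which partition a given non-null value would land in. The class `PartitionLookup` and the method `partitionFor` are hypothetical names used only for illustration, and the stride arithmetic is a simplified assumption rather than a copy of Spark's source. Values below lowerBound fall into the first partition and values above upperBound fall into the last one, so nothing is dropped.

```java
public class PartitionLookup {

    // Hypothetical helper: index of the partition a non-null value falls into,
    // assuming a simple integer stride (a simplification, not Spark's exact code).
    static int partitionFor(long value, long lowerBound, long upperBound, int numPartitions) {
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        if (numPartitions < 2 || stride <= 0) {
            throw new IllegalArgumentException("need at least 2 partitions and a positive stride");
        }
        // Everything below the first boundary belongs to the first partition,
        // including values smaller than lowerBound.
        if (value < lowerBound + stride) {
            return 0;
        }
        // Everything past the last boundary is clamped into the last partition,
        // including values larger than upperBound.
        long index = (value - lowerBound) / stride;
        return (int) Math.min(index, numPartitions - 1);
    }

    public static void main(String[] args) {
        long[] samples = {-5, 15, 95, 1000};
        for (long v : samples) {
            System.out.println(v + " -> partition " + partitionFor(v, 0, 100, 10));
        }
    }
}
```

With lowerBound 0, upperBound 100 and 10 partitions, the values -5 and 1000 lie outside the bounds but are still assigned to the first and last partition respectively.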



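With ten partitions and a stride of 10 (for example, lowerBound 0 and upperBound 100), the generated queries look like this:
First partition: select * from tablename where id < 10 or id is null;
Second partition: select * from tablename where id >= 10 and id < 20;
Third partition: select * from tablename where id >= 20 and id < 30;
...
Tenth partition: select * from tablename where id >= 90;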
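The per-partition WHERE clauses can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Spark's actual implementation: the class `PartitionPredicates` and the method `predicates` are hypothetical names, and the stride is computed with plain integer division. The first partition has no lower limit and also picks up NULL values, and the last partition has no upper limit, which is why no rows are filtered out.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionPredicates {

    // Hypothetical helper: builds one WHERE clause per partition from the
    // partition column, the two bounds, and the partition count.
    static List<String> predicates(String column, long lowerBound, long upperBound, int numPartitions) {
        if (numPartitions < 2 || lowerBound >= upperBound) {
            throw new IllegalArgumentException("need at least 2 partitions and lowerBound < upperBound");
        }
        // Simplified stride; assumed here for illustration.
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        List<String> result = new ArrayList<>();
        long current = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit, the last has no upper limit.
            String lower = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upper = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lower == null) {
                result.add(upper + " or " + column + " is null");
            } else if (upper == null) {
                result.add(lower);
            } else {
                result.add(lower + " AND " + upper);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String p : predicates("id", 0, 100, 10)) {
            System.out.println("select * from tablename where " + p + ";");
        }
    }
}
```

Calling `predicates("id", 0, 100, 10)` yields ten clauses, starting with `id < 10 or id is null` and ending with `id >= 90`.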


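How to speed up reading data. When Spark SQL reads from a database, a large table can take a long time to load, so how can the read be parallelized to make it faster? Spark SQL can read a table in chunks by specifying read options such as partitionColumn, lowerBound, upperBound, and numPartitions.
How to understand partitionColumn, lowerBound, upperBound, and numPartitions in Spark SQL. Spark SQL can read a table in chunks when these four read options are specified; in short, the read runs in parallel. The official Spark SQL documentation describes what each of the four options means.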
