SPARK read.jdbc & custom schema - Stack Overflow

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
With spark.read.format ... once can add the custom schema non-programmatically, like so:
val df = sqlContext
.read()
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
.option("user", "root")
.option("password", "password")
.option("dbtable", sql)
.schema(customSchema)
.load();
However, using spark.read.jdbc, I cannot seem to do the same or find the syntax to do the same as for the above. What am i missing or has this changed in SPARK 2.x? I read this in the manual: ... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ...  Presumably what I am trying to do is no longer possible as in the above example. 
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
Seems a retrograde step in a certain way.
                In spark 2 you can pass schema as spark.read.format().schema .. However, spark.read.jdbc returns dataset. You can specify schema to DataFrameReader not to Dataset. A weird solution can be ..  RDD<Row> rdd = sparkSession.read().jdbc("","", null).rdd(); sparkSession.createDataFrame(rdd, schema);
– Sourav Gulati
                Jun 4, 2018 at 9:41
                It returns dataset only.I tried in Java though. you can also use createDataSet function and provide appropriate encoder
– Sourav Gulati
                Jun 4, 2018 at 9:49
                I am using scala and get in this case dataframe_mysql: org.apache.spark.sql.DataFrame = [k: int, v: int]
– thebluephantom
                Jun 4, 2018 at 9:59
I do not agree with the answer.
You can supply custom schema using your method or by setting properties:
 connectionProperties.put("customSchema", schemachanges);
Where schema changes in format "field Name" "New data type", ... :
 "key String, value DECIMAL(20, 0)"
If key was an number in original table, it will generate an SQL query like "key::character varying, value::numeric(20, 0)"
It is better than a cast, because cast is a mapping operation executed after it selected in original type, custom schema is not.
I had a case, when spark can not select NaN from postgres Numeric, because it maps numerics into java BigDecimal which does not allow NaN, so spark job failed every time when reading those values. Cast produced the same result. However after changing a scheme to either String or Double, it was able to read it properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
                In my case, I just put the schema to the connection properties, which are provided to jdbc call. It worked, on spark 2.3.*, 2.4.*. Did not tested on different versions. It is possible that just schema() method is not suitable for the jdbc case.
– Volodymyr Zubariev
                Apr 4, 2019 at 19:22
                Tested on 2.2.1, works as well. This question is old, but if you still interested in specific version I can check. I hope all 2.* has support of custom schema via properties parameter.
– Volodymyr Zubariev
                Apr 4, 2019 at 23:29
You can use a Custom schema and put in the properties parameters. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user":"xxx", 
          "password": "yyy",
          "driver":"com.mysql.jdbc.Driver",
          "customSchema":c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column name and it's going to change only the column 
you put inside the customized schema.
You don't miss anything. Modifying schema on read with JDBC sources was never supported. The input is already typed so there there is no place for schema.
If the types are not satisfying, just cast the results to the desired types.
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.