Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
With spark.read.format ... once can add the custom schema non-programmatically, like so:
val df = sqlContext
.read()
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true")
.option("user", "root")
.option("password", "password")
.option("dbtable", sql)
.schema(customSchema)
.load();
However, using spark.read.jdbc, I cannot seem to do the same or find the syntax to do the same as for the above. What am i missing or has this changed in SPARK 2.x? I read this in the manual: ... Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. ... Presumably what I am trying to do is no longer possible as in the above example.
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
I ended up trying this:
val dataframe_mysql = spark.read.schema(openPositionsSchema).jdbc(jdbcUrl, "(select k, v from sample) e ", connectionProperties)
and got this:
org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`;
Seems a retrograde step in a certain way.
–
–
–
I do not agree with the answer.
You can supply custom schema using your method or by setting properties:
connectionProperties.put("customSchema", schemachanges);
Where schema changes in format "field Name" "New data type", ... :
"key String, value DECIMAL(20, 0)"
If key was an number in original table, it will generate an SQL query like "key::character varying, value::numeric(20, 0)"
It is better than a cast, because cast is a mapping operation executed after it selected in original type, custom schema is not.
I had a case, when spark can not select NaN from postgres Numeric, because it maps numerics into java BigDecimal which does not allow NaN, so spark job failed every time when reading those values. Cast produced the same result. However after changing a scheme to either String or Double, it was able to read it properly.
Spark documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
–
–
You can use a Custom schema and put in the properties parameters. You can read more at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Create a variable:
c_schema = 'id_type INT'
Properties conf:
config = {"user":"xxx",
"password": "yyy",
"driver":"com.mysql.jdbc.Driver",
"customSchema":c_schema}
Read the table and create the DF:
df = spark.read.jdbc(url=jdbc_url,table='table_name',properties=config)
You must use the same column name and it's going to change only the column
you put inside the customized schema.
You don't miss anything. Modifying schema on read with JDBC sources was never supported. The input is already typed so there there is no place for schema
.
If the types are not satisfying, just cast
the results to the desired types.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.