
I used the JDBC driver to connect PySpark to MySQL. I can retrieve data from MySQL with

df = sqlContext.sql("select * from dbTable")

This query works fine. My question is: how can I query the "info" column, which holds JSON? For example, the query below works in the MySQL shell and retrieves data, but it is not supported in PySpark (2+):

select id, info->"$.name" from dbTable where info->"$.name"='pat'
You can use the get_json_object function from pyspark.sql.functions. The two lines below are independent examples: the first extracts the value, the second filters on it.

from pyspark.sql.functions import get_json_object

res = df.select(get_json_object(df['info'], "$.name").alias('name'))  # extract info.name
res = df.filter(get_json_object(df['info'], "$.name") == 'pat')       # filter on info.name
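For completeness, the extraction and the filter can also be chained into one expression; this is just the two calls above combined, nothing new:

from pyspark.sql.functions import get_json_object

# keep rows where info.name is 'pat', then project the extracted name
res = (df
       .filter(get_json_object(df['info'], "$.name") == 'pat')
       .select(get_json_object(df['info'], "$.name").alias('name')))
res.show()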

There is already a function named get_json_object.

For your situation:

df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table='test.test_json',
                     properties={'user': 'hive', 'password': '123456'})
df.createOrReplaceTempView('test_json')
res = spark.sql("""
select col_json, get_json_object(col_json, '$.name') from test_json
""")
res.show()

Spark SQL is almost the same as Hive SQL; for the supported syntax, see

https://cwiki.apache.org/confluence/display/Hive/Home
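Mapping the original MySQL query onto Spark SQL would look like the sketch below; it assumes the test_json view and col_json column registered above (your table and column names may differ):

res = spark.sql("""
select get_json_object(col_json, '$.name') as name
from test_json
where get_json_object(col_json, '$.name') = 'pat'
""")
res.show()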

Thanks for your reply. This method only works when the data is already loaded into a DataFrame. There are hundreds of thousands of records, so it might not be efficient to load the complete table and then filter it. Is there a way to retrieve only the matching rows (JSON search) with a query, rather than loading the complete table? – ciri Jan 10, 2017 at 21:04
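One option (a sketch, not from the answers above): Spark's JDBC reader accepts a parenthesized subquery as the table argument, so the JSON predicate runs inside MySQL and only the matching rows are transferred to Spark. The URL and credentials below reuse the placeholder values from the earlier answer:

# MySQL evaluates the JSON predicate; Spark only receives matching rows
query = """(select id, info from test.test_json
            where info->"$.name" = 'pat') as t"""
df = spark.read.jdbc(url='jdbc:mysql://localhost:3306', table=query,
                     properties={'user': 'hive', 'password': '123456'})
df.show()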
