from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set up the Spark and SQL contexts
conf = SparkConf().setAppName('simpleTest')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc.version)

# Read the JSON file, passing the expected timestamp pattern
json_file = 'json'
df = sqlContext.read.json(json_file, timestampFormat='yyyy-MM-dd HH:mm:ss')
df.printSchema()

The output is:

2.0.2
 |-- myTime: string (nullable = true)

I expected the column to be inferred as a timestamp. What am I missing?

You need to define a schema explicitly:

from pyspark.sql.types import StructType, StructField, TimestampType

# `spark` is the SparkSession (predefined in the pyspark shell)
schema = StructType([StructField("myTime", TimestampType(), True)])
df = spark.read.json(json_file, schema=schema, timestampFormat="yyyy-MM-dd HH:mm:ss")

This will output:

>>> df.collect()
[Row(myTime=datetime.datetime(2016, 10, 26, 18, 19, 15))]
>>> df.printSchema()
 |-- myTime: timestamp (nullable = true)
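
To make the role of the pattern more concrete, here is a rough, Spark-free analogy using only the Python standard library. It is a sketch, not Spark's actual parser: `parse_timestamp` and `PY_FORMAT` are hypothetical names, and the `strptime` directives are an assumed approximate equivalent of the Java-style pattern `yyyy-MM-dd HH:mm:ss`.

```python
from datetime import datetime

# Assumed strptime equivalent of the Java-style pattern "yyyy-MM-dd HH:mm:ss".
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value, fmt=PY_FORMAT):
    """Hypothetical helper: return a datetime if `value` matches `fmt`, else None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        # The string does not match the pattern, so no timestamp can be built.
        return None


print(parse_timestamp("2016-10-26 18:19:15"))  # 2016-10-26 18:19:15
print(parse_timestamp("26/10/2016 18:19"))     # None
```

The pattern only says how to turn a string into a timestamp value; whether the column is a timestamp at all is decided by the schema.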
So the point of defining "timestampFormat" is to specify how timestamp strings are interpreted when they are fit into a "TimestampType()" column in a schema?
– lioran
May 10, 2017 at 8:17

In addition to Dat Tran's solution, you can also cast the DataFrame column directly after reading the file.

# example
from pyspark.sql import Row
json = [Row(**{"myTime": "2016-10-26 18:19:15"})]
df = spark.sparkContext.parallelize(json).toDF()
# using cast to 'timestamp' format
df_time = df.select(df['myTime'].cast('timestamp'))
df_time.printSchema()
 |-- myTime: timestamp (nullable = true)
Would doing it this way impact performance on large datasets, since we perform two DataFrame operations?
– lioran
May 10, 2017 at 8:13
It will take a little more time, but not much IMO, since this just applies the cast function to the column.
– titipata
May 10, 2017 at 8:22
        
