import findspark
findspark.init(spark_home='/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive', python_path='python2.7')

import pyspark
from pyspark.sql import *

sc = pyspark.sql.SparkSession.builder.master("yarn-client").config("spark.executor.memory", "2g").config('spark.driver.memory', '1g').config('spark.driver.cores', '4').enableHiveSupport().getOrCreate()
sqlContext = SQLContext(sc)

Then when I do:

spark_df = sqlContext.createDataFrame(df_in)

where df_in is a pandas DataFrame, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-1db231ce21c9> in <module>()
----> 1 spark_df = sqlContext.createDataFrame(df_in)
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    297         Py4JJavaError: ...
    298         """
--> 299         return self.sparkSession.createDataFrame(data, schema, samplingRatio)
    301     @since(1.3)
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/home/edamame/spark/spark-2.0.0-bin-spark-2.0.0-bin-hadoop2.6-hive/python/pyspark/sql/session.pyc in _createFromLocal(self, data, schema)
    400         # convert python objects to sql data
    401         data = [schema.toInternal(row) for row in data]
--> 402         return self._sc.parallelize(data), schema
    404     @since(2.0)
AttributeError: 'SparkSession' object has no attribute 'parallelize'

Does anyone know what I did wrong? Thanks!

SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way you used to use SQLContext:

spark.createDataFrame(...)

and if you ever have to access the SparkContext, use its sparkContext attribute:

spark.sparkContext

so if you need an SQLContext for backwards compatibility, you can create one with:

SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
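
Putting this together for the setup in the question, a minimal sketch (keeping the asker's yarn-client master, memory settings, and df_in pandas DataFrame, which are carried over from the question rather than recommendations) would be:

from pyspark.sql import SparkSession, SQLContext

# Build the session; this is the single entry point in Spark 2.x
spark = (SparkSession.builder
         .master("yarn-client")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.memory", "1g")
         .config("spark.driver.cores", "4")
         .enableHiveSupport()
         .getOrCreate())

# df_in is the pandas DataFrame from the question; convert it with the session directly
spark_df = spark.createDataFrame(df_in)

# Only if legacy code still expects an SQLContext:
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
spark_df2 = sqlContext.createDataFrame(df_in)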
I then tried: spark_df = sc.createDataFrame(df_in), but the spark_df seems to be corrupted. Is spark_df = sc.createDataFrame(df_in) the right way to do the conversion here?
– Edamame
Sep 15, 2016 at 23:56

Whenever you create a DataFrame from a backwards-compatible object such as an RDD, or from a DataFrame created by a SparkSession, you need to make the SQLContext aware of your session and context.

For example, if I create an RDD:

from pyspark.sql import SparkSession, SQLContext

ss = SparkSession.builder.appName("vivek").master('local').config("k1", "vi").getOrCreate()
rdd = ss.sparkContext.parallelize([('Alex', 21), ('Bob', 44)])

But if we wish to create a DataFrame from this RDD, we first need:

sq=SQLContext(sparkContext=ss.sparkContext, sparkSession=ss)

Only then can we use the SQLContext with an RDD, or with a DataFrame created from pandas:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df = sq.createDataFrame(rdd, schema)
df.collect()
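
The same sq context can also take a pandas DataFrame directly, which matches the original question. A minimal sketch (the pandas data below is a made-up placeholder, not from the answer):

import pandas as pd

# hypothetical pandas input, just to illustrate the conversion
pdf = pd.DataFrame({"name": ["Alex", "Bob"], "age": [21, 44]})
df_from_pandas = sq.createDataFrame(pdf)
df_from_pandas.show()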
        
