PySpark Data Type Conversion Exception Analysis

Author: Fayson
Source: https://cloud.tencent.com/developer/article/1078124


Last modified: 2018-04-01 19:20:26

1. Problem Description


When using PySpark (Spark SQL) to create a DataFrame from a text file on HDFS, several exceptions can come up during data type conversion:

1. Setting a schema field's type to DoubleType throws a "name 'DoubleType' is not defined" exception;

2. Converting a field read from text to DoubleType throws a "DoubleType can not accept object u'23' in type <type 'unicode'>" exception;

3. If the field is instead defined as StringType, Spark SQL can still run aggregations such as sum on it, and non-numeric values are silently excluded from the result.
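Point 3 can be illustrated with a plain-Python sketch of the behavior (this is an assumption about Spark SQL's semantics, not Spark code: sum over a StringType column casts each value to double, and values that cannot be cast become NULL and are skipped):

```python
def string_column_sum(values):
    """Mimic, in plain Python, how Spark SQL sums a StringType column:
    each value is cast to double; uncastable values act like NULL and are skipped."""
    total = None
    for v in values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            continue  # non-numeric value: excluded from the aggregate
        total = x if total is None else total + x
    return total

print(string_column_sum(["23", "abc", "7"]))  # -> 30.0 (the "abc" row is skipped)
```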

The specific exceptions are:

Exception 1:

NameError: name 'DoubleType' is not defined
NameError                                 Traceback (most recent call last)
in engine
      1 schema = StructType([StructField("person_name", StringType(), False),
----> 2                      StructField("person_age", DoubleType(), False)])
NameError: name 'DoubleType' is not defined

Exception 2:

Py4JJavaError: An error occurred while calling o152.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-26-102.ap-southeast-1.compute.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1360, in _verify_type
_verify_type(v, f.dataType, f.nullable)
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1324, in _verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: DoubleType can not accept object u'23' in type <type 'unicode'>
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

2. Solutions


  • Exception 1:
NameError: name 'DoubleType' is not defined

Cause:

The Python code never imports the DoubleType data type from pyspark.sql.types, so the name is undefined.

Fix:

from pyspark.sql.types import *

or

from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType, DoubleType
  • Exception 2:
TypeError: DoubleType can not accept object u'23' in type <type 'unicode'>
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)

Cause:

Values read from a text file arrive in Python as unicode strings, and PySpark's schema verification will not accept a unicode string for a DoubleType field; the value must be converted to a numeric type first.

Fix:

# Schema with two fields - person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", DoubleType(), False)])
lines = spark.read.text("/tmp/test/").rdd \
    .map(lambda x: x[0].split(",")) \
    .map(lambda x: (x[0], float(x[1])))

The key addition is the float(x[1]) call, which converts the field to a float before the DataFrame is built.

After this conversion, the code runs normally.
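The cast in the second map step can be pulled into a small helper; the parsing itself is plain Python and runs without Spark. In the commented lines, spark and schema are assumed to come from the surrounding example:

```python
def parse_line(line):
    # Split one comma-separated line and cast the age field to float,
    # mirroring the fixed .map(lambda x: (x[0], float(x[1]))) step.
    name, age = line.split(",")
    return (name, float(age))

print(parse_line("zhangsan,23"))  # -> ('zhangsan', 23.0)

# In the Spark pipeline (not executed here):
# rdd = spark.read.text("/tmp/test/").rdd.map(lambda r: parse_line(r[0]))
# df = spark.createDataFrame(rdd, schema)
# df.show()
```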

See the official documentation for the data types supported by Spark SQL and DataFrames: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types

3. Summary


1. In the test code above, if the x[1] column contains an empty string or a non-numeric string, the float() conversion fails. So when specifying a field's data type, any such "invalid" values must first be removed from the data, or the job will not run.

With test data containing a non-numeric value (e.g. "23q"), the code fails as follows:

Py4JJavaError: An error occurred while calling o291.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-26-102.ap-southeast-1.compute.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-1-29d7e609de57>", line 1, in <lambda>
    ValueError: invalid literal for float(): 23q
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
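One way to avoid this failure is to drop invalid rows before the cast with an RDD filter. The is_float helper below is not from the original article; it is a sketch of the validity check:

```python
def is_float(s):
    # True if the string can be cast to float, False otherwise
    # (empty strings and values like "23q" return False).
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        return False

print(is_float("23"), is_float("23q"))  # -> True False

# Applied in the pipeline (not executed here; spark is assumed):
# lines = spark.read.text("/tmp/test/").rdd \
#     .map(lambda r: r[0].split(",")) \
#     .filter(lambda x: is_float(x[1])) \
#     .map(lambda x: (x[0], float(x[1])))
```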
 