))
.cast(DataTypes.IntegerType)
Fifth approach: use the expr() function
scala> df.withColumn("rsrp4", expr("rsrp * 4")).show
+----+----+----+-----+
|  id|rsrp|rsrq|rsrp4|
+----+----+----+-----+
|key1|  23| 1.0|   92|
|key1|  10| 2.0|   40|
+----+----+----+-----+
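The same computed column can also be produced with selectExpr, which parses SQL expression strings directly; a minimal sketch, reusing only the column names already shown above:

// Equivalent to withColumn + expr: project the existing columns plus a computed one.
df.selectExpr("id", "rsrp", "rsrq", "rsrp * 4 as rsrp4").show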
Dropping columns from a Dataset
scala> df.drop("rsrp").show
+----+----+
|  id|rsrq|
+----+----+
|key1| 1.0|
|key1| 2.0|
+----+----+
scala> df.drop("rsrp","rsrq").show
+----+
|  id|
+----+
|key1|
|key1|
+----+
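drop also has an overload that takes a Column instead of a name, which is convenient when a column reference is already held in a variable; a small sketch under that assumption:

import org.apache.spark.sql.functions.col

// Same effect as df.drop("rsrp"), written with a Column reference.
df.drop(col("rsrp")).show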
Replacing null values in a Dataset
First, prepare the test file /user/spark/test.csv on HDFS:
[spark@master ~]$ hadoop fs -text /user/spark/test.csv
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8
Note: to run spark-shell from any directory, append Spark's installation directory to PATH in /etc/profile:
export SPARK_HOME=/opt/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
Load /user/spark/test.csv with spark-shell:
[spark@master ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/29 21:50:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.0.120:4040
Spark context available as 'sc' (master = local[*], app id = local-1540821032565).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = spark.read.option("header","true").csv("/user/spark/test.csv")
18/10/29 21:51:16 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/10/29 21:51:16 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
18/10/29 21:51:37 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]
scala> df.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
|null|   4|null|  t1|   8|
+----+----+----+----+----+
scala> df.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(key1,StringType,true), StructField(key2,StringType,true),
StructField(key3,StringType,true), StructField(key4,StringType,true), StructField(key5,StringType,true))
scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: string (nullable = true)
 |-- key3: string (nullable = true)
 |-- key4: string (nullable = true)
 |-- key5: string (nullable = true)
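All five columns come back as string because spark.read.csv does not infer types unless told to. If typed columns are wanted, the standard inferSchema option can be enabled when reading the file; a minimal sketch (typedDf is a name introduced here only for illustration):

// Let Spark sample the data and guess column types; numeric columns such as
// key2, key3 and key5 would then typically be read as int rather than string.
val typedDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/spark/test.csv")
typedDf.printSchema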
An example of filling several columns of the same type in one call. Here every null in key3 and key5 is replaced with 1024. Columns loaded from CSV default to string; if the columns were integers the call would look the same, since na.fill is overloaded for each value type.
scala> df.na.fill("1024",Seq("key3","key5")).show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|1024|
| bbb|   1|   5|  t3|   0|
|null|   4|1024|  t1|   8|
+----+----+----+----+----+
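Because na.fill is overloaded for numeric and string values, the same call shape works on typed columns; a sketch assuming the hypothetical typedDf (with integer key3/key5 columns) from the inferSchema example above:

// On integer columns, pass a number instead of a string.
typedDf.na.fill(1024, Seq("key3", "key5")).show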
An example of filling several columns of different types in one call. Again, columns loaded from CSV default to string; for typed columns the call looks the same, since there are overloads for each value type.
scala> df.na.fill(Map(("key1"->"yyy"),("key3","1024"),("key4","t88"),("key5","4096"))).show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2| t88|   7|
| yyy|   7|   3|  t1|4096|
| bbb|   1|   5|  t3|   0|
| yyy|   4|1024|  t1|   8|
+----+----+----+----+----+
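The Map-based overload takes Map[String, Any], so replacement values of different types can be mixed in one call when the columns themselves are typed; a sketch again assuming the hypothetical typedDf:

// Per-column replacement values: strings for string columns, numbers for numeric ones.
typedDf.na.fill(Map("key1" -> "yyy", "key3" -> 1024, "key4" -> "t88", "key5" -> 4096)).show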
Instead of filling values, you can simply filter out rows that contain nulls. Here rows with a null in key3 or key5 are dropped.
scala> df.na.drop(Seq("key3","key5")).show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+
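na.drop also accepts a how argument: "any" (the default used above) drops a row if any of the listed columns is null, while "all" drops it only when all of them are null; a small sketch:

// Keep a row unless BOTH key3 and key5 are null.
df.na.drop("all", Seq("key3", "key5")).show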
Drop rows that have fewer than n non-null values among the specified columns. Here rows with fewer than 2 non-null values across key1, key2 and key3 are dropped; in the last row two of these three columns are null, so it is removed.
scala> df.na.drop(2,Seq("key1","key2","key3")).show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+
Same as above; if no column list is given, all columns are considered.
scala> df.na.drop(4).show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+
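To drop every row that contains a null in any column at all, call drop with no arguments; a minimal sketch:

// Equivalent to df.na.drop("any") over all columns: only complete rows survive.
df.na.drop().show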