Documentation: http://spark.apache.org/docs/latest/api/sql/index.html
I. Date and Time Functions
1. Add months
add_months(startDate: Column, numMonths: Int)
Adds n months to the given date
scala> spark.sql("select add_months('2018-01-01',3)").show
+---------------------------------------+
|add_months(CAST(2018-01-01 AS DATE), 3)|
+---------------------------------------+
| 2018-04-01|
+---------------------------------------+
2. Add days
date_add(start: Column, days: Int)
Returns the date n days after the given date
scala> spark.sql("select date_add('2018-01-01',3)").show
+-------------------------------------+
|date_add(CAST(2018-01-01 AS DATE), 3)|
+-------------------------------------+
| 2018-01-04|
+-------------------------------------+
3. Subtract days
date_sub(start: Column, days: Int)
Returns the date n days before the given date
scala> spark.sql("select date_sub('2018-01-01',3)").show
+-------------------------------------+
|date_sub(CAST(2018-01-01 AS DATE), 3)|
+-------------------------------------+
| 2017-12-29|
+-------------------------------------+
4. Days between two dates
datediff(end: Column, start: Column)
Number of days between two dates
scala> spark.sql("select datediff('2018-01-01','2017-12-09')").show
+------------------------------------------------------------+
|datediff(CAST(2018-01-01 AS DATE), CAST(2017-12-09 AS DATE))|
+------------------------------------------------------------+
| 23|
+------------------------------------------------------------+
5. Current date: current_date()
scala> spark.sql("select current_date()").show
+--------------+
|current_date()|
+--------------+
| 2020-04-04|
+--------------+
6. Current timestamp
current_timestamp()
Returns the current timestamp as a TimestampType value
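A minimal example (the returned value depends on when the query runs, so the output is omitted here):
scala> spark.sql("select current_timestamp()").show(false)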
7. date_format(dateExpr: Column, format: String)
Formats a date/timestamp as a string according to the given format
scala> spark.sql("select date_format('2018-05-06','YYYY年MM月dd日')").show
+----------------------------------------------------------+
|date_format(CAST(2018-05-06 AS TIMESTAMP), YYYY年MM月dd日)|
+----------------------------------------------------------+
| 2018年05月06日|
+----------------------------------------------------------+
8. dayofmonth(e: Column)
Day of the month; accepts date/timestamp/string
scala> spark.sql("select dayofmonth('2018-05-06')").show
+------------------------------------+
|dayofmonth(CAST(2018-05-06 AS DATE))|
+------------------------------------+
| 6|
+------------------------------------+
9. Day of the year
dayofyear(e: Column)
Day of the year; accepts date/timestamp/string
scala> spark.sql("select dayofyear('2018-05-06')").show
+-----------------------------------+
|dayofyear(CAST(2018-05-06 AS DATE))|
+-----------------------------------+
| 126|
+-----------------------------------+
10. Week of the year
weekofyear(e: Column)
Week of the year; accepts date/timestamp/string
scala> spark.sql("select weekofyear('2018-05-06')").show
+------------------------------------+
|weekofyear(CAST(2018-05-06 AS DATE))|
+------------------------------------+
| 18|
+------------------------------------+
11. from_unixtime(ut: Column, f: String)
Converts a Unix timestamp to a string in the given format
scala> spark.sql("SELECT FROM_UNIXTIME(1562319256,'YYYY-MM-dd HH:mm:ss')").show
+--------------------------------------------------------------+
|from_unixtime(CAST(1562319256 AS BIGINT), YYYY-MM-dd HH:mm:ss)|
+--------------------------------------------------------------+
| 2019-07-05 17:34:16|
+--------------------------------------------------------------+
12. from_utc_timestamp(ts: Column, tz: String)
Interprets a timestamp as UTC and converts it to the given time zone
scala> spark.sql("SELECT from_utc_timestamp('1970-01-01 08:00:00','PST') as bj_time").show
+-------------------+
| bj_time|
+-------------------+
|1970-01-01 00:00:00|
+-------------------+
scala> spark.sql("SELECT from_utc_timestamp('2016-08-31 00:00:00', 'Asia/Shanghai') as bj_time").show
+-------------------+
| bj_time|
+-------------------+
|2016-08-31 08:00:00|
+-------------------+
GMT (Greenwich Mean Time) is the time standard based on the Greenwich meridian, i.e. universal time.
UTC (Coordinated Universal Time) is now used as the world time standard.
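The session time zone affects how these functions render timestamps. A minimal sketch for checking or changing it (spark.sql.session.timeZone is the Spark SQL session time-zone config; adjust the zone as needed):
scala> spark.conf.get("spark.sql.session.timeZone")
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")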
13. to_utc_timestamp(ts: Column, tz: String)
Interprets a timestamp as being in the given time zone and converts it to UTC
scala> spark.sql("SELECT to_utc_timestamp('1970-01-01 00:00:00','PST') as bj_time").show
+-------------------+
| bj_time|
+-------------------+
|1970-01-01 08:00:00|
+-------------------+
scala> spark.sql("SELECT to_utc_timestamp('2016-08-31', 'Asia/Shanghai') as bj_time").show
+-------------------+
| bj_time|
+-------------------+
|2016-08-30 16:00:00|
+-------------------+
14.hour(e: Column)
Extracts the hour
scala> spark.sql("SELECT hour('2016-08-31 12:00:00')").show
+--------------------------------------------+
|hour(CAST(2016-08-31 12:00:00 AS TIMESTAMP))|
+--------------------------------------------+
| 12|
+--------------------------------------------+
15.minute(e: Column)
Extracts the minute
scala> spark.sql("SELECT minute('2016-08-31 12:35:00')").show
+----------------------------------------------+
|minute(CAST(2016-08-31 12:35:00 AS TIMESTAMP))|
+----------------------------------------------+
| 35|
+----------------------------------------------+
16.month(e: Column)
Extracts the month
scala> spark.sql("SELECT month('2016-08-31 12:35:00')").show
+----------------------------------------+
|month(CAST(2016-08-31 12:35:00 AS DATE))|
+----------------------------------------+
| 8|
+----------------------------------------+
17.quarter(e: Column)
scala> spark.sql("SELECT quarter('2016-08-31 12:35:00')").show
+------------------------------------------+
|quarter(CAST(2016-08-31 12:35:00 AS DATE))|
+------------------------------------------+
| 3|
+------------------------------------------+
18.second(e: Column)
scala> spark.sql("SELECT second('2016-08-31 12:35:20')").show
+----------------------------------------------+
|second(CAST(2016-08-31 12:35:20 AS TIMESTAMP))|
+----------------------------------------------+
| 20|
+----------------------------------------------+
19. year(e: Column): extracts the year
scala> spark.sql("SELECT year('2016-08-31 12:35:20')").show
+---------------------------------------+
|year(CAST(2016-08-31 12:35:20 AS DATE))|
+---------------------------------------+
| 2016|
+---------------------------------------+
20.last_day(e: Column)
Last day of the month that the given date falls in
scala> spark.sql("SELECT last_day('2016-02-02 12:35:20')").show
+-------------------------------------------+
|last_day(CAST(2016-02-02 12:35:20 AS DATE))|
+-------------------------------------------+
| 2016-02-29|
+-------------------------------------------+
21.months_between(date1: Column, date2: Column)
Number of months between two dates
scala> spark.sql("SELECT months_between('2016-03-02 12:35:20','2016-02-03')").show
+-------------------------------------------------------------------------------------------+
|months_between(CAST(2016-03-02 12:35:20 AS TIMESTAMP), CAST(2016-02-03 AS TIMESTAMP), true)|
+-------------------------------------------------------------------------------------------+
| 0.98466249|
+-------------------------------------------------------------------------------------------+
22.next_day(date: Column, dayOfWeek: String)
Returns the first date later than the given date that falls on the specified day of the week. dayOfWeek accepts "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun" (case-insensitive).
scala> spark.sql("SELECT next_day('2016-03-02 12:35:20','Mon')").show
+------------------------------------------------+
|next_day(CAST(2016-03-02 12:35:20 AS DATE), Mon)|
+------------------------------------------------+
| 2016-03-07|
+------------------------------------------------+
23. to_date(e: Column): mainly used for type conversion
Converts the column to DateType
scala> spark.sql("SELECT to_date('2016-03-02 12:35:20')").show
+------------------------------+
|to_date('2016-03-02 12:35:20')|
+------------------------------+
| 2016-03-02|
+------------------------------+
24. trunc(date: Column, format: String): date truncation
Returns date truncated to the unit specified by format; format should be one of "year", "yyyy", "yy", "mon", "month", "mm".
scala> spark.sql("SELECT trunc('2009-02-12', 'MM')").show
+-----------------------------------+
|trunc(CAST(2009-02-12 AS DATE), MM)|
+-----------------------------------+
| 2009-02-01|
+-----------------------------------+
scala> spark.sql("SELECT trunc('2015-10-27', 'YEAR')").show
+-------------------------------------+
|trunc(CAST(2015-10-27 AS DATE), YEAR)|
+-------------------------------------+
| 2015-01-01|
+-------------------------------------+
25.unix_timestamp(s: Column, p: String)
Converts a time string in the given format to a Unix timestamp; called with no arguments it returns the current Unix timestamp
scala> spark.sql("SELECT unix_timestamp()").show
+--------------------------------------------------------+
|unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss)|
+--------------------------------------------------------+
| 1585959401|
+--------------------------------------------------------+
scala> spark.sql("SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd')").show
+--------------------------------------+
|unix_timestamp(2016-04-08, yyyy-MM-dd)|
+--------------------------------------+
| 1460044800|
+--------------------------------------+
II. Math Functions
1. cos, sin, tan
Compute the cosine, sine, tangent, etc. of an angle
scala> spark.sql("SELECT cos(0)").show
+----------------------+
|COS(CAST(0 AS DOUBLE))|
+----------------------+
| 1.0|
+----------------------+
2. sinh, tanh, cosh
Compute the hyperbolic sine, tangent, cosine, etc.
scala> spark.sql("SELECT sinh(0)").show
+-----------------------+
|SINH(CAST(0 AS DOUBLE))|
+-----------------------+
| 0.0|
+-----------------------+
3. acos, asin, atan, atan2
Compute the angle corresponding to a cosine/sine/tangent value
scala> spark.sql("SELECT acos(0)").show
+-----------------------+
|ACOS(CAST(0 AS DOUBLE))|
+-----------------------+
| 1.5707963267948966|
+-----------------------+
4. bin
Converts a long value to its binary string representation. For example, bin("12") returns "1100".
scala> spark.sql("SELECT bin(3)").show
+----------------------+
|bin(CAST(3 AS BIGINT))|
+----------------------+
| 11|
+----------------------+
5. bround
Rounds using Decimal HALF_EVEN mode: a fraction greater than 0.5 rounds up, less than 0.5 rounds down, and exactly 0.5 rounds to the nearest even number.
scala> spark.sql("SELECT bround(3.6)").show
+--------------+
|bround(3.6, 0)|
+--------------+
| 4|
+--------------+
scala> spark.sql("SELECT bround(5.5)").show
+--------------+
|bround(5.5, 0)|
+--------------+
| 6|
+--------------+
scala> spark.sql("SELECT bround(4.5)").show
+--------------+
|bround(4.5, 0)|
+--------------+
| 4|
+--------------+
6. round(e: Column, scale: Int)
Rounds to scale decimal places using HALF_UP mode: a fraction >= 0.5 rounds up, < 0.5 rounds down, i.e. ordinary rounding.
scala> spark.sql("SELECT round(4.5)").show
+-------------+
|round(4.5, 0)|
+-------------+
| 5|
+-------------+
7.ceil
scala> spark.sql("SELECT ceil(4.5)").show
+---------+
|CEIL(4.5)|
+---------+
| 5|
+---------+
8.floor
scala> spark.sql("SELECT floor(4.5)").show
+----------+
|FLOOR(4.5)|
+----------+
| 4|
+----------+
9. cbrt
Returns the cube root
scala> spark.sql("SELECT cbrt(27.0)").show
+--------------------------+
|CBRT(CAST(27.0 AS DOUBLE))|
+--------------------------+
| 3.0|
+--------------------------+
10.conv(num:Column, fromBase: Int, toBase: Int)
Converts a number (given as a string) from one base to another
# binary to decimal
scala> spark.sql("SELECT conv('100', 2, 10)").show
+----------------+
|conv(100, 2, 10)|
+----------------+
| 4|
+----------------+
# hexadecimal to decimal (negative number)
scala> spark.sql("SELECT conv(-10, 16, -10)").show
+----------------------------------+
|conv(CAST(-10 AS STRING), 16, -10)|
+----------------------------------+
| -16|
+----------------------------------+
11.log(base: Double, a: Column):
Logarithm of a to the given base.
scala> spark.sql("SELECT log(10, 100)").show
+--------------------------------------------+
|LOG(CAST(10 AS DOUBLE), CAST(100 AS DOUBLE))|
+--------------------------------------------+
| 2.0|
+--------------------------------------------+
12.log(a: Column):
scala> spark.sql("SELECT log(10)").show
+----------------------------+
|LOG(E(), CAST(10 AS DOUBLE))|
+----------------------------+
| 2.302585092994046|
+----------------------------+
13.log10(a: Column):
scala> spark.sql("SELECT log10(10)").show
+-------------------------+
|LOG10(CAST(10 AS DOUBLE))|
+-------------------------+
| 1.0|
+-------------------------+
14.log2(a: Column):
scala> spark.sql("SELECT log2(2)").show
+-----------------------+
|LOG2(CAST(2 AS DOUBLE))|
+-----------------------+
| 1.0|
+-----------------------+
15.log1p(a: Column):
scala> spark.sql("SELECT log1p(0)").show
+------------------------+
|LOG1P(CAST(0 AS DOUBLE))|
+------------------------+
| 0.0|
+------------------------+
16. pmod(dividend: Column, divisor: Column): returns the positive remainder of dividend mod divisor
scala> spark.sql("SELECT pmod(10, 3)").show
+-----------+
|pmod(10, 3)|
+-----------+
| 1|
+-----------+
scala> spark.sql("SELECT pmod(-10, 3)").show
+------------+
|pmod(-10, 3)|
+------------+
| 2|
+------------+
Remainders of negative numbers follow this rule:
For integers a and d with d nonzero, the remainder r satisfies
a = q*d + r, where q is an integer and 0 <= |r| < |d|.
In general two values of r satisfy the definition; r1 and r2 are called the positive and negative remainder respectively, and r1 = r2 + d.
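For example, for pmod(-10, 3) above: -10 = (-4)*3 + 2, so the positive remainder is r1 = 2; the negative remainder would be r2 = -1, and indeed r1 = r2 + 3.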
17. pow(l: Double, r: Column): note that r is a Column
scala> spark.sql("SELECT pow(2, 3)").show
+-------------------------------------------+
|POWER(CAST(2 AS DOUBLE), CAST(3 AS DOUBLE))|
+-------------------------------------------+
| 8.0|
+-------------------------------------------+
18. pow(l: Column, r: Double): note that l is a Column
scala> spark.sql("SELECT pow(2, 3.3)").show
+---------------------------------------------+
|POWER(CAST(2 AS DOUBLE), CAST(3.3 AS DOUBLE))|
+---------------------------------------------+
| 9.849155306759329|
+---------------------------------------------+
19. pow(l: Column, r: Column): note that both l and r are Columns (see the sketch below)
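A minimal DataFrame-API sketch of this overload (the DataFrame and column names are made up for illustration; in spark-shell, toDF and $ come from the automatically imported spark.implicits._):
scala> import org.apache.spark.sql.functions.pow
scala> val powDf = Seq((2.0, 3.0)).toDF("base", "exp")   // hypothetical two-column DataFrame
scala> powDf.select(pow($"base", $"exp")).show           // 2.0 to the power 3.0 = 8.0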
20. radians(e: Column): converts degrees to radians
scala> spark.sql("SELECT radians(180)").show
+----------------------------+
|RADIANS(CAST(180 AS DOUBLE))|
+----------------------------+
| 3.141592653589793|
+----------------------------+
21. rint(e: Column): returns the double value closest to the argument that is equal to a mathematical integer.
scala> spark.sql("SELECT rint(12.3456)").show
+------------------------------+
|ROUND(CAST(12.3456 AS DOUBLE))|
+------------------------------+
| 12.0|
+------------------------------+
22. shiftLeft(e: Column, numBits: Int): bitwise left shift
scala> spark.sql("SELECT shiftleft(2, 1)").show
+---------------+
|shiftleft(2, 1)|
+---------------+
| 4|
+---------------+
23. shiftRight(e: Column, numBits: Int): bitwise right shift
scala> spark.sql("SELECT shiftRight(2, 1)").show
+----------------+
|shiftright(2, 1)|
+----------------+
| 1|
+----------------+
24. shiftRightUnsigned(e: Column, numBits: Int): bitwise unsigned right shift
scala> spark.sql("SELECT shiftrightunsigned(4, 1)").show
+------------------------+
|shiftrightunsigned(4, 1)|
+------------------------+
| 2|
+------------------------+
25. signum(e: Column): returns the sign of the value
scala> spark.sql("SELECT signum(40)").show
+--------------------------+
|SIGNUM(CAST(40 AS DOUBLE))|
+--------------------------+
| 1.0|
+--------------------------+
scala> spark.sql("SELECT signum(-40)").show
+---------------------------+
|SIGNUM(CAST(-40 AS DOUBLE))|
+---------------------------+
| -1.0|
+---------------------------+
26. sqrt(e: Column): square root
scala> spark.sql("SELECT sqrt(4)").show
+-----------------------+
|SQRT(CAST(4 AS DOUBLE))|
+-----------------------+
| 2.0|
+-----------------------+
27. hex(column: Column): converts to hexadecimal
scala> spark.sql("SELECT hex(17)").show
+-----------------------+
|hex(CAST(17 AS BIGINT))|
+-----------------------+
| 11|
+-----------------------+
scala> spark.sql("SELECT hex('Spark SQL')").show
+------------------+
| hex(Spark SQL)|
+------------------+
|537061726B2053514C|
+------------------+
28. unhex(column: Column): inverse of hex; converts a hexadecimal string back to its original value
scala> spark.sql("SELECT decode(unhex('537061726B2053514C'), 'UTF-8')").show
+----------------------------------------+
|decode(unhex(537061726B2053514C), UTF-8)|
+----------------------------------------+
| Spark SQL|
+----------------------------------------+
III. Sorting Functions
Test data:
{"EMPNO": 7369,"ENAME": "SMITH","JOB": "CLERK","MGR": 7902,"HIREDATE": "1980-12-17 00:00:00","SAL": 800.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7499,"ENAME": "ALLEN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-20 00:00:00","SAL": 1600.00,"COMM": 300.00,"DEPTNO": 30}
{"EMPNO": 7521,"ENAME": "WARD","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-22 00:00:00","SAL": 1250.00,"COMM": 500.00,"DEPTNO": 30}
{"EMPNO": 7566,"ENAME": "JONES","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-04-02 00:00:00","SAL": 2975.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7654,"ENAME": "MARTIN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-28 00:00:00","SAL": 1250.00,"COMM": 1400.00,"DEPTNO": 30}
{"EMPNO": 7698,"ENAME": "BLAKE","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-05-01 00:00:00","SAL": 2850.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7782,"ENAME": "CLARK","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-06-09 00:00:00","SAL": 2450.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7788,"ENAME": "SCOTT","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1987-04-19 00:00:00","SAL": 1500.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7839,"ENAME": "KING","JOB": "PRESIDENT","MGR": null,"HIREDATE": "1981-11-17 00:00:00","SAL": 5000.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7844,"ENAME": "TURNER","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-08 00:00:00","SAL": 1500.00,"COMM": 0.00,"DEPTNO": 30}
{"EMPNO": 7876,"ENAME": "ADAMS","JOB": "CLERK","MGR": 7788,"HIREDATE": "1987-05-23 00:00:00","SAL": 1100.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7900,"ENAME": "JAMES","JOB": "CLERK","MGR": 7698,"HIREDATE": "1981-12-03 00:00:00","SAL": 950.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7902,"ENAME": "FORD","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1981-12-03 00:00:00","SAL": 3000.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7934,"ENAME": "MILLER","JOB": "CLERK","MGR": 7782,"HIREDATE": "1982-01-23 00:00:00","SAL": 1300.00,"COMM": null,"DEPTNO": 10}
Using the DataFrame API programmatically:
scala> val df=spark.read.json("/root/emp.json")
1. asc(columnName: String): ascending order
scala> df.sort(asc("deptno")).show
+------+------+-----+------+-------------------+---------+----+------+
| COMM|DEPTNO|EMPNO| ENAME| HIREDATE| JOB| MGR| SAL|
+------+------+-----+------+-------------------+---------+----+------+
| null| 10| 7839| KING|1981-11-17 00:00:00|PRESIDENT|null|5000.0|
| null| 20| 7566| JONES|1981-04-02 00:00:00| MANAGER|7839|2975.0|
| null| 20| 7876| ADAMS|1987-05-23 00:00:00| CLERK|7788|1100.0|
| 300.0| 30| 7499| ALLEN|1981-02-20 00:00:00| SALESMAN|7698|1600.0|
2. asc_nulls_first(columnName: String): ascending order, nulls first
scala> df.sort(asc_nulls_first("comm")).show
+------+------+-----+------+-------------------+---------+----+------+
| COMM|DEPTNO|EMPNO| ENAME| HIREDATE| JOB| MGR| SAL|
+------+------+-----+------+-------------------+---------+----+------+
| null| 20| 7369| SMITH|1980-12-17 00:00:00| CLERK|7902| 800.0|
| null| 20| 7876| ADAMS|1987-05-23 00:00:00| CLERK|7788|1100.0|
| 0.0| 30| 7844|TURNER|1981-09-08 00:00:00| SALESMAN|7698|1500.0|
| 300.0| 30| 7499| ALLEN|1981-02-20 00:00:00| SALESMAN|7698|1600.0|
| 500.0| 30| 7521| WARD|1981-02-22 00:00:00| SALESMAN|7698|1250.0|
|1400.0| 30| 7654|MARTIN|1981-09-28 00:00:00| SALESMAN|7698|1250.0|
+------+------+-----+------+-------------------+---------+----+------+
3. asc_nulls_last(columnName: String): ascending order, nulls last
scala> df.sort(asc_nulls_last("comm")).show
+------+------+-----+------+-------------------+---------+----+------+
| COMM|DEPTNO|EMPNO| ENAME| HIREDATE| JOB| MGR| SAL|
+------+------+-----+------+-------------------+---------+----+------+
| 0.0| 30| 7844|TURNER|1981-09-08 00:00:00| SALESMAN|7698|1500.0|
| 300.0| 30| 7499| ALLEN|1981-02-20 00:00:00| SALESMAN|7698|1600.0|
| 500.0| 30| 7521| WARD|1981-02-22 00:00:00| SALESMAN|7698|1250.0|
|1400.0| 30| 7654|MARTIN|1981-09-28 00:00:00| SALESMAN|7698|1250.0|