Ranking within groups:
https://blog.csdn.net/weixin_40161254/article/details/88817225
# `table` is a placeholder for the real table name
df_spark_hotpoi = spark.sql("select routeid, cityid, row_number() over (partition by routeid order by sortno asc) as rank from table where sortno<=5 ")
df_spark_hotpoi.orderBy(["cityid", "rank"], ascending=[True, True]).show()
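To make the semantics of `row_number() over (partition by routeid order by sortno asc)` concrete, here is a minimal plain-Python sketch that needs no Spark session. The rows and the helper name `row_number_by_route` are made up for illustration, and the `sortno<=5` filter from the query is left out.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (routeid, cityid, sortno)
rows = [
    (1, 10, 3),
    (1, 11, 1),
    (2, 10, 2),
    (1, 12, 2),
    (2, 13, 1),
]

def row_number_by_route(rows):
    """Number rows 1, 2, 3, ... within each routeid, ordered by sortno."""
    ranked = []
    # Sorting by (routeid, sortno) makes each partition contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 2))
    for routeid, group in groupby(ordered, key=itemgetter(0)):
        for rank, (_, cityid, _sortno) in enumerate(group, start=1):
            ranked.append((routeid, cityid, rank))
    return ranked

ranked = row_number_by_route(rows)

# Same final ordering as orderBy(["cityid", "rank"]) in the note
ranked_sorted = sorted(ranked, key=itemgetter(1, 2))
print(ranked_sorted)
```

The numbering restarts at 1 for every `routeid`, which is the difference between a window ranking and a global sort.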
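Single-condition and multi-condition groupby
https://blog.csdn.net/weixin_42864239/article/details/94456765
http://www.it1352.com/837888.html
Counting without agg
https://blog.csdn.net/m0_38052384/article/details/100362340
With agg:
Here groupby is followed by agg, and any further operations are applied to the aggregated result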
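import pyspark.sql.functions as F
# Build the data (note that "None" here is a string, not a null value)
df = spark.createDataFrame([
    ("a", "None", "None", "code3"),
    ("b", "code1", "None", "code5"),
    ("b", "code2", "name2", "code5"),
    ("b", "code2", "name2", "code4"),
], ["id", "code", "name", "try"])
df.show()
# Perform the groupby
# Case 1: the note labels this a single-column groupby, although the call below passes two columns, id and try
(df.groupby("id", "try")
   .agg(F.collect_set("code"),    # distinct values of code per group
        F.collect_list("name"))   # all values of name per group, duplicates kept
   .show())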
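The difference between `collect_set` and `collect_list` in the call above can be illustrated without Spark. The following is a plain-Python sketch over the same four rows; the helper `group_collect` is a made-up name, and unlike Spark it keeps list elements in input order, which Spark does not guarantee.

```python
from collections import defaultdict

COLUMNS = ["id", "code", "name", "try"]
rows = [
    ("a", "None", "None", "code3"),
    ("b", "code1", "None", "code5"),
    ("b", "code2", "name2", "code5"),
    ("b", "code2", "name2", "code4"),
]

def group_collect(rows, keys):
    """Group rows by the given columns and collect code (as a set) and name (as a list)."""
    key_idx = [COLUMNS.index(k) for k in keys]
    code_sets = defaultdict(set)    # mirrors collect_set("code"): duplicates dropped
    name_lists = defaultdict(list)  # mirrors collect_list("name"): duplicates kept
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        code_sets[key].add(row[COLUMNS.index("code")])
        name_lists[key].append(row[COLUMNS.index("name")])
    return dict(code_sets), dict(name_lists)

codes_by_id_try, names_by_id_try = group_collect(rows, ["id", "try"])
codes_by_id, names_by_id = group_collect(rows, ["id"])
print(codes_by_id_try)
print(names_by_id)
```

Grouping by `id` alone puts all three "b" rows in one group, so the set holds two codes while the list keeps the repeated `name2`.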
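# Case 2
To keep several original fields in the result, group by all of those fields together.
If the fields correspond to one another consistently (each value of one field always goes with the same value of the others), the groups stay the same as when grouping by one of them.
# Case 3
To keep several original fields in the result, group by all of those fields together.
If the fields do not correspond consistently, the result splits into more groups, one for each distinct combination of the groupby fields.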
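The two cases can be checked by counting distinct key combinations. This is a plain-Python sketch with made-up rows and a hypothetical helper `count_groups`; it only counts groups and does not aggregate anything.

```python
def count_groups(rows, key_indexes):
    """Return how many groups a groupby on the given column positions would produce."""
    return len({tuple(row[i] for i in key_indexes) for row in rows})

# Hypothetical rows: (id, name, region)
rows = [
    (1, "alice", "north"),
    (1, "alice", "south"),
    (2, "bob", "north"),
    (2, "bob", "north"),
]

groups_by_id = count_groups(rows, [0])
# Case 2: name is fully determined by id, so adding it does not split any group.
groups_by_id_name = count_groups(rows, [0, 1])
# Case 3: region varies within id 1, so adding it creates an extra group.
groups_by_id_region = count_groups(rows, [0, 2])

print(groups_by_id, groups_by_id_name, groups_by_id_region)
```

Comparing the group count before and after adding a field is a quick way to tell which case a dataset falls into.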
# See which website contributes the most CTM sales performance to each region
qd_cdf_eachnet=qd_cdf[['Root Id','成交类型','成交网站','月份','单数(拆分)',\
'业绩(拆分)','成交区董']].drop_duplicates().\
groupby(['成交区董','成交网站']).\
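In some cases we need to split the data by a condition: the rows that satisfy the condition are analyzed as one group, and the rows that do not are analyzed as another.
Suppose we have the following data:
from pyspark.sql import Row, functions as F
col_names = ["name", "score"]
value = [
    ("Red", 100.0),
    ("Origen", 80.0),
    ("Yellow", 55.0),
    ("Green", 90.0),
    ("Cyan", 85.0)
]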
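As an illustration of splitting rows into a "meets the condition" group and a "does not" group, here is a plain-Python sketch over the same name/score data. The threshold `PASS_MARK = 60.0` and the labels `pass`/`fail` are assumptions made for this example and do not come from the note.

```python
from statistics import mean

value = [
    ("Red", 100.0),
    ("Origen", 80.0),
    ("Yellow", 55.0),
    ("Green", 90.0),
    ("Cyan", 85.0),
]

PASS_MARK = 60.0  # assumed threshold, not taken from the original note

# Derive a group label from the condition, then collect scores per label.
groups = {"pass": [], "fail": []}
for name, score in value:
    label = "pass" if score >= PASS_MARK else "fail"
    groups[label].append(score)

# Per-group row count and mean score; empty groups are skipped.
summary = {
    label: (len(scores), mean(scores))
    for label, scores in groups.items()
    if scores
}
print(summary)
```

The key step is deriving a label column from the condition first; once the label exists, the analysis is an ordinary groupby on that label.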
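When analyzing data, we usually need to group the data and run an aggregation on each group. In a sense, a window is also a way of computing grouped statistics.
Grouping data
DataFrame.groupBy() returns a GroupedData object, to which aggregate functions, apply(), and pivot() can be applied.
Commonly used aggregate functions:
count(): count the rows
mean(*cols), avg(*cols): compute the mean
max(*cols),min(*...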
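What these per-group aggregates compute can be sketched in plain Python. The rows below are hypothetical and the code only mirrors the meaning of count, mean, max, and min per group; it is not the GroupedData API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (department, salary)
rows = [
    ("sales", 100),
    ("sales", 300),
    ("ops", 200),
]

# Step 1: split the rows into groups by key.
grouped = defaultdict(list)
for dept, salary in rows:
    grouped[dept].append(salary)

# Step 2: reduce every group to one row of aggregate values.
stats = {
    dept: {
        "count": len(salaries),
        "mean": mean(salaries),
        "max": max(salaries),
        "min": min(salaries),
    }
    for dept, salaries in grouped.items()
}
print(stats)
```

Each group collapses to a single output row, which is what distinguishes a grouped aggregation from a window function that keeps every input row.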
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql.functions import hour, when, col, date_format, to_timestamp
from pyspark.sql.functions import *
# Define Spark Context
sc = pyspark.SparkContext(appName="Homework")...
Original code:
for name in list_valid_perfor_inventory:
    time_stamp = time.time()
    df_tmp1 = df_all_performance[df_all_performance['res_ins_id'] == name]  # 1.7 million rows; this statement takes about 2 s
    if df_tmp1.empty:
        co...