Maps each group of the current DataFrame using a pandas UDF and returns the result as a DataFrame.
The function should take a pandas.DataFrame and return another pandas.DataFrame. Alternatively, the user can pass a function that takes a tuple of the grouping key(s) and a pandas.DataFrame. For each group, all columns are passed together as a pandas.DataFrame to the user-function, and the returned pandas.DataFrame objects are combined as a DataFrame.
The schema should be a StructType describing the schema of the returned pandas.DataFrame. The column labels of the returned pandas.DataFrame must either match the field names in the defined schema if specified as strings, or match the field data types by position if not strings, e.g. integer indices (see the positional-matching sketch in the Examples below). The length of the returned pandas.DataFrame can be arbitrary.
New in version 3.0.0.
Changed in version 3.4.0: Support Spark Connect.
Parameters
func : function
    a Python native function that takes a pandas.DataFrame and outputs a pandas.DataFrame, or that takes one tuple (grouping keys) and a pandas.DataFrame and outputs a pandas.DataFrame.
schema : pyspark.sql.types.DataType or str
    the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
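For instance, the two schema values below are interchangeable for the id/v examples that follow (a minimal sketch; the variable names are illustrative):
>>> from pyspark.sql.types import StructType, StructField, LongType, DoubleType
>>> # DDL-formatted type string
>>> ddl_schema = "id long, v double"
>>> # equivalent StructType object
>>> struct_schema = StructType([
...     StructField("id", LongType()),
...     StructField("v", DoubleType()),
... ])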
Notes
This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.
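One rough way to spot oversized groups before applying a function is to inspect group sizes first (a hedged sketch using the df defined in the Examples below; the threshold to act on is up to the user):
>>> df.groupby("id").count().orderBy("count", ascending=False).show()
+---+-----+
| id|count|
+---+-----+
|  2|    3|
|  1|    2|
+---+-----+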
Examples
>>> import pandas as pd
>>> from pyspark.sql.functions import ceil
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> def normalize(pdf):
...     v = pdf.v
...     return pdf.assign(v=(v - v.mean()) / v.std())
>>> df.groupby("id").applyInPandas(
... normalize, schema="id long, v double").show()
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+
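The schema fields can also be matched by position instead of by name: if the returned pandas.DataFrame keeps pandas' default integer column labels, each column fills the corresponding schema field in order. A minimal sketch (the helper name mean_positional is an illustrative assumption, and the row order of show() may vary):
>>> def mean_positional(pdf):
...     # columns are left unnamed, so the default integer labels
...     # match the schema fields ("id", "v") by position
...     return pd.DataFrame([[pdf.id.iloc[0], pdf.v.mean()]])
>>> df.groupby("id").applyInPandas(
...     mean_positional, schema="id long, v double").show()
+---+---+
| id|  v|
+---+---+
|  1|1.5|
|  2|6.0|
+---+---+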
Alternatively, the user can pass a function that takes two arguments.
In this case, the grouping key(s) will be passed as the first argument and the data will
be passed as the second argument. The grouping key(s) will be passed as a tuple of numpy
data types, e.g., numpy.int32 and numpy.float64. The data will still be passed in
as a pandas.DataFrame containing all columns from the original Spark DataFrame.
This is useful when the user does not want to hardcode grouping key(s) in the function.
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> def mean_func(key, pdf):
...     # key is a tuple of one numpy.int64, which is the value
...     # of 'id' for the current group
...     return pd.DataFrame([key + (pdf.v.mean(),)])
>>> df.groupby('id').applyInPandas(
... mean_func, schema="id long, v double").show()
+---+---+
| id|  v|
+---+---+
|  1|1.5|
|  2|6.0|
+---+---+
>>> def sum_func(key, pdf):
...     # key is a tuple of two numpy.int64s, which are the values
...     # of 'id' and 'ceil(df.v / 2)' for the current group
...     return pd.DataFrame([key + (pdf.v.sum(),)])
>>> df.groupby(df.id, ceil(df.v / 2)).applyInPandas(
... sum_func, schema="id long, `ceil(v / 2)` long, v double").show()
+---+-----------+----+
| id|ceil(v / 2)|   v|
+---+-----------+----+
|  2|          5|10.0|
|  1|          1| 3.0|
|  2|          3| 5.0|
|  2|          2| 3.0|
+---+-----------+----+
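Note that the second field name in the last schema, `ceil(v / 2)`, is wrapped in backticks: in a DDL-formatted string, backticks quote a field name containing spaces or special characters so that it matches the column produced by ceil(df.v / 2).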