
Python: How do I rename columns by group/partition in a Spark DataFrame?

Tags: python, apache-spark, pyspark, apache-spark-sql

I have some sensor data that is stored in a table by channel name rather than by sensor name (this is to avoid extremely wide tables, since many sensors are only used on a handful of devices; I know this is a job for sparse columns, but I'm only a consumer of the data). It looks something like this:

from functools import reduce

import numpy as np
import pandas as pd

np.random.seed(0)

data_df = pd.DataFrame({
    'id': ['a']*5 + ['b']*5 + ['c']*5,
    'chan1': range(15),
    'chan2': np.random.uniform(0, 10, size=15),
    'chan3': np.random.uniform(0, 100, size=15)
})
A second table tells us how channel names map to sensor names for each specific device id:

sensor_channel_df = pd.DataFrame([
    {'id': 'a', 'channel': 'chan1', 'sensor': 'weight'},
    {'id': 'a', 'channel': 'chan2', 'sensor': 'torque'},
    {'id': 'a', 'channel': 'chan3', 'sensor': 'temp'},
    {'id': 'b', 'channel': 'chan1', 'sensor': 'weight'},
    {'id': 'b', 'channel': 'chan2', 'sensor': 'temp'},
    {'id': 'b', 'channel': 'chan3', 'sensor': 'speed'},
    {'id': 'c', 'channel': 'chan1', 'sensor': 'temp'},
    {'id': 'c', 'channel': 'chan2', 'sensor': 'weight'},
    {'id': 'c', 'channel': 'chan3', 'sensor': 'acceleration'},
])
I can build a rename dictionary from it like this:

channel_rename_dict = sensor_channel_df.groupby('id')\
                                       .apply(lambda grp: dict(zip(grp['channel'], grp['sensor'])))\
                                       .to_dict()
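For reference, the resulting nested dict is keyed by device id; for the sensor_channel_df above it looks like:

channel_rename_dict
# {'a': {'chan1': 'weight', 'chan2': 'torque', 'chan3': 'temp'},
#  'b': {'chan1': 'weight', 'chan2': 'temp', 'chan3': 'speed'},
#  'c': {'chan1': 'temp', 'chan2': 'weight', 'chan3': 'acceleration'}}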
Then I can rename all of the columns with another groupby/apply:

data_df.groupby('id')\
       .apply(lambda group: group.rename(columns=channel_rename_dict[group.name]))\
       .reset_index(level=0, drop=True)
The result we get looks like this:

    acceleration id      speed       temp    torque    weight
0            NaN  a        NaN   8.712930  5.488135  0.000000
1            NaN  a        NaN   2.021840  7.151894  1.000000
2            NaN  a        NaN  83.261985  6.027634  2.000000
3            NaN  a        NaN  77.815675  5.448832  3.000000
4            NaN  a        NaN  87.001215  4.236548  4.000000
5            NaN  b  97.861834   6.458941       NaN  5.000000
6            NaN  b  79.915856   4.375872       NaN  6.000000
7            NaN  b  46.147936   8.917730       NaN  7.000000
8            NaN  b  78.052918   9.636628       NaN  8.000000
9            NaN  b  11.827443   3.834415       NaN  9.000000
10     63.992102  c        NaN  10.000000       NaN  7.917250
11     14.335329  c        NaN  11.000000       NaN  5.288949
12     94.466892  c        NaN  12.000000       NaN  5.680446
13     52.184832  c        NaN  13.000000       NaN  9.255966
14     41.466194  c        NaN  14.000000       NaN  0.710361
This all works fine (though I know it wouldn't be surprising if there were a nicer way to do it in pandas), and I have used it to demonstrate the logic of the process to some colleagues.

However, for the project architecture we have decided to use Spark. Is there a way to achieve the same behaviour with Spark DataFrames?

My initial thought was to cache the full data_df and then break it into per-id DataFrames with filters. For example, assuming data_df is now a Spark DataFrame:

data_df.cache()
unique_ids = data_df.select('id').distinct().rdd.map(lambda row: row[0]).collect()
split_dfs = {id: data_df.filter(data_df['id'] == id) for id in unique_ids}
Then, using the column rename dictionary from before, we could do something like this:

dfs_paired_with_rename_tuple_lists = [
    (split_dfs[id], list(channel_rename_dict[id].items()))
    for id in unique_ids
]

new_dfs = [
    reduce(lambda df_i, rename_tuple: df_i.withColumnRenamed(*rename_tuple), rename_tuple_list, df)
    for df, rename_tuple_list in dfs_paired_with_rename_tuple_lists
]
Then, after making sure the Spark DataFrames all have a common set of columns, I could use union() to reduce over this list of Spark DataFrames.
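
A hypothetical sketch of that last step (the explicit column alignment and the double casts are my own additions, not part of the question):

from pyspark.sql.functions import col, lit

# Superset of output columns across the per-id frames
all_sensors = sorted({s for d in channel_rename_dict.values() for s in d.values()})

aligned_dfs = [
    new_df.select(
        [col('id')]
        + [
            col(s).cast('double').alias(s) if s in new_df.columns
            else lit(None).cast('double').alias(s)
            for s in all_sensors
        ]
    )
    for new_df in new_dfs
]

combined = reduce(lambda left, right: left.union(right), aligned_dfs)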


My feeling is that this would be very slow, and that there is probably a better way to do it.

First, let's redefine the mapping: group by id and return a MapType Column (toolz is convenient here, but it can be replaced with itertools.chain)*:
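
A minimal sketch of that mapping, consistent with the nested map shown in the footnote below and assuming toolz's concat and interleave:

from pyspark.sql.functions import col, create_map, lit, struct
from toolz import concat, interleave

# id -> (sensor name -> channel column), built as a nested map expression
channel_map = create_map(*concat(
    (lit(id_), create_map(*interleave([
        list(map(lit, grp["sensor"])),
        list(map(col, grp["channel"]))
    ])))
    for id_, grp in sensor_channel_df.groupby("id")
))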

Next, get the list of sensors:

sensors = sorted(sensor_channel_df["sensor"].unique().tolist())
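For the example data this evaluates to the sorted sensor names, which is also the sensor column order in the output below:

sensors
# ['acceleration', 'speed', 'temp', 'torque', 'weight']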
and combine the data columns:

df = spark.createDataFrame(data_df)
data_cols = struct(*[c for c in df.columns if c != "id"])
The components defined above can now be combined:

cols = [channel_map[col("id")][sensor].alias(sensor) for sensor in sensors]

df.select(["id"] + cols)
+---+------------------+------------------+------------------+------------------+------------------+
|id |acceleration      |speed             |temp              |torque            |weight            |
+---+------------------+------------------+------------------+------------------+------------------+
|a  |null              |null              |8.7129970154072   |5.4881350392732475|0.0               |
|a  |null              |null              |2.021839744032572 |7.151893663724195 |1.0               |
|a  |null              |null              |83.2619845547938  |6.027633760716439 |2.0               |
|a  |null              |null              |77.81567509498505 |5.448831829968968 |3.0               |
|a  |null              |null              |87.00121482468191 |4.236547993389047 |4.0               |
|b  |null              |97.8618342232764  |6.458941130666561 |null              |5.0               |
|b  |null              |79.91585642167236 |4.375872112626925 |null              |6.0               |
|b  |null              |46.147936225293186|8.917730007820797 |null              |7.0               |
|b  |null              |78.05291762864555 |9.636627605010293 |null              |8.0               |
|b  |null              |11.827442586893323|3.8344151882577773|null              |9.0               |
|c  |63.99210213275238 |null              |10.0              |null              |7.917250380826646 |
|c  |14.33532874090464 |null              |11.0              |null              |5.288949197529044 |
|c  |94.46689170495839 |null              |12.0              |null              |5.680445610939323 |
|c  |52.18483217507166 |null              |13.0              |null              |9.25596638292661  |
|c  |41.46619399905236 |null              |14.0              |null              |0.7103605819788694|
+---+------------------+------------------+------------------+------------------+------------------+
It is also possible to use a udf, although it is less efficient:

from toolz import concat, unique
from pyspark.sql.types import *
from pyspark.sql.functions import udf

channel_dict = (sensor_channel_df
    .groupby("id")
    .apply(lambda grp: dict(zip(grp["sensor"], grp["channel"])))
    .to_dict())

def remap(d):
    fields = sorted(unique(concat(_.keys() for _ in d.values())))
    schema = StructType([StructField(f, DoubleType()) for f in fields])
    def _(row, id):
        return tuple(float(row[d[id].get(f)]) if d[id].get(f) is not None 
                     else None for f in fields)
    return udf(_, schema)

(df
    .withColumn("vals", remap(channel_dict)(data_cols, "id"))
    .select("id", "vals.*"))
+---+------------------+------------------+------------------+------------------+------------------+
|id |acceleration      |speed             |temp              |torque            |weight            |
+---+------------------+------------------+------------------+------------------+------------------+
|a  |null              |null              |8.7129970154072   |5.4881350392732475|0.0               |
|a  |null              |null              |2.021839744032572 |7.151893663724195 |1.0               |
|a  |null              |null              |83.2619845547938  |6.027633760716439 |2.0               |
|a  |null              |null              |77.81567509498505 |5.448831829968968 |3.0               |
|a  |null              |null              |87.00121482468191 |4.236547993389047 |4.0               |
|b  |null              |97.8618342232764  |6.458941130666561 |null              |5.0               |
|b  |null              |79.91585642167236 |4.375872112626925 |null              |6.0               |
|b  |null              |46.147936225293186|8.917730007820797 |null              |7.0               |
|b  |null              |78.05291762864555 |9.636627605010293 |null              |8.0               |
|b  |null              |11.827442586893323|3.8344151882577773|null              |9.0               |
|c  |63.99210213275238 |null              |10.0              |null              |7.917250380826646 |
|c  |14.33532874090464 |null              |11.0              |null              |5.288949197529044 |
|c  |94.46689170495839 |null              |12.0              |null              |5.680445610939323 |
|c  |52.18483217507166 |null              |13.0              |null              |9.25596638292661  |
|c  |41.46619399905236 |null              |14.0              |null              |0.7103605819788694|
+---+------------------+------------------+------------------+------------------+------------------+
In Spark 2.3 or later you can also apply your current pandas code directly with a vectorized (grouped map) pandas_udf.
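
A rough sketch of that route (the schema construction, the casts, and the rename_group name are my own, reusing channel_rename_dict and sensors from above):

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

out_schema = StructType(
    [StructField("id", StringType())]
    + [StructField(s, DoubleType()) for s in sensors]
)

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def rename_group(pdf):
    # pdf holds the pandas rows for a single id; reuse the rename logic from the question
    id_ = pdf["id"].iloc[0]
    out = pdf.rename(columns=channel_rename_dict[id_]).reindex(columns=["id"] + sensors)
    out[sensors] = out[sensors].astype("float64")
    return out

df.groupby("id").apply(rename_group)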


* To see what is going on, take a single group and build the pieces by hand:

grp = sensor_channel_df.groupby("id").get_group("a")

keys = list(map(lit, grp["sensor"]))     # literal sensor names
values = list(map(col, grp["channel"]))  # channel columns

df_ = df.drop_duplicates(subset=["id"])

df_.select(keys).show()
df_.select(values).show(3)

# Interleaving keys and values gives one sensor -> channel map per group
mapping = create_map(*interleave([keys, values]))
df_.select(mapping).show(3, False)

# channel_map nests these per-id maps under the id key
channel_map
# Column<b'map(a, map(weight, chan1, torque, chan2, temp, chan3), b, map(weight, chan1, temp, chan2, speed, chan3), c, map(temp, chan1, weight, chan2, acceleration, chan3))'>
df_.select(channel_map.alias("channel_map")).show(3, False)
df_.select(channel_map[col("id")].alias("data_mapping")).show(3, False)
df_.select(channel_map[col("id")]["weight"].alias("weight")).show(3, False)