Stack, unstack, melt, pivot, transpose? What is a simple method to convert multiple columns into rows (PySpark or Pandas)?
My work environment mostly uses PySpark, but from Googling around, transposing in PySpark looks very complex. I would like to keep it in PySpark, but if it is much easier to do in Pandas, I will convert the Spark DataFrame to a Pandas DataFrame. The dataset is not so large that I think performance would be a concern.

I would like to convert a DataFrame with multiple columns into rows:

Input:
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Record Hospital Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4
1 Red Cross 1234 Street 429 Effective Effective Normal Effective
2 Alberta Hospital 553 Alberta Road 441 Effecive Normal Normal Effective
3 General Hospital 994 Random Street 923 Normal Effective Normal Effective
Output:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
Looking at the PySpark examples, it is complicated.

The Pandas examples look much easier. But there are many different Stack Overflow answers, with some saying to use pivot, melt, stack, or unstack, and the result is even more confusing.

So if anyone has a simple method of doing this in PySpark, I am all ears. If not, I will happily accept an answer.

Thank you very much for your help.

Here is Pandas using stack:
df_final = (df.set_index(['Record', 'Hospital', 'Hospital Address'])
.stack(dropna=False)
.rename('Value')
.reset_index()
.rename({'level_3': 'Name'},axis=1)
.assign(Record=lambda x: x.index+1))
Out[120]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
You can also use .melt and specify id_vars. Everything else will be considered value_vars. The number of value_vars columns you have will multiply the number of rows in the DataFrame by that number, stacking the information from all four columns into one column and duplicating the id_vars columns into the required format:
DataFrame setup:
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Code:
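The melt call described above can be sketched as follows. This is a minimal sketch, not the answerer's original code; the sort_values/assign steps are my own additions to match the row order and sequential Record numbering of the desired output:

```python
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

# melt keeps id_vars as-is and stacks every other column into Name/Value pairs
df_final = (df.melt(id_vars=['Record', 'Hospital', 'Hospital Address'],
                    var_name='Name', value_name='Value')
              .sort_values(['Record', 'Name'])        # regroup the rows per hospital
              .reset_index(drop=True)
              .assign(Record=lambda x: x.index + 1))  # renumber Record 1..12
print(df_final)
```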
It is also simple/easy to do with PySpark.

Generating the query dynamically for the case of a large number of columns:

The main idea here is to create the stack(x, a, b, c, ...) expression dynamically. We can make use of Python string formatting to build the dynamic string.
index_cols = ["Hospital", "Hospital Address"]
drop_cols = ['Record']

# Select all columns which need to be pivoted down
pivot_cols = [c for c in df.columns if c not in index_cols + drop_cols]

# Create a dynamic stack expression; here we generate stack(4, '{0}', {0}, '{1}', {1}, ...)
# "'{0}',{0},'{1}',{1}".format('Medicine_1', 'Medicine_2') == "'Medicine_1',Medicine_1,'Medicine_2',Medicine_2"
# which is similar to what we had previously
stackexpr = ("stack(" + str(len(pivot_cols)) + ","
             + ",".join(["'{" + str(i) + "}',{" + str(i) + "}" for i in range(len(pivot_cols))])
             + ") as (MedicinName, Effectiveness)")

# df is a Spark DataFrame here; backticks protect the space in "Hospital Address"
df.selectExpr(*["`{}`".format(c) for c in index_cols], stackexpr.format(*pivot_cols)).show()
Output:
+----------------+--------------------+-----------+-------------+
| Hospital| Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
| Red Cross| 1234 Street 429| Medicine_1| Effective|
| Red Cross| 1234 Street 429| Medicine_2| Effective|
| Red Cross| 1234 Street 429| Medicine_3| Normal|
| Red Cross| 1234 Street 429| Medicine_4| Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1| Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3| Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4| Effective|
|General Hospital|994 Random Street...| Medicine_1| Normal|
|General Hospital|994 Random Street...| Medicine_2| Effective|
|General Hospital|994 Random Street...| Medicine_3| Normal|
|General Hospital|994 Random Street...| Medicine_4| Effective|
+----------------+--------------------+-----------+-------------+
IIUC, you can use explode in PySpark.

Hi, I have removed the images and edited your question, but in the future please see:

Hi Andy, can you explain why you chose to use stack instead of pivot/melt in this scenario? @Anonymous: pivot converts values into an index and columns. Yours converts columns into values, so we cannot use pivot. melt is a possible candidate. However, melt will melt the columns in the order Medicine_1, Medicine_1, Medicine_1, Medicine_2, Medicine_2, ... You would need a sort_values to make it Medicine_1, Medicine_2, Medicine_3, Medicine_4, ..., while stack returns Medicine_1, Medicine_2, Medicine_3, Medicine_1, Medicine_2, Medicine_3, ... right away. That is why I chose stack.

@Anonymous: if you want sequential values in the Record column, just chain an extra assign to turn it into a sequence, as in my updated answer.

Hi David, can you explain why you chose to use melt instead of stack/pivot in this scenario? Thanks. @Anonymous You can get the same solution using multiple methods, as you can see in Andy's stack answer. I just think melt can be slightly cleaner, but Andy's stack answer works too. It is 3 operations instead of 5, so it is more concise and perhaps more performant.

Hey Venki, thanks for letting me know about the stack function in PySpark. In my actual DataFrame there are around 20+ medicines, so writing out 'Medicine_1', Medicine_1, ... takes a long time. Is it possible to skip the alias and pass a list instead? @Anonymous I have updated my answer to handle a large number of columns and generate the expression dynamically. It would be cleaner/simpler if stack were a PySpark function instead of a SQL function.