Stack, unstack, melt, pivot, transpose? What is a simple method to convert multiple columns into rows (PySpark or pandas)?


My work environment mostly uses PySpark, but from some Googling, transposing in PySpark looks complicated. I would like to keep it in PySpark, but if it is much easier in pandas I will convert the Spark DataFrame to a pandas DataFrame. The dataset is not large enough for performance to be a concern.

I want to transpose a DataFrame with multiple columns into rows:

Input:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Record          Hospital       Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4  
     1         Red Cross        1234 Street 429  Effective  Effective     Normal  Effective    
     2  Alberta Hospital   553 Alberta Road 441   Effecive     Normal     Normal  Effective
     3  General Hospital  994 Random Street 923     Normal  Effective     Normal  Effective
Output:

    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective
Looking at a PySpark example, it is complicated:

Looking at a pandas example, it seems much easier. But there are many different Stack Overflow answers, some saying to use pivot, melt, stack, or unstack, and in the end it is even more confusing.

So if anyone has a simple way to do this in PySpark, I am all ears. If not, I will happily accept pandas answers.


Thank you very much for your help.

Here is the pandas approach using stack:

df_final =  (df.set_index(['Record', 'Hospital', 'Hospital Address'])
               .stack(dropna=False)
               .rename('Value')
               .reset_index()
               .rename({'level_3': 'Name'},axis=1)
               .assign(Record=lambda x: x.index+1))

Out[120]:
    Record          Hospital       Hospital Address       Name       Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective
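
Since the data starts out in Spark, the round trip for this approach is just a toPandas conversion and back. A minimal sketch, assuming an active SparkSession named spark and a Spark DataFrame sdf with the same columns as above:

pdf = sdf.toPandas()  # the dataset is small, so collecting it into pandas is acceptable
pdf_long = (pdf.set_index(['Record', 'Hospital', 'Hospital Address'])
               .stack(dropna=False)
               .rename('Value')
               .reset_index()
               .rename({'level_3': 'Name'}, axis=1)
               .assign(Record=lambda x: x.index + 1))
result = spark.createDataFrame(pdf_long)  # back to Spark if the rest of the pipeline needs it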

You can also use .melt and specify id_vars. Everything else will be treated as value_vars. The number of value_vars columns you have multiplies the number of rows of the DataFrame by that amount, stacking all of the information from the four medicine columns into a single column and duplicating the id_vars columns into the required format:

DataFrame setup:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Code:
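
A minimal melt call along these lines (the var_name and value_name labels Name and Value are assumptions chosen to match the desired output):

df_long = df.melt(
    id_vars=['Record', 'Hospital', 'Hospital Address'],  # columns kept as identifiers
    var_name='Name',     # former column names (Medicine_1 ... Medicine_4)
    value_name='Value')  # former cell values
print(df_long)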


This is also quite simple/easy to do using PySpark.

Generating the query dynamically when there is a large number of columns:

The main idea here is to build the stack(n, 'col1', col1, 'col2', col2, ...) expression dynamically. We can leverage Python string formatting to create that dynamic string.
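Written out by hand for the four medicine columns, the generated expression would look roughly like the sketch below (the output column aliases MedicinName and Effectiveness are taken from the output shown further down):

df.selectExpr(
    "Hospital",
    "`Hospital Address`",
    "stack(4, 'Medicine_1', Medicine_1, 'Medicine_2', Medicine_2, "
    "'Medicine_3', Medicine_3, 'Medicine_4', Medicine_4) as (MedicinName, Effectiveness)"
).show()

With 20+ medicine columns this gets tedious, which is what the dynamic version below automates.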

index_cols = ["Hospital", "Hospital Address"]
drop_cols = ["Record"]
# Select all columns which need to be pivoted down
pivot_cols = [c for c in df.columns if c not in index_cols + drop_cols]
# Create a dynamic stack expression; here we generate stack(4, '{0}', {0}, '{1}', {1}, ...)
# "'{0}',{0},'{1}',{1}".format('Medicine_1', 'Medicine_2') == "'Medicine_1',Medicine_1,'Medicine_2',Medicine_2"
# which is similar to the hand-written version above
stackexpr = ("stack(" + str(len(pivot_cols)) + ","
             + ",".join(["'{" + str(i) + "}',{" + str(i) + "}" for i in range(len(pivot_cols))])
             + ") as (MedicinName,Effectiveness)")
# Backtick the index columns so that names containing spaces parse as single columns
df.selectExpr(*[f"`{c}`" for c in index_cols], stackexpr.format(*pivot_cols)).show()
Output:

+----------------+--------------------+-----------+-------------+
|        Hospital|    Hospital Address|MedicinName|Effectiveness|
+----------------+--------------------+-----------+-------------+
|       Red Cross|     1234 Street 429| Medicine_1|    Effective|
|       Red Cross|     1234 Street 429| Medicine_2|    Effective|
|       Red Cross|     1234 Street 429| Medicine_3|       Normal|
|       Red Cross|     1234 Street 429| Medicine_4|    Effective|
|Alberta Hospital|553 Alberta Road 441| Medicine_1|     Effecive|
|Alberta Hospital|553 Alberta Road 441| Medicine_2|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_3|       Normal|
|Alberta Hospital|553 Alberta Road 441| Medicine_4|    Effective|
|General Hospital|994 Random Street...| Medicine_1|       Normal|
|General Hospital|994 Random Street...| Medicine_2|    Effective|
|General Hospital|994 Random Street...| Medicine_3|       Normal|
|General Hospital|994 Random Street...| Medicine_4|    Effective|
+----------------+--------------------+-----------+-------------+

Comments:

IIUC, you can use explode in pyspark.

Hi, I have removed the images and edited your question, but in the future please see:

Hi Andy, can you explain why you chose stack instead of pivot/melt in this case? @Anonymous: pivot turns values into the index and columns. Yours is turning columns into values, so we cannot use pivot. melt is a possible candidate. However, melt will melt the columns in the order Medicine_1, Medicine_1, Medicine_1, Medicine_2, Medicine_2, ... You would need a sort_values to make it Medicine_1, Medicine_2, Medicine_3, Medicine_4, ... stack returns Medicine_1, Medicine_2, Medicine_3, Medicine_1, Medicine_2, Medicine_3, ... right away. That is why I chose stack. @Anonymous: if you want sequential values in the Record column, just chain an additional assign to turn it into a sequence, as in my updated answer.

Hi David, can you explain why you chose to use melt instead of stack/pivot in this scenario? Thanks. @Anonymous you can reach the same solution with several methods, as you can see with Andy's stack. I just thought melt could be slightly cleaner, but Andy's stack answer works too. It is 3 operations instead of 5, so it is more concise and perhaps more efficient.

Hey Venki, thanks for letting me know about the stack function in PySpark. In my actual DataFrame there are around 20+ medicines, so writing out 'Medicine_1', Medicine_1, ... takes a long time. Is it possible to skip the aliases and pass a list instead? @Anonymous I have updated my answer to handle a large number of columns and generate the expression dynamically. It would be cleaner/simpler if stack were a pyspark function rather than a SQL function.
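
To make the ordering point from the comments concrete, here is a small sketch (not taken from any of the answers above) of melt followed by the sort_values that restores the per-record order stack gives directly; the sort keys are an assumption:

# melt emits all Medicine_1 rows first, then all Medicine_2 rows, and so on;
# sorting by Record (and Name) interleaves the medicines per hospital record,
# which is the order stack produces without an extra step.
df_sorted = (df.melt(id_vars=['Record', 'Hospital', 'Hospital Address'],
                     var_name='Name', value_name='Value')
               .sort_values(['Record', 'Name'])
               .reset_index(drop=True))
print(df_sorted)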