Python - Pandas: aggregating variable-length lists into a tidy dataset
I have the following dataframe, where each row is a string of event names:
0 event_1
1 other_event
2 other_event, other_event, other_event, other_e...
3 event_3, other_event, other_event, other_event...
4 some_event, other_event
5 event_1, event_5, some_event, some_event, some...
6 event_5, event_6, other_event
7 event_1
I want to split each row, aggregate by event name, and create a tidy dataset like this:
+---+--------+------------+--------+-----------+--------+--------+
|id |event_1 |other_event |event_3 |some_event |event_5 |event_6 |
+---+--------+------------+--------+-----------+--------+--------+
|0 |1 |0 |0 |0 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|1 |0 |1 |0 |0 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|2 |0 |4 |0 |0 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|3 |0 |3 |1 |0 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|4 |0 |1 |0 |1 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|5 |1 |0 |0 |3 |1 |0 |
+---+--------+------------+--------+-----------+--------+--------+
|6 |0 |1 |0 |0 |1 |1 |
+---+--------+------------+--------+-----------+--------+--------+
|7 |1 |0 |0 |0 |0 |0 |
+---+--------+------------+--------+-----------+--------+--------+
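I have tried df["events_array"].str.split(","), but I got stuck. Any help would be appreciated :)

A first idea is to use Counter in a list comprehension to build a list of dictionaries, pass it to the DataFrame constructor, replace the missing values, and convert to integers: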
from collections import Counter
import pandas as pd

# Count the events in each row; each distinct event name becomes a column
df = pd.DataFrame([Counter(x.split(", ")) for x in df["events_array"]]).fillna(0).astype(int)
print(df)
event_1 other_event event_3 some_event event_5 event_6
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 4 0 0 0 0
3 0 3 1 0 0 0
4 0 1 0 1 0 0
5 1 0 0 3 1 0
6 0 1 0 0 1 1
7 1 0 0 0 0 0
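The Counter approach can be tried end to end with a small self-contained sketch. The rows in the question are truncated, so the sample strings below are made up for illustration, and the names sample, row_counts, and counts are hypothetical. Passing index=sample.index is an optional addition that keeps the original row labels.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data (the question's rows are truncated).
sample = pd.DataFrame(
    {
        "events_array": [
            "event_1",
            "other_event, other_event",
            "event_3, other_event, some_event",
        ]
    }
)

# One Counter per row: event name -> number of occurrences in that row.
row_counts = [Counter(s.split(", ")) for s in sample["events_array"]]

# Events missing from a row come out as NaN, so fill with 0 and cast to int.
counts = (
    pd.DataFrame(row_counts, index=sample.index)
    .fillna(0)
    .astype(int)
)
print(counts)
```

Note that the separator is ", " (comma plus space). Splitting on "," alone would leave a leading space on every event after the first, so " other_event" and "other_event" would end up in different columns.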
Alternatively, create a DataFrame with Series.str.split and expand=True, then count the values in each row with value_counts inside apply:
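# pd.Series.value_counts is used here because the top-level pd.value_counts
# is deprecated in recent pandas releases
df = (df["events_array"].str.split(", ", expand=True)
        .apply(pd.Series.value_counts, axis=1)
        .fillna(0)
        .astype(int)
     )
print(df)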
event_1 event_3 event_5 event_6 other_event some_event
0 1 0 0 0 0 0
1 0 0 0 0 1 0
2 0 0 0 0 4 0
3 0 1 0 0 3 0
4 0 0 0 0 1 1
5 1 0 1 0 0 3
6 0 0 1 1 1 0
7 1 0 0 0 0 0
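A third option, not part of the answer above, is to explode the split lists into one row per event and cross-tabulate. This is only a sketch under assumptions: the sample data is made up, the names sample, exploded, and tidy are hypothetical, and Series.explode requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data (the question's rows are truncated).
sample = pd.DataFrame(
    {
        "events_array": [
            "event_1",
            "other_event, other_event",
            "event_3, other_event, some_event",
        ]
    }
)

# Split each string into a list, then emit one row per event;
# the original row label is repeated in the index.
exploded = sample["events_array"].str.split(", ").explode()

# Count how often each event name occurs for each original row label.
tidy = pd.crosstab(
    exploded.index.to_numpy(),
    exploded.to_numpy(),
    rownames=["id"],
    colnames=["event"],
)
print(tidy)
```

As with the value_counts variant, the event columns come out sorted alphabetically rather than in order of first appearance.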
Thanks, this is exactly what I was looking for :) Could you take a look here: ?