Python: Is there a way to condense data into a single row of a DataFrame when rows share a column value?
I have a DataFrame with a few thousand rows. The DF holds unit identifiers and response times for units within an organization. It is structured with the columns ["Event#", "UnitID", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"].

There are many different rows for the same Event#. In the end I want only one row per Event#, with ["First UnitEnroute", "First UnitArrived", "First UnitAtHospital"] filled in from the other rows that share the same event.

The reason for this is a billing failure at the end of a quarter: we need to know whether these 3 times were spread across different units for each event. I do not need to list the units, though; I only need to pull the first non-0 value from the other rows of the same event.

Here is some sample data:
Event# Unit First UnitEnroute First UnitArrived First UnitAtHospital
2020000394 37 ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000394 38 ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000394 36 ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000394 39 ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000617 58 ['1/2/2020', '12:06:13 PM'] ['1/2/2020', '12:07:39 PM'] ['1/2/2020', '12:43:10 PM']
2020000849 74 ['1/2/2020', '6:42:19 PM'] ['1/2/2020', '6:53:53 PM'] ['1/2/2020', '7:28:32 PM']
2020000849 75 ['0'] ['0'] ['0']
2020000927 81 ['0'] ['0'] ['0']
2020000927 80 ['0'] ['0'] ['0']
2020000997 86 ['0'] ['0'] ['0']
2020000997 87 ['0'] ['0'] ['0']
2020001218 99 ['1/3/2020', '11:50:39 AM'] ['1/3/2020', '11:52:40 AM'] ['1/3/2020', '12:29:37 PM']
2020001218 98 ['0'] ['1/3/2020', '11:52:40 AM'] ['0']
2020001255 102 ['1/3/2020', '12:44:30 PM'] ['0'] ['0']
2020001255 103 ['1/3/2020', '12:40:19 PM'] ['0'] ['0']
2020001258 98 ['1/3/2020', '12:49:00 PM'] ['1/3/2020', '12:57:22 PM'] ['1/3/2020', '1:39:03 PM']
2020001258 103 ['0'] ['0'] ['0']
2020001258 104 ['0'] ['0'] ['0']
2020001258 105 ['0'] ['0'] ['0']
This is what I have tried:
for row in DF:
    compare = list()
    for i in DF:
        if i[0] == row[0]:
            addition = list(i)
            compare = compare.append(addition)
            print("Compare: {}".format(compare))
            return compare
    for el in row.index:
        whatisit = row[el]
        if whatisit == 0:
            for item in compare.index:
                if item[el] == 0:
                    return
                else:
                    replacement = item[el]
                    print("Replacement: {}".format(replacement))
                    return replacement
            row[el] = replacement
return DF
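Any direction is appreciated, and I apologize if this has been posted before; I spent a considerable amount of time looking for a potential answer. I think I have not yet developed the intuition I need to look at any code and see how to apply it to my project. I am not a professional developer; I am more of a hands-on employee doing the heavy lifting, haha.

Here is one way, IIUC: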
from io import StringIO
import pandas as pd

# create data frame (`data` holds the raw text of the table)
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')

# drop the column `Unit`
df = df.drop(columns='Unit')

# re-shape from wide to long: one row per (Event#, first_unit) pair
df = df.melt(id_vars='Event#', var_name='first_unit', value_name='timestamp')

# drop rows where timestamp == ['0']
mask = df['timestamp'].astype(str) != "['0']"
df = df[mask]

# drop duplicates
df = df.drop_duplicates()

# get min value for each group -- and re-shape
# (the timestamps are strings here, so min() compares them as text)
df = (df.groupby(['Event#', 'first_unit'])['timestamp'].min()
        .unstack(level='first_unit')
        .reset_index()
     )

print(df)
first_unit Event# First UnitArrived \
0 2020000394 ['1/1/2020', '10:48:33 PM']
1 2020000617 ['1/2/2020', '12:07:39 PM']
2 2020000849 ['1/2/2020', '6:53:53 PM']
3 2020001218 ['1/3/2020', '11:52:40 AM']
4 2020001255 NaN
5 2020001258 ['1/3/2020', '12:57:22 PM']
first_unit First UnitAtHospital First UnitEnroute
0 ['1/1/2020', '11:45:01 PM'] ['1/1/2020', '10:45:34 PM']
1 ['1/2/2020', '12:43:10 PM'] ['1/2/2020', '12:06:13 PM']
2 ['1/2/2020', '7:28:32 PM'] ['1/2/2020', '6:42:19 PM']
3 ['1/3/2020', '12:29:37 PM'] ['1/3/2020', '11:50:39 AM']
4 NaN ['1/3/2020', '12:40:19 PM']
5 ['1/3/2020', '1:39:03 PM'] ['1/3/2020', '12:49:00 PM']
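One caveat about the `min()` step: the cells are strings, so the minimum is lexicographic rather than chronological. The sketch below shows one possible way to parse the cells into real timestamps first. The helper name `parse_stamp` and the two morning times are made up for illustration, and the date format is assumed from the sample data.

```python
import ast
import pandas as pd

def parse_stamp(cell):
    """Parse "['1/3/2020', '9:05:00 AM']" into a Timestamp; "['0']" becomes NaT."""
    parts = ast.literal_eval(cell)  # the cell text is a Python-style list literal
    if len(parts) != 2:
        return pd.NaT
    return pd.to_datetime(" ".join(parts), format="%m/%d/%Y %I:%M:%S %p")

# Illustrative cells (not taken from the question's data).
cells = pd.Series([
    "['1/3/2020', '9:05:00 AM']",
    "['1/3/2020', '10:00:00 AM']",
    "['0']",
])
stamps = pd.to_datetime(cells.map(parse_stamp))

print(cells.iloc[:2].min())  # text comparison: '1' sorts before '9', so 10:00 AM wins
print(stamps.min())          # chronological comparison: 9:05 AM wins, NaT is skipped
```

Applying such a parser to the `timestamp` column before the `groupby` would make `min()` return the earliest time for each event.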
Here is the raw data (i.e., what was used to create the data frame):
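A minimal sketch of a `data` string that the `read_csv` call above can parse, assuming the columns are separated by two or more spaces (only a few rows from the question's sample are shown, and the exact spacing is an assumption):

```python
from io import StringIO
import pandas as pd

# A few rows from the question's sample; two spaces between columns (assumed layout).
data = """\
Event#  Unit  First UnitEnroute  First UnitArrived  First UnitAtHospital
2020000849  74  ['1/2/2020', '6:42:19 PM']  ['1/2/2020', '6:53:53 PM']  ['1/2/2020', '7:28:32 PM']
2020000849  75  ['0']  ['0']  ['0']
2020001255  102  ['1/3/2020', '12:44:30 PM']  ['0']  ['0']
2020001255  103  ['1/3/2020', '12:40:19 PM']  ['0']  ['0']
"""

# The regex separator splits only on runs of 2+ whitespace characters,
# so the single spaces inside each timestamp cell are preserved.
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')
print(df.columns.tolist())
print(df.shape)
```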
Since you have a few thousand rows, I would suggest processing each column separately and then merging them back together:
# for each time column: keep rows with a real timestamp, then the first row per event
df1 = df[df['First UnitEnroute'] != "['0']"][['Event#', 'First UnitEnroute']]
df1 = df1[~df1.duplicated(['Event#'])]
df2 = df[df['First UnitArrived'] != "['0']"][['Event#', 'First UnitArrived']]
df2 = df2[~df2.duplicated(['Event#'])]
df3 = df[df['First UnitAtHospital'] != "['0']"][['Event#', 'First UnitAtHospital']]
df3 = df3[~df3.duplicated(['Event#'])]
# left joins keep every event that is present in df1
df_result = df1.merge(df2, on='Event#', how='left').merge(df3, on='Event#', how='left')
This way (if I understood the question correctly) you can find the events for which one or more of the first-unit statistics have no timestamp. In your example that is event 2020001255.
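Another option, offered here as a sketch rather than as part of either answer: mark the `['0']` cells as missing and let `groupby(...).first()` pick the first non-null value per column within each event. The rows below are taken from the question's sample; the names `rows`, `cols`, `time_cols`, and `result` are illustrative.

```python
import pandas as pd

cols = ["Event#", "Unit", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"]
rows = [
    (2020000849, 74, "['1/2/2020', '6:42:19 PM']", "['1/2/2020', '6:53:53 PM']", "['1/2/2020', '7:28:32 PM']"),
    (2020000849, 75, "['0']", "['0']", "['0']"),
    (2020000927, 81, "['0']", "['0']", "['0']"),
    (2020001218, 99, "['1/3/2020', '11:50:39 AM']", "['1/3/2020', '11:52:40 AM']", "['1/3/2020', '12:29:37 PM']"),
    (2020001218, 98, "['0']", "['1/3/2020', '11:52:40 AM']", "['0']"),
]
df = pd.DataFrame(rows, columns=cols)
time_cols = cols[2:]

# Turn the "['0']" placeholders into NaN so they count as missing values.
df[time_cols] = df[time_cols].mask(df[time_cols].eq("['0']"))

# first() returns the first non-null entry of each column within each group.
result = df.groupby("Event#", as_index=False)[time_cols].first()
print(result)
```

Unlike the left-join version, this keeps events where every unit reported `['0']` (such as 2020000927) as a row of missing values, which may or may not be what the billing check needs.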
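Thanks, this looks like what I need. I am getting a KeyError on 'Event' at the line df = df.melt(id_vars='Event', var_name='first_unit', value_name='timestamp'). Any idea why?

Can you run 'Event#' in df.columns? Are there embedded spaces in the column names? A KeyError probably means that what you specified as id_vars is not in df.columns.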
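The check suggested in that comment could look like the sketch below. The frame is hypothetical and was built with stray whitespace in the header on purpose, to show how such a mismatch can be spotted and repaired.

```python
import pandas as pd

# Hypothetical frame whose header picked up stray whitespace during import.
df = pd.DataFrame({" Event# ": [2020000394], "Unit": [37]})

print(df.columns.tolist())      # the list repr makes stray spaces visible
print("Event#" in df.columns)   # membership test against the column labels

# Strip leading/trailing whitespace from every column label, then re-check.
df.columns = df.columns.str.strip()
print("Event#" in df.columns)
```

If the membership test is still false after stripping, comparing the printed labels character by character against the name passed to `id_vars` (for example a missing `#`) is the next thing to try.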