Python: How do I return multiple rows from an apply function in pandas?

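Context

I have a dataframe of transcripts. Each row in the df has a unique ID, a transcript line, and a timestamp, and each ID can have multiple correspondences within a day (or across several days).

Sample code below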

What I have:

# What I am starting out with. The df is ordered by CustomerID and Timestamp.
pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'],
    'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54',
                  '2018-01-22 12:18:54', '2018-01-22 12:22:54'],
    'Transcript_Line': ['How can I help you?',
                        'I need help with this pandas problem...',
                        'Did you get that problem resolved?',
                        'Nope I still suck at pandas'],
})

What I need:

# This is the final result
pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-22'],
    'Transcript_Line': ['How can I help you?\nI need help with this pandas problem...',
                        'Did you get that problem resolved?\nNope I still suck at pandas'],
})

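I need to organize and merge all the transcripts (the strings in each row) that correspond to the same day, in order.

This is what I have tried so far. The problem lies here: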

def concatConvos(x):
    if len(set(x.Date)) == 1:
        return pd.Series({'Email': x['CustomerID'].values[0],
                          'Date': x['Date'].values[0],
                          'Conversation': '\n'.join(x['Transcript_Line'])})
    else:
        rows = []
        for date in set(x.Date):
            rows.append(pd.Series({'Email': x['CustomerID'].values[0],
                                   'Date': date,
                                   'Conversation': '\n'.join(x[x.Date == date].Transcript_Line)}))
        return tuple(rows)

data3 = data2.groupby('CustomerID').apply(concatConvos)

I can get this to work when a customer has only one correspondence date (meaning he did not reach out more than once, the first case).
If I try to handle a case with more than one date, I most likely get an attribute error, because the function returns multiple Series objects.

Is there an easier way to do this?
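On the narrower point of returning several rows from apply: one option is to have the function return a DataFrame instead of a tuple of Series, since apply can stack DataFrames from each group. The following is only a minimal sketch under assumptions. It rebuilds the sample frame under the name data2, the helper name concat_convos is made up for illustration, and the customer ID is recovered from the group key in the index rather than read inside the function.

```python
import pandas as pd

# Stand-in for data2 from the question, using the sample values shown above.
data2 = pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'],
    'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54',
                  '2018-01-22 12:18:54', '2018-01-22 12:22:54'],
    'Transcript_Line': ['How can I help you?',
                        'I need help with this pandas problem...',
                        'Did you get that problem resolved?',
                        'Nope I still suck at pandas'],
})


def concat_convos(group):
    # Hypothetical helper: one output row per distinct date in this group.
    convo = group.groupby('Date')['Transcript_Line'].agg('\n'.join)
    # Returning a DataFrame (not a tuple of Series) lets apply stack the rows.
    return convo.rename('Conversation').reset_index()


data3 = (
    data2.sort_values(['CustomerID', 'Timestamp'])
         .groupby('CustomerID')[['Date', 'Transcript_Line']]
         .apply(concat_convos)
         .reset_index(level=0)                      # group key -> CustomerID column
         .rename(columns={'CustomerID': 'Email'})   # column name used in the question
         .reset_index(drop=True)
)
print(data3)
```

Each customer then contributes as many rows as it has distinct dates, whether that is one or several, so the single-date special case is no longer needed.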
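This is not the prettiest solution, nor the most efficient one, but I have used something similar in the past. I am sure there is a more efficient solution that avoids the loop. I will give you the raw code and then break it down step by step: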

transcript_join = df.groupby(['CustomerID', 'Date']).apply(lambda f: f['Transcript_Line'].values.tolist()).to_dict()

for x in transcript_join.keys():
    df.loc[(df['CustomerID'] == x[0]) & (df['Date'] == x[1]), 'Combine'] = '\n'.join(transcript_join.get(x))

df.drop_duplicates(df.iloc[:, [0, 1, 2, 5]])

# output below
#    AgentID  CustomerID        Date            Timestamp                     Transcript_Line                                            Combine
# 0        0           1  2018-01-21  2018-01-21 16:28:54                 How can I help you?  How can I help you?\nI need help with this pan...
# 2        0           1  2018-01-22  2018-01-22 12:18:54  Did you get that problem resolved?  Did you get that problem resolved?\nNope I sti...

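First, I create a dictionary of all the responses in the variable transcript_join. The key is the customer ID followed by the date. The value is the list of transcript lines.

Then I loop over the keys, find the rows where the customer ID and date match the dictionary key, and use .join to combine the transcript lines into a new column.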

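Finally, I drop the duplicates, since every row for a given customer ID and date pair now holds the same Combine value. I use iloc to pick the columns to check for duplicates, leaving out ones such as the original Transcript_Line column and Timestamp.

You should be able to do this with groupby. Here is your original dataframe; I named it df for convenience.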

df = pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'],
    'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54',
                  '2018-01-22 12:18:54', '2018-01-22 12:22:54'],
    'Transcript_Line': ['How can I help you?',
                        'I need help with this pandas problem...',
                        'Did you get that problem resolved?',
                        'Nope I still suck at pandas'],
})

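I am a bit unclear on whether you need to sort on both AgentID and CustomerID or on just one of them, but hopefully you can see how to modify it.
The initial sort guarantees that the Transcript_Line values will be in order. The groupby then finds every set of statements with the same AgentID and CustomerID on the same day. as_index=False gives you the right column layout in the output. The output you want is the combined transcript lines, which you can get with sum. Note that sum concatenates the strings directly, with no separator between them.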

df.sort_values(by=['AgentID', 'CustomerID', 'Timestamp']).groupby(['AgentID', 'CustomerID', 'Date'], as_index=False)['Transcript_Line'].sum()
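
To make the as_index=False remark concrete, here is a small sketch on the same sample data (rebuilt so the block runs on its own). The names keys, ordered, keyed and flat are illustrative only. It compares the default result, where the group keys end up in the index, with the as_index=False result, where they stay as ordinary columns.

```python
import pandas as pd

# Sample frame from the answer, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'],
    'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54',
                  '2018-01-22 12:18:54', '2018-01-22 12:22:54'],
    'Transcript_Line': ['How can I help you?',
                        'I need help with this pandas problem...',
                        'Did you get that problem resolved?',
                        'Nope I still suck at pandas'],
})

keys = ['AgentID', 'CustomerID', 'Date']
ordered = df.sort_values(by=['AgentID', 'CustomerID', 'Timestamp'])

# Default: the group keys become a MultiIndex and the result is a Series.
keyed = ordered.groupby(keys)['Transcript_Line'].sum()

# as_index=False: the group keys stay as regular columns of a DataFrame.
flat = ordered.groupby(keys, as_index=False)['Transcript_Line'].sum()

print(type(keyed).__name__, list(keyed.index.names))
print(type(flat).__name__, list(flat.columns))
```

The flat form matches the layout of the desired result in the question, which is presumably why the answer uses it.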
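
If you really need the '\n' character between them, you can first append it to each Transcript_Line, run the same groupby as above, and then strip the trailing character from the combined string.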

df['Transcript_Line'] = df['Transcript_Line'] + '\n'
grouped = df.sort_values(by=['AgentID', 'CustomerID', 'Timestamp']).groupby(['AgentID', 'CustomerID', 'Date'], as_index=False)['Transcript_Line'].sum()
grouped['Transcript_Line'] = grouped['Transcript_Line'].apply(lambda x: x[:-1])
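
As a possible alternative to appending '\n' and trimming it afterwards, the join can be done inside the aggregation by passing '\n'.join to agg. This is a sketch on the sample data only, not something measured on the full dataset, and it leaves the original Transcript_Line column unmodified. It uses reset_index() rather than as_index=False to turn the group keys back into columns.

```python
import pandas as pd

# Sample frame from the answer, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'AgentID': 0,
    'CustomerID': 1,
    'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'],
    'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54',
                  '2018-01-22 12:18:54', '2018-01-22 12:22:54'],
    'Transcript_Line': ['How can I help you?',
                        'I need help with this pandas problem...',
                        'Did you get that problem resolved?',
                        'Nope I still suck at pandas'],
})

grouped = (
    df.sort_values(by=['AgentID', 'CustomerID', 'Timestamp'])   # keep lines in time order
      .groupby(['AgentID', 'CustomerID', 'Date'])['Transcript_Line']
      .agg('\n'.join)        # join each day's lines with a newline
      .reset_index()         # group keys back to ordinary columns
)
print(grouped)
```

Because the separator is only placed between lines, there is no trailing '\n' to remove.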

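Could you share a sample of your dataframe that shows your problem and the expected output? There is a lot of text, and some people (me included) may find it hard to follow.
Yeah... give me a second... there is some PII I need to filter out.
I updated it, using a sample of the df from before parsing.
Rather than pictures, paste text. Read this, it might help :)
That is a great edit, keep posting questions like this! If you post this way the first time, you are much more likely to get solid answers :)
Honestly... if someone posts another solution I would really appreciate it, since it would make one of my processes faster. I just wish we had a faster solution. I feel like the groupby approach should work, and that we should be able to get our result through apply.
I am still trying to get it implemented. This dataset has about 500,000 records, so I am trying to parse all of the data in chunks.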