In Hive, I have a QA dataset sorted by timestamp (ID, time, content, role). How can I convert it to a format like (ID, roleA, roleB)?

I would like to output data like this:

    ID      roleA                           role B
    xxx     is customer service?            yes, how can i help you, how can i help you
    xxx     is customer service?            yes
    xxx     great, why this happens         wait a minute, let me check

I don't know how to solve this using SQL or Python.

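This is a gaps-and-islands problem combined with conditional aggregation. First, collapse each run of consecutive rows from the same role into a single row: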

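    -- One output row per "island": a run of consecutive messages from the same role.
    select biz_id, send_role, min(create_time) as create_time,
           concat_ws(' ', collect_list(content)) as content  -- separator comes first in concat_ws
    from (select t.*,
                 row_number() over (partition by biz_id order by create_time) as seqnum,
                 row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
          from t
         ) t
    -- seqnum - seqnum_2 is constant within a run of consecutive rows from one role
    group by biz_id, send_role, (seqnum - seqnum_2);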
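
To see why grouping by `seqnum - seqnum_2` isolates each run of consecutive rows, here is a minimal Python sketch that mimics the two `row_number()` calls for a single `biz_id`. The list `roles` is hypothetical sample data (role codes in `create_time` order), and `island_keys` is an illustrative helper, not part of the query.

```python
from collections import defaultdict

# Hypothetical role codes for one biz_id, already sorted by create_time.
roles = [2, 3, 3, 2, 3, 2, 2]


def island_keys(roles):
    """Return (role, seqnum - seqnum_2) per row, mirroring the two row_number() calls."""
    per_role_count = defaultdict(int)
    keys = []
    for seqnum, role in enumerate(roles, start=1):  # row_number() over the whole partition
        per_role_count[role] += 1
        seqnum_2 = per_role_count[role]             # row_number() within the same role
        keys.append((role, seqnum - seqnum_2))
    return keys


keys = island_keys(roles)
for role, key in zip(roles, keys):
    print(role, key)
```

Rows that share the same `(role, difference)` pair belong to the same island: the difference only changes when a row from another role appears in between.
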
You can then re-aggregate the result of that first query to get the output you want:

    with x as (
          -- Step 1: one row per island (run of consecutive messages from one role)
          select biz_id, send_role, min(create_time) as create_time,
                 concat_ws(' ', collect_list(content)) as content
          from (select t.*,
                       row_number() over (partition by biz_id order by create_time) as seqnum,
                       row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
                from t
               ) t
          group by biz_id, send_role, (seqnum - seqnum_2)
         )
    -- Step 2: put the n-th island of each role side by side
    select biz_id,
           max(case when send_role = 2 then content end),
           max(case when send_role = 3 then content end)
    from (select x.*,
                 row_number() over (partition by biz_id, send_role order by create_time) as seqnum
          from x
         ) x
    group by biz_id, seqnum;

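Note: within each merged group, the content of "adjacent" rows may be concatenated in an arbitrary order. Getting it into the "right" order is tricky. In your sample data the date/time values are identical, so there is no obvious column to order by.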
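
Since the question also mentions Python, the same two steps can be sketched outside Hive, where the concatenation order is easy to control. This is only an illustration under stated assumptions: the rows below are hypothetical, the role codes 2 and 3 are taken from the query, the `", "` separator is chosen to match the desired output, and rows with equal timestamps are assumed to arrive already in conversation order (Python's `sorted` is stable, so that order is kept).

```python
from itertools import groupby, zip_longest

# Hypothetical sample rows: (biz_id, create_time, send_role, content).
rows = [
    ("xxx", 1, 2, "is customer service?"),
    ("xxx", 2, 3, "yes"),
    ("xxx", 3, 3, "how can i help you"),
    ("xxx", 4, 2, "great"),
    ("xxx", 5, 2, "why this happens"),
    ("xxx", 6, 3, "wait a minute"),
    ("xxx", 7, 3, "let me check"),
]


def pivot_turns(rows, role_a=2, role_b=3, sep=", "):
    """Merge consecutive messages per role, then pair the n-th runs of both roles."""
    # Stable sort: rows with equal (biz_id, create_time) keep their input order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    result = []
    for biz_id, conversation in groupby(ordered, key=lambda r: r[0]):
        islands = {role_a: [], role_b: []}
        # Step 1: each run of consecutive rows with the same role is one island.
        for role, run in groupby(conversation, key=lambda r: r[2]):
            if role in islands:
                islands[role].append(sep.join(r[3] for r in run))
        # Step 2: pair the n-th island of role_a with the n-th island of role_b;
        # a missing partner becomes None, like NULL in the conditional aggregation.
        for a, b in zip_longest(islands[role_a], islands[role_b]):
            result.append((biz_id, a, b))
    return result


table = pivot_turns(rows)
for row in table:
    print(row)
```

Because `itertools.groupby` only groups adjacent items, it plays the role of the `seqnum - seqnum_2` key here, and joining each run in iteration order keeps the messages in the order they were read.
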

Beautiful! I didn't realize there was such a thing as the "gaps and islands problem". Thank you very much!