In Hive, I have a QA dataset ordered by timestamp (ID, time, content, role). How can I transform it into a format like (ID, roleA, roleB)?
I want to output data like the following:
ID   roleA                    roleB
xxx  is customer service?     yes, how can i help you, how can i help you
xxx  is customer service?     yes
xxx  great, why this happens  wait a minute, let me check
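I don't know how to solve this with SQL or Python.
This is a gaps-and-islands problem followed by conditional aggregation. Within each run of consecutive messages from the same role, the difference between the two row numbers below stays constant, so it can serve as a grouping key for that run: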
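select biz_id, send_role, min(create_time) as create_time,
       -- Hive's concat_ws takes the separator first, then the array
       concat_ws(' ', collect_list(content)) as content
from (select t.*,
             -- position of the row within the whole conversation
             row_number() over (partition by biz_id order by create_time) as seqnum,
             -- position of the row among rows with the same role
             row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
      from t
     ) t
-- the difference is constant within each run of consecutive same-role rows
group by biz_id, send_role, (seqnum - seqnum_2);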
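To see why the row-number difference identifies each run, here is a minimal Python sketch of the same idea. The sample rows, the integer timestamps, and the function name `find_islands` are hypothetical and only for illustration. The role codes 2 and 3 come from the query in this answer; which one is the customer is an assumption.

```python
from collections import defaultdict

# Hypothetical sample rows: (biz_id, create_time, content, send_role).
rows = [
    ("xxx", 1, "is customer service?", 2),
    ("xxx", 2, "yes", 3),
    ("xxx", 3, "how can i help you", 3),
    ("xxx", 4, "great", 2),
    ("xxx", 5, "why this happens", 2),
    ("xxx", 6, "wait a minute, let me check", 3),
]


def find_islands(rows):
    """Merge consecutive same-role messages, keyed by seqnum - seqnum_2."""
    seqnum = defaultdict(int)    # row_number() over (partition by biz_id)
    seqnum_2 = defaultdict(int)  # row_number() over (partition by biz_id, send_role)
    islands = {}                 # (biz_id, role, diff) -> [first create_time, [contents]]

    for biz_id, create_time, content, role in sorted(rows, key=lambda r: (r[0], r[1])):
        seqnum[biz_id] += 1
        seqnum_2[(biz_id, role)] += 1
        diff = seqnum[biz_id] - seqnum_2[(biz_id, role)]
        # Rows are visited in time order, so the first row seen is the minimum time.
        island = islands.setdefault((biz_id, role, diff), [create_time, []])
        island[1].append(content)

    result = [
        (biz_id, role, start, " ".join(parts))
        for (biz_id, role, _diff), (start, parts) in islands.items()
    ]
    return sorted(result, key=lambda r: (r[0], r[2]))


for island in find_islands(rows):
    print(island)
```

Unlike `collect_list` in Hive, this sketch concatenates messages in the order it visits them, so it sidesteps the ordering caveat only because the hypothetical timestamps are distinct.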
You can then re-aggregate the result of this first query to get the layout you want:
with x as (
      select biz_id, send_role, min(create_time) as create_time,
             concat_ws(' ', collect_list(content)) as content
      from (select t.*,
                   row_number() over (partition by biz_id order by create_time) as seqnum,
                   row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
            from t
           ) t
      group by biz_id, send_role, (seqnum - seqnum_2)
     )
select biz_id,
       max(case when send_role = 2 then content end) as roleA,
       max(case when send_role = 3 then content end) as roleB
from (select x.*,
             -- number each role's runs so the n-th run of one role pairs with the n-th run of the other
             row_number() over (partition by biz_id, send_role order by create_time) as seqnum
      from x
     ) x
group by biz_id, seqnum;
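The re-aggregation step can be sketched in Python as well. This is only an illustration under stated assumptions: the `islands` list stands in for a hypothetical result of the CTE `x`, the name `pivot_islands` is invented here, and mapping role code 2 to roleA and 3 to roleB simply mirrors the two `case` expressions in the query.

```python
from collections import defaultdict

# Hypothetical output of the CTE x: (biz_id, send_role, create_time, content).
islands = [
    ("xxx", 2, 1, "is customer service?"),
    ("xxx", 3, 2, "yes how can i help you"),
    ("xxx", 2, 4, "great why this happens"),
    ("xxx", 3, 6, "wait a minute, let me check"),
]

ROLE_A, ROLE_B = 2, 3  # role codes used in the query; their meaning is assumed


def pivot_islands(islands):
    """Pair the n-th run of ROLE_A with the n-th run of ROLE_B per biz_id."""
    run_counter = defaultdict(int)  # row_number() over (partition by biz_id, send_role)
    paired = {}                     # (biz_id, seqnum) -> {role: content}

    for biz_id, role, _create_time, content in sorted(islands, key=lambda r: (r[0], r[2])):
        run_counter[(biz_id, role)] += 1
        seqnum = run_counter[(biz_id, role)]
        paired.setdefault((biz_id, seqnum), {})[role] = content

    # A missing role yields None, like max(case ... end) returning NULL.
    return [
        (biz_id, cells.get(ROLE_A), cells.get(ROLE_B))
        for (biz_id, _seqnum), cells in sorted(paired.items(), key=lambda kv: kv[0])
    ]


for row in pivot_islands(islands):
    print(row)
```

Pairing by run number assumes the two roles strictly take turns. If one role opens or closes the conversation without a reply, the corresponding cell is empty rather than shifted.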
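Note: this may put the content from "adjacent" rows together in an arbitrary order. Getting it into the "right" order is tricky. In your sample data the date/times are identical, so there is no obvious column to order by.
Beautiful! I didn't realize there was something called the "gaps and islands problem". Thank you very much!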