How can I use SQL to generate an identifier for all groups that contain the same data set?


I have a table that contains bus line timetables, like this:

    SELECT
            NULL AS CALENDARDATE
           ,NULL AS BUS_ID
           ,NULL AS STOP_ID
           ,NULL AS TIME
      FROM DUAL UNION
    SELECT '20190101',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT '20190101',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT '20190101',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT '20190101',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT '20190102',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT '20190102',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT '20190102',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT '20190102',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT '20190103',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT '20190103',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT '20190103',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT '20190103',  100, 'STOP_1001', '8:37' FROM DUAL UNION
    SELECT '20190103',  100, 'STOP_1003', '8:39' FROM DUAL UNION

    SELECT '20190104',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT '20190104',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT '20190104',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT '20190104',  100, 'STOP_1003', '8:37' FROM DUAL UNION

    SELECT '20190105',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT '20190105',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT '20190105',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT '20190105',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT '20190101',  101, 'STOP_1002', '8:20' FROM DUAL UNION
    SELECT '20190101',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT '20190101',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT '20190101',  101, 'STOP_1002', '8:50' FROM DUAL UNION

    SELECT '20190102',  101, 'STOP_1002', '8:22' FROM DUAL UNION
    SELECT '20190102',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT '20190102',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT '20190102',  101, 'STOP_1002', '8:50' FROM DUAL UNION

    SELECT '20190103',  101, 'STOP_1002', '8:20' FROM DUAL UNION
    SELECT '20190103',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT '20190103',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT '20190103',  101, 'STOP_1002', '8:50' FROM DUAL ;
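
Our goal is to compress the data: instead of storing the redundant rows, we want to create a schedule ID for each day.

In other words, whenever a bus line passes the same stops at the same times on two different days, both days should get the same schedule ID. The ID is generated per day and bus combination.

I tried to solve this with two nested loops, an outer loop over the days and an inner loop over the bus lines, but with this much data (roughly 500 million rows per year) it takes far too long to finish.

I know I could do this in SQL Server with XML PATH, but that does not work in Oracle. I am sure Oracle has some other elegant solution, but I am new to Oracle, so I am asking for help here.

The environment is Oracle 11g.

Expected result: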

    SELECT
            NULL AS SCHEDUL_ID
           ,NULL AS CalendarDate
           ,NULL AS BUS_ID
           ,NULL AS STOP_ID
           ,NULL AS TIME
        FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190101',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190101',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190101',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190101',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_100_01', '20190102',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190102',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190102',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190102',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_100_02', '20190103',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_02', '20190103',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_02', '20190103',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_02', '20190103',  100, 'STOP_1001', '8:37' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_02', '20190103',  100, 'STOP_1003', '8:39' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_100_03', '20190104',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_03', '20190104',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_03', '20190104',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_03', '20190104',  100, 'STOP_1003', '8:37' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_100_01', '20190105',  100, 'STOP_1000', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190105',  100, 'STOP_1000', '8:35' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190105',  100, 'STOP_1001', '8:32' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_100_01', '20190105',  100, 'STOP_1001', '8:37' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_101_01', '20190101',  101, 'STOP_1002', '8:20' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190101',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190101',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190101',  101, 'STOP_1002', '8:50' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_101_02', '20190102',  101, 'STOP_1002', '8:22' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_02', '20190102',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_02', '20190102',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_02', '20190102',  101, 'STOP_1002', '8:50' FROM DUAL UNION

    SELECT 'SCHEDULE_2019_101_01', '20190103',  101, 'STOP_1002', '8:20' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190103',  101, 'STOP_1002', '8:30' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190103',  101, 'STOP_1002', '8:40' FROM DUAL UNION
    SELECT 'SCHEDULE_2019_101_01', '20190103',  101, 'STOP_1002', '8:50' FROM DUAL 
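
Once I have the expected result, the rest is simple: I can split the data into two tables. The first holds the mapping between date and schedule ID, and the second holds schedule ID, stop and time.

Also, you can ignore the row of NULL values; it is only there to make the column names readable in the script.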
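
The two-table split described above could be sketched as follows. This assumes the expected result has already been stored in a table; the names schedule_result, schedule_calendar and schedule_stops are hypothetical, and the column names are adapted from the question (with schedule_id spelled out).

```sql
-- Hypothetical source table:
--   schedule_result(schedule_id, calendardate, bus_id, stop_id, time)

-- Table 1: which schedule applies on which date.
create table schedule_calendar as
select distinct schedule_id, calendardate
from schedule_result;

-- Table 2: the stops and times that make up each schedule, stored once.
create table schedule_stops as
select distinct schedule_id, stop_id, time
from schedule_result;
```

Because the schedule ID in the expected result already embeds the bus number, the first table does not need a separate bus column in this sketch.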

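You can use listagg() and dense_rank(). First use listagg() to build a signature of the stops and times for each day and bus, then assign a number to each distinct signature: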

    -- One row per (calendardate, busid); identical stop/time lists share a number.
    select b.*,
           dense_rank() over (order by busid, stops) as schedule_number
    from (select calendardate, busid,
                 listagg(stopid || '-' || time, ',') within group (order by time) as stops
          from buslines
          group by calendardate, busid
         ) b;

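Note that busid is part of the ordering used for the ranking, so two buses that serve the same stops at the same times still get different schedule numbers.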
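
To get from a schedule number to identifiers shaped like the ones in the expected result, one option is to number the signatures per bus in order of first appearance and join back to the detail rows. This is only a sketch. It assumes the table and column names used in the answer (buslines, calendardate, busid, stopid, time), that calendardate is a YYYYMMDD string, and that the year in the identifier is taken from the first date on which a schedule appears.

```sql
with sig as (
  -- One signature per (calendardate, busid); stopid breaks ties between equal times.
  select calendardate, busid,
         listagg(stopid || '-' || time, ',')
           within group (order by time, stopid) as stops
  from buslines
  group by calendardate, busid
),
numbered as (
  -- Number each distinct signature per bus by the first date it appears on.
  select busid, stops, first_date,
         dense_rank() over (partition by busid order by first_date) as seq
  from (select busid, stops, min(calendardate) as first_date
        from sig
        group by busid, stops)
)
-- lpad(..., 2, '0') assumes fewer than 100 distinct schedules per bus.
select 'SCHEDULE_' || substr(n.first_date, 1, 4) || '_' || n.busid
         || '_' || lpad(n.seq, 2, '0') as schedule_id,
       t.calendardate, t.busid, t.stopid, t.time
from buslines t
join sig s
  on s.calendardate = t.calendardate and s.busid = t.busid
join numbered n
  on n.busid = s.busid and n.stops = s.stops
order by t.busid, t.calendardate, t.time, t.stopid;
```

With the sample data from the question, bus 100 should get sequence 01 on 20190101, 20190102 and 20190105, 02 on 20190103 and 03 on 20190104. On Oracle 11g the listagg() result is still a VARCHAR2 limited to 4000 bytes, so very long timetables will not fit.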

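The query runs fine in the test environment, but in production it fails with: SQL Error: ORA-01489: result of string concatenation is too long. I think I will have to create a user-defined aggregate function.

@AdamYan. How long is the longest bus line? The stopid and time together should be about 10 characters. I would think 100 stops is a lot for one bus. You might be able to filter out the outliers.

I can have 50+ stops and 100+ times. That is 5000 combinations * 5 * 2 = about 50,000 characters per bus per day.

@AdamYan. With 50 stops, you might be able to encode each stop and time in fewer than 15 characters. If so, the string might not exceed the built-in limit.

Thanks Gordon, I got it working. Because the raw data uses overlapping date ranges such as date_from and date_to, I applied LISTAGG to date_from | date_to instead, which avoids the ORA-01489 error. I also plan to apply ORA_HASH to the LISTAGG result, so that I get stable identifiers instead of the unstable rank() results.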
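
The ORA_HASH idea from the last comment could look roughly like this. It is a sketch under the same naming assumptions as the answer's query (buslines, calendardate, busid, stopid, time), not the commenter's actual code. The alias schedule_hash is made up for illustration.

```sql
-- ORA_HASH depends only on its input, so the same bus and stop/time list
-- gives the same value on every run, unlike a rank over the current data.
select b.calendardate, b.busid,
       ora_hash(b.busid || ':' || b.stops) as schedule_hash
from (select calendardate, busid,
             listagg(stopid || '-' || time, ',')
               within group (order by time, stopid) as stops
      from buslines
      group by calendardate, busid
     ) b;
```

By default ORA_HASH returns a number between 0 and 4294967295, so two different signatures can collide. It is worth checking for collisions before treating the value as a unique schedule identifier.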