Priority queue in SQL Server

Tags: sql, sql-server, stored-procedures, priority-queue, query-performance

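I am currently building a web crawler in C#. To queue the URLs that have not been crawled yet, I use SQL Server. It starts out very fast, but over time the table grows very large, and that slows down my stored procedures.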

CREATE TABLE PriorityQueue
(
ID int IDENTITY(0,1) PRIMARY KEY,
absolute_url varchar (400),
depth int,
priorty int,
domain_host varchar (255),
);

CREATE INDEX queueItem ON PriorityQueue(absolute_url);
CREATE INDEX queueHost ON PriorityQueue(domain_host);
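
This is the table I use for the queue. Priority ranges from 1 to 5, where 1 is the highest priority. As you can see, I also created indexes for the stored procedures below.

Procedure to add a new item to the queue: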

DROP PROCEDURE IF EXISTS dbo.Enqueue
GO
CREATE PROCEDURE dbo.Enqueue(@absolute_url varchar(255), @depth int, @priorty int, @host varchar(255))
AS
BEGIN
    INSERT INTO [WebshopCrawler].[dbo].[PriorityQueue] (absolute_url, depth, priorty, domain_host) VALUES (@absolute_url, @depth, @priorty, @host);
END
GO

Procedure to get the item with the highest priority:

DROP PROCEDURE IF EXISTS dbo.Dequeue
GO
CREATE PROCEDURE dbo.Dequeue
AS
BEGIN
    SELECT top 1 absolute_url, depth, priorty
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE priorty = (SELECT MIN(priorty) FROM [WebshopCrawler].[dbo].[PriorityQueue])
END
GO

This procedure becomes very slow as the amount of data grows.

Procedure to remove a queued item:

DROP PROCEDURE IF EXISTS dbo.RemoveFromQueue
GO
CREATE PROCEDURE dbo.RemoveFromQueue(@absolute_url varchar(400))
AS
BEGIN
    DELETE 
    FROM [WebshopCrawler].[dbo].[PriorityQueue]
    WHERE absolute_url = @absolute_url
END
GO
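
I have tried many different indexes, but nothing seems to make the procedures run faster. I hope someone has an idea of how to improve this.

Please read. Important issues:

  • The table must be organized according to the dequeue strategy. A primary key on an identity column is meaningless here. Use a clustered index based on priority and dequeue order.
  • You must dequeue atomically, in a single statement, using DELETE ... OUTPUT ...

So it should be something like this: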

CREATE TABLE PriorityQueue
(
  priority int not null,
  enqueue_time datetime not null default GETUTCDATE(),
  absolute_url varchar (8000) not null,
  depth int not null,
  domain_host varchar (255) not null
);

-- Cluster the table in dequeue order. The question treats 1 as the
-- highest priority, so priority sorts ascending.
CREATE CLUSTERED INDEX PriorityQueueCdx on PriorityQueue(priority, enqueue_time);
GO

CREATE PROCEDURE dbo.Dequeue
AS
BEGIN
    -- Select the next row and delete it in one atomic statement.
    -- readpast skips rows that other sessions currently hold locked.
    with cte as (
       SELECT top 1 absolute_url, depth, priority
       FROM [PriorityQueue] with (rowlock, readpast)
       ORDER BY priority, enqueue_time)
     DELETE FROM cte
         OUTPUT DELETED.*;
END
GO
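
The answer only shows the dequeue side. As a rough sketch of how the matching enqueue procedure and a round trip might look, assuming the PriorityQueue table and dbo.Dequeue procedure above already exist (the parameter names and the example URLs are placeholders, not part of the original answer):

```sql
-- Sketch only: relies on the PriorityQueue table and dbo.Dequeue defined above.
DROP PROCEDURE IF EXISTS dbo.Enqueue
GO
CREATE PROCEDURE dbo.Enqueue
    @absolute_url varchar(8000),
    @depth        int,
    @priority     int,
    @host         varchar(255)
AS
BEGIN
    -- enqueue_time is not listed, so the column default (GETUTCDATE()) fills it in.
    INSERT INTO dbo.PriorityQueue (priority, absolute_url, depth, domain_host)
    VALUES (@priority, @absolute_url, @depth, @host);
END
GO

-- Example calls with placeholder URLs.
EXEC dbo.Enqueue @absolute_url = 'https://example.com/a', @depth = 1, @priority = 3, @host = 'example.com';
EXEC dbo.Enqueue @absolute_url = 'https://example.com/b', @depth = 1, @priority = 1, @host = 'example.com';

-- Each call returns one row and removes it in the same statement.
EXEC dbo.Dequeue;
```

Because the row is deleted by the same statement that returns it, a separate RemoveFromQueue call is no longer needed for the normal path, and two workers calling dbo.Dequeue at the same time should not receive the same URL.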

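default GETUTCDATE(): It would be better to give this constraint a name rather than let SQL Server assign a random one. I know this is just an illustration =), but people may copy it blindly and conclude that leaving constraints unnamed is good practice. Second, if rows are added with the same enqueue_time, which happens with rapid or bulk inserts, their order is not guaranteed. That goes against the idea of a queue. (TT)

You are right. I tried the approach above and it works well, but because of multithreading, URLs can be inserted at the same time.

If two entries are added at the same time, which of them should be dequeued first? My point is that if they are added at the same time, nothing but randomness can decide which one comes out first. They are equal in order.

datetime has a resolution of .000, .003, and .007 seconds. Values in between are rounded to the nearest of these. Rows inserted at ~.001 and ~.000 are both stored as ~.000. See my point? Sorry, but the queue described in the answer is broken.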
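
The rounding mentioned in the last comment can be checked directly, and one possible way to avoid the tie is to order by a unique sequence value instead of the timestamp. The sketch below is not from the thread: the table PriorityQueueSeq, the column seq, and the procedure DequeueSeq are hypothetical names, and it assumes that 1 is the highest priority, as stated in the question.

```sql
-- datetime stores time in 1/300-second ticks, so .000 and .001
-- end up as the same stored value.
SELECT CAST('2024-01-01T00:00:00.000' AS datetime) AS t0,
       CAST('2024-01-01T00:00:00.001' AS datetime) AS t1;
GO

-- Hypothetical variant: an identity column breaks ties between rows of
-- equal priority, so the dequeue order is always well defined.
CREATE TABLE dbo.PriorityQueueSeq
(
    priority     int NOT NULL,
    seq          bigint IDENTITY(1,1) NOT NULL,
    absolute_url varchar(8000) NOT NULL,
    depth        int NOT NULL,
    domain_host  varchar(255) NOT NULL
);

-- (priority, seq) is unique, so the clustered index can be declared UNIQUE.
CREATE UNIQUE CLUSTERED INDEX PriorityQueueSeqCdx
    ON dbo.PriorityQueueSeq (priority, seq);
GO

CREATE PROCEDURE dbo.DequeueSeq
AS
BEGIN
    -- Same single-statement DELETE ... OUTPUT pattern as in the answer,
    -- ordered by the sequence value instead of enqueue_time.
    WITH cte AS (
        SELECT TOP (1) priority, seq, absolute_url, depth
        FROM dbo.PriorityQueueSeq WITH (ROWLOCK, READPAST)
        ORDER BY priority, seq)
    DELETE FROM cte
    OUTPUT DELETED.absolute_url, DELETED.depth, DELETED.priority;
END
GO
```

With this layout, rows of equal priority leave the queue in the order their identity values were assigned. For inserts that really do happen at the same moment, that order is still arbitrary, but it is fixed once assigned, which speaks to both sides of the disagreement above.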