Is there a better way to find set intersections for search engine code?


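I have been writing a small search engine and need to find out whether there is a faster way to compute set intersections. Currently I use sorted linked lists, as explained in most descriptions of search engine algorithms; i.e. for every word I keep a sorted list of documents, and then I find the intersection of those lists.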

A performance profile of this case is available (link not preserved).

Are there any other ideas for a faster set intersection?
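For reference, the sorted-list approach described in the question can be sketched roughly as follows. This is only a minimal illustration, assuming each posting list is an ascending Python list of unique integer document IDs; the names `intersect_sorted` and `intersect_all` are made up for this sketch and do not come from the question.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of doc IDs with a linear merge."""
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same doc ID in both lists: keep it and advance both cursors.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, step it forward
        else:
            j += 1  # b is behind, step it forward
    return result


def intersect_all(postings):
    """Intersect many posting lists, shortest first to keep the running result small."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = list(ordered[0])
    for plist in ordered[1:]:
        result = intersect_sorted(result, plist)
        if not result:
            break  # nothing can match any more
    return result
```

Each pairwise merge costs time proportional to the combined length of the two lists, which is the baseline that the suggestions in this thread try to improve on.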

One efficient approach is "zig-zag":

Assume your terms are a list T:

lastDoc <- 0 //the first doc in the collection
currTerm <- 0 //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //we have passed the last term, so lastDoc matched every term
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  //first doc id >= lastDoc for this term; assumed to be infinity if there is none
  docId <- T[currTerm].getFirstAfter(lastDoc-1)
  if (docId != lastDoc): //this term skips lastDoc, so restart from the next candidate
     lastDoc <- docId
     currTerm <- 0
  else:
     currTerm <- currTerm + 1
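The pseudocode above could be turned into runnable Python along these lines. This is a sketch under some assumptions, not the answerer's own implementation: posting lists are taken to be ascending Python lists of non-negative integer doc IDs, and `get_first_after` is a hypothetical helper built on the standard `bisect` module that returns the first ID strictly greater than its argument, or infinity when there is none.

```python
from bisect import bisect_right

INF = float("inf")  # sentinel meaning "no further document"


def get_first_after(postings, doc_id):
    """Return the first doc ID in the sorted list that is > doc_id, or INF."""
    pos = bisect_right(postings, doc_id)
    return postings[pos] if pos < len(postings) else INF


def zigzag_intersect(term_postings):
    """Zig-zag intersection over a list of sorted posting lists (one per term)."""
    if not term_postings:
        return []  # guard: with no terms the loop below would never stop
    result = []
    last_doc = 0   # current candidate doc ID
    curr_term = 0  # index of the term being checked
    while last_doc != INF:
        if curr_term == len(term_postings):
            # Every term contains last_doc, so it is part of the intersection.
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        # First doc ID >= last_doc in the current term's list.
        doc_id = get_first_after(term_postings[curr_term], last_doc - 1)
        if doc_id != last_doc:
            # This term skips last_doc; jump to its next doc and start over.
            last_doc = doc_id
            curr_term = 0
        else:
            curr_term += 1
    return result
```

For example, `zigzag_intersect([[2, 4], [1, 2, 3, 4, 5], [2, 3, 4]])` gives `[2, 4]`. Putting the shortest (rarest) list first tends to reduce the number of candidates that have to be checked, since the candidate jumps are driven by whichever list rejects it.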

Here is a quantitative analysis comparing the current algorithms.
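You could start with a binary search to avoid the linear stepping at the beginning. (This can be extended to the overlapping part with some kind of "search" method.) By the way, a linked list is not the best representation for large sorted sets. You could try arrays.

Binary search is a good idea. It will help with skipping, if introduced.

So, if the list/array only changes while the search data structure is being updated, does array versus list really matter?

Thanks a lot, I will try it out and see how it goes. Thanks.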
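The binary-search suggestion above could look something like the following on arrays. This sketch uses galloping (exponential) search followed by a binary search, which is one possible reading of the comment rather than something the commenters spelled out; `gallop_to` and `gallop_intersect` are hypothetical names, and the inputs are assumed to be ascending Python lists of unique doc IDs.

```python
from bisect import bisect_left


def gallop_to(arr, target, lo):
    """Return the smallest index i >= lo with arr[i] >= target (len(arr) if none).

    Assumes every element before index lo is already known to be < target.
    """
    step = 1
    hi = lo
    # Double the step until we reach an element >= target or run off the end.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer now lies in [lo, min(hi, len(arr))]; finish with binary search.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def gallop_intersect(small, large):
    """Intersect two sorted arrays by galloping through the larger one."""
    result = []
    pos = 0
    for doc_id in small:
        pos = gallop_to(large, doc_id, pos)
        if pos == len(large):
            break  # large is exhausted, no further matches possible
        if large[pos] == doc_id:
            result.append(doc_id)
            pos += 1
    return result
```

This relies on constant-time indexing, which is why arrays suit it better than linked lists. It tends to pay off when one list is much shorter than the other; for lists of similar length a plain linear merge is usually just as good.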