Python 3.x: Removing duplicate feed titles from a webhoseio crawl


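I have news feeds crawled with webhoseio and now need to remove the entries with duplicate titles. My code is below. Something must be wrong, because the output still contains duplicate titles. Please help me find the problem. Thanks.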

count_dup = 0
for j in range(0, len(feeds)):
    SELECTED_INDEX = j
    feed_sel = feeds[SELECTED_INDEX]
    #print(feed_sel['title'])
    feed_hash = Simhash(str(feed_sel['title']))
    dup_indices = index.get_near_dups(feed_hash)
    #print("Number of duplicates (SimHash): " + str(len(dup_indices)))

    for dupi in dup_indices:
        try:
            score = calc_similarity(feed_sel['title'], feeds[int(dupi)]['title'], model_word2vec)
        except:
            score = 0
        if score > 0.85:
            if feeds[int(dupi)]['id'] == j:
                print(feeds[int(dupi)]['id'], feeds[int(dupi)]['title'])
            else:
                feeds.pop(feeds[int(dupi)]['id'] - count_dup)
                count_dup += 1
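
Judging only from the code shown, one likely cause is that `feeds` is modified while it is being indexed. After the first `pop`, every later element moves one position to the left, so `feeds[int(dupi)]` no longer refers to the entry the SimHash index returned. The offset `- count_dup` is only correct when every previously removed entry sat before the current one and each `id` equals the entry's original position. In addition, `range(0, len(feeds))` is computed once, so `j` can run past the end of the shortened list. A minimal sketch of a safer pattern follows: record the positions to drop in a set and build a new list at the end. Here `similarity` is a hypothetical stand-in for `calc_similarity` based on `difflib`, and `dedupe_feeds` is an illustrative name, not part of the original code.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for calc_similarity: a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe_feeds(feeds, threshold=0.85, score=similarity):
    """Return a new list that keeps the first feed of each near-duplicate group.

    The input list is never mutated, so positions stay valid for the whole scan.
    """
    dropped = set()  # original positions marked as duplicates
    for i, feed in enumerate(feeds):
        if i in dropped:
            continue  # already removed as a duplicate of an earlier feed
        for j in range(i + 1, len(feeds)):
            if j in dropped:
                continue
            if score(str(feed['title']), str(feeds[j]['title'])) > threshold:
                dropped.add(j)
    return [f for k, f in enumerate(feeds) if k not in dropped]


if __name__ == "__main__":
    sample = [
        {'id': 0, 'title': 'Stocks rally as markets open'},
        {'id': 1, 'title': 'Stocks rally as markets open!'},
        {'id': 2, 'title': 'Local team wins championship'},
    ]
    for feed in dedupe_feeds(sample):
        print(feed['id'], feed['title'])
```

In the original setup, the inner loop would presumably iterate over `index.get_near_dups(feed_hash)` instead of all later positions, skipping `int(dupi) == i`, and `score` would wrap `calc_similarity` with `model_word2vec`. The pairwise scan above is quadratic and is only meant to show the "mark, then filter" structure.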