Python: Finding words that do not exist on a website_Python_Scrapy

python scrapy

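I am writing a Scrapy spider that should find out whether specific strings are present in the content (text) of a website. I have many websites (a few thousand) and many strings to look for, which is why my code uses lists bound to variables. Some of the lists are imported from other Python files.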

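The problem is that the code seems to produce positive "hits" even though I cannot find the string when I check the URLs manually with the browser's developer tools. The code and a sample of the results are below.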

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from list_loop import *
import re
 
word_to_find = 'pharmacy'
 
 
class TestSpider(CrawlSpider):
    name = 'test'
    # these are lists of a lot of domains imported from another
    # file called list_loop.py
    allowed_domains = strip_url
    start_urls = merch_url
 
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
 
    def parse_item(self, response):
        # Here I clean up the parsed text not to include /n or whitespace.
        words = response.xpath("//a//text()").getall()
        cleaned_words = [word.strip() for word in words]
        cleaned_words = [word.lower() for word in cleaned_words if len(word) > 0]
 
        # Then I loop through the cleaned_words in order to find a match
        for single_word in cleaned_words:
            re.search(r'\b%s\b' % word_to_find, single_word)
            yield{
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
        else:
            pass

The allowed_domains and start_urls lists contain alibaba.com along with many other websites. After running the spider, I get output like this:

{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},

The same happens for many other websites whose content or HTML does not actually contain the word "pharmacy". Any idea what is going wrong here?

I believe you are missing an if statement. In your code, the yield runs whether or not there is a match:

    for single_word in cleaned_words:
        re.search(r'\b%s\b' % word_to_find, single_word)
        yield{
            'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
        }
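To see why every link text produces a hit, here is a minimal sketch outside Scrapy. It assumes a few made-up link texts in place of the real cleaned_words, and the function name buggy_hits is hypothetical. re.search returns a match object or None, and the loop above throws that return value away, so the yield runs once per string no matter what.

```python
import re

word_to_find = 'pharmacy'

# Hypothetical link texts, standing in for cleaned_words from the spider.
cleaned_words = ['home', 'sign in', 'online pharmacy', 'contact us']


def buggy_hits(words):
    """Mimics the original loop: the search result is never checked."""
    for single_word in words:
        re.search(r'\b%s\b' % word_to_find, single_word)  # result discarded
        yield single_word


# re.search returns None when there is no match, a match object otherwise.
no_match = re.search(r'\b%s\b' % word_to_find, 'home')
match = re.search(r'\b%s\b' % word_to_find, 'online pharmacy')

# One item is yielded per input string, matching or not.
hits = list(buggy_hits(cleaned_words))
print(no_match, bool(match), len(hits))
```

Four input strings give four yielded items, which is consistent with the repeated identical rows in the question's output: one row per link text on the page.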

I believe you want something like this:

    for single_word in cleaned_words:
        if re.search(r'\b%s\b' % word_to_find, single_word):
            yield{
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
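Since the question mentions many strings to look for, the same check could be extended along these lines. This is only a sketch under assumptions: the words_to_find list, the helper name find_words, and the sample strings are all hypothetical and not part of the original spider. It escapes each term with re.escape, compiles one case-insensitive pattern, and reports each word at most once per page instead of once per link text.

```python
import re

# Hypothetical word list; in the spider this could come from list_loop.py.
words_to_find = ['pharmacy', 'drug store']

# re.escape guards against regex metacharacters in the search terms.
pattern = re.compile(
    r'\b(?:%s)\b' % '|'.join(re.escape(w) for w in words_to_find),
    re.IGNORECASE,
)


def find_words(texts):
    """Return the set of search words found in any of the given strings."""
    found = set()
    for text in texts:
        for m in pattern.finditer(text):
            found.add(m.group(0).lower())
    return found


# Made-up strings standing in for the text nodes extracted from one page.
sample = ['Home', 'Online Pharmacy', 'pharmacy deals', 'Contact us']
print(sorted(find_words(sample)))
```

Inside parse_item, one could pass response.xpath("//a//text()").getall() to find_words and yield a single item per page only when the returned set is non-empty. Note that "//a//text()" covers link text only; if the whole page text is meant to be searched, an expression such as "//body//text()" may be closer to the intent.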

Good to know. Please click the check mark to accept my answer.