Python: finding words that are not present on the website (python, scrapy)

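I am writing a Scrapy spider that is supposed to find out whether a specific string is present in a website's content (its text). I have a lot of websites (a few thousand) and a lot of strings to look for, which is why the code uses lists bound to variables. Some of the lists are imported from other Python files.

The problem is that the code seems to produce positive "hits" even though, when I manually inspect the page at that URL with the dev tools, I cannot find the string anywhere. The code and a sample of the results are below.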

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from list_loop import *
import re
 
word_to_find = 'pharmacy'
 
 
class TestSpider(CrawlSpider):
    name = 'test'
    # these are lists of a lot of domains imported from another
    # file called list_loop.py
    allowed_domains = strip_url
    start_urls = merch_url
 
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
 
    def parse_item(self, response):
        # Here I clean up the parsed text not to include /n or whitespace.
        words = response.xpath("//a//text()").getall()
        cleaned_words = [word.strip() for word in words]
        cleaned_words = [word.lower() for word in cleaned_words if len(word) > 0]
 
        # Then I loop through the cleaned_words in order to find a match
        for single_word in cleaned_words:
            re.search(r'\b%s\b' % word_to_find, single_word)
            yield{
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
        else:
            pass
The allowed_domains and start_urls lists contain alibaba.com along with a lot of other websites. After running the spider, I get output like this:

{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},

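The same thing happens for many other websites that do not actually have the word "pharmacy" anywhere in their content or HTML. Any idea what is going wrong here?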

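I believe you are missing an if statement. In your code, the yield runs on every iteration of the loop whether or not there is a match, because the result of re.search is never checked: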

    for single_word in cleaned_words:
        re.search(r'\b%s\b' % word_to_find, single_word)
        yield{
            'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
        }
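
As a quick illustration of why that loop reports a hit for every link text, here is a small standalone sketch. The sample strings are made up for illustration and are not taken from any real page. re.search returns a match object or None, and unless that result is tested, nothing in the loop depends on it.

```python
import re

word_to_find = 'pharmacy'

# Hypothetical link texts standing in for cleaned_words
samples = ['sign in', 'online pharmacy deals', 'help center']

# re.search returns None when nothing matches and a match object otherwise,
# so the result has to be tested before it can influence anything
matches = [
    re.search(r'\b%s\b' % word_to_find, text) is not None
    for text in samples
]

for text, matched in zip(samples, matches):
    print(repr(text), '->', matched)
```

Only the second string contains the word, yet the original loop would yield an item for all three.
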
I believe you want something like this:

    for single_word in cleaned_words:
        if re.search(r'\b%s\b' % word_to_find, single_word):
            yield{
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
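
If you would rather get at most one hit per page instead of one per matching link, and want the search word treated literally, something along these lines might work. This is only a sketch: page_contains_word is a helper name I made up, and it assumes you pass it the list of strings you get back from response.xpath(...).getall().

```python
import re


def page_contains_word(texts, word):
    """Return True if `word` appears as a whole word in any of `texts`."""
    # re.escape keeps characters such as '.' or '+' in the word literal;
    # re.IGNORECASE makes lowercasing the texts beforehand unnecessary
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.IGNORECASE)
    return any(pattern.search(text) for text in texts)


# Hypothetical link texts; in the spider these would come from
# response.xpath("//a//text()").getall()
print(page_contains_word(['Sign in', 'Online Pharmacy deals'], 'pharmacy'))
print(page_contains_word(['Sign in', 'Help center'], 'pharmacy'))
```

Inside parse_item you would then test page_contains_word(words, word_to_find) once and yield a single item if it is true. Also note that the XPath //a//text() only looks at text inside links, so a word that appears elsewhere on the page would be missed. A broader expression such as //body//text() may be closer to what you want, but that is an assumption about your goal.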

Good to know. Please click the check mark to accept my answer.