
Python: what is wrong here?

This is my code. It looks correct to me, but it doesn't work. Please help.

HEADER_XPATH = ['//h1[@class="story-body__h1"]//text()']    
AUTHOR_XPATH = ['//span[@class="byline__name"]//text()']   
PUBDATE_XPATH = ['//div/@data-datetime']  
WTAGS_XPATH = ['']   
CATEGORY_XPATH = ['//span[@rev="news|source"]//text()']
TEXT = ['//div[@property="articleBody"]//p//text()']   
INTERLINKS = ['//div[@class="story-body__link"]//p//a/@href']  
DATE_FORMAT_STRING = '%Y-%m-%d'

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    sitemap_urls = [
        'http://Www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse_page(self, response):
        items = []
        item = ContentItems()
        item['title'] = process_singular_item(self, response, HEADER_XPATH, single=True)
        item['resource'] = urlparse(response.url).hostname
        item['author'] = process_array_item(self, response, AUTHOR_XPATH, single=False)
        item['pubdate'] = process_date_item(self, response, PUBDATE_XPATH, DATE_FORMAT_STRING, single=True)
        item['tags'] = process_array_item(self, response, TAGS_XPATH, single=False)
        item['category'] = process_array_item(self, response, CATEGORY_XPATH, single=False)
        item['article_text'] = process_article_text(self, response, TEXT)
        item['external_links'] = process_external_links(self, response, INTERLINKS, single=False)
        item['link'] = response.url
        items.append(item)
        return items

Your spider is simply badly structured, and because of that it does nothing.
A scrapy.Spider needs a start_urls class attribute, which should contain the list of URLs the spider will use to start crawling; every one of those URLs calls back to the class method parse, which means parse is required as well.

Your spider has a sitemap_urls class attribute that is never used anywhere, and it also has a parse_page class method that is never used anywhere either.
In short, your spider should look something like this:

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    start_urls = [
        'http://Www.bbc.com/news/sitemap/',
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse(self, response):
        # This is a page with all of the articles
        article_urls = []  # find the article urls in the page, e.g. with response.xpath(...)
        for url in article_urls:
            yield Request(url, self.parse_page)

    def parse_page(self, response):
        # This is an article page
        item = ContentItems()
        # populate item
        return item
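
For reference, a minimal self-contained version of such a spider might look like the sketch below. It reuses the title and body XPaths from the question, but the article-link selector (taken from the comment thread further down), the use of response.urljoin, and yielding plain dicts instead of ContentItems are illustrative assumptions, not part of the original answer.

from urllib.parse import urlparse

from scrapy import Spider, Request

HEADER_XPATH = '//h1[@class="story-body__h1"]//text()'
TEXT_XPATH = '//div[@property="articleBody"]//p//text()'

class BBCSpider(Spider):
    name = "bbc"
    allowed_domains = ["bbc.com"]
    start_urls = [
        'http://www.bbc.com/news/technology/',
        'http://www.bbc.com/news/science_and_environment/']

    def parse(self, response):
        # Index page: collect links to the individual articles.
        # The 'title-link' class is an assumption; inspect the page to confirm it.
        article_urls = response.xpath(
            "//a[contains(@class, 'title-link')]/@href").extract()
        for url in article_urls:
            yield Request(response.urljoin(url), callback=self.parse_page)

    def parse_page(self, response):
        # Article page: yield the extracted fields as a plain dict.
        yield {
            'title': response.xpath(HEADER_XPATH).extract_first(),
            'resource': urlparse(response.url).hostname,
            'article_text': ' '.join(response.xpath(TEXT_XPATH).extract()),
            'link': response.url,
        }

With a Scrapy project set up, running scrapy crawl bbc -o articles.json would then dump whatever the spider scrapes into a JSON file.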

What's wrong? Maybe explain what the problem is? Input, expected output? What are you trying to do?
The problem is that when I run the code, nothing happens. It doesn't go through the web pages! I think my mistake is in the variables. @MooingRawr
I really appreciate it.
@nik Great! If it solved your problem, feel free to click the accept button to its left.
Sure, I will. Could you show in the post what I would have to provide as an example? (because I deal with a lot of URLs)
You need to find an xpath that matches them, something like: response.xpath("//a[contains(@class,'title-link')]/@href").extract()
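
A quick way to check whether that selector actually matches anything, before wiring it into parse, is the interactive scrapy shell; the URL here is just one of the start URLs from the question:

scrapy shell 'http://www.bbc.com/news/technology/'
>>> # prints the hrefs the selector matches; an empty list means it needs adjusting
>>> response.xpath("//a[contains(@class, 'title-link')]/@href").extract()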