
Can Scrapy's default lxml parser be replaced with Beautiful Soup's html5lib parser?

Tags: xpath, beautifulsoup, scrapy, lxml

Question: is there a way to integrate BeautifulSoup's html5lib parser into a Scrapy project, instead of Scrapy's default lxml parser?
Scrapy's parser fails on certain elements of the pages I scrape.
This happens on only about 2 pages out of every 20.

As a fix, I have added BeautifulSoup's parser to the project (which works).
That said, I feel like I am doubling up on conditionals and multiple parsers... and at a certain point, what is the reason to keep using Scrapy's parser at all?
The code does work... but it feels like a hack.
I am not an expert: is there a more elegant way to do this? Thanks in advance.

Update:
Adding a middleware class to Scrapy (from a Python package) worked like a charm. Apparently, Scrapy's lxml is not as robust as BeautifulSoup's lxml. I did not have to resort to the html5lib parser, which is 30x+ slower:

from bs4 import BeautifulSoup


class BeautifulSoupMiddleware(object):
    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()

        # Parser is configurable via the BEAUTIFULSOUP_PARSER setting;
        # defaults to the standard-library "html.parser".
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Pipe response.body through BeautifulSoup before Scrapy parses it."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
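
To take effect, a downloader middleware like this has to be registered in the project settings. A minimal sketch, assuming the class above is saved in a hypothetical myproject/middlewares.py module (the module path and priority value are illustrative):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BeautifulSoupMiddleware': 543,
}

# Read by the middleware via crawler.settings.get('BEAUTIFULSOUP_PARSER', ...);
# here it selects lxml as driven through BeautifulSoup.
BEAUTIFULSOUP_PARSER = "lxml"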
Original:

import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup


class SimpleSpider(scrapy.Spider):
    name = 'SimpleSpider'
    allowed_domains = ['totally-above-board.com']
    start_urls = [
        'https://totally-above-board.com/nefarious-scrape-page.html'
    ]

    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.spiders.simple_spider.Pipeline': 400
        }
    }

    def parse(self, response):
        yield from self.parse_company_info(response)
        yield from self.parse_reviews(response)

    def parse_company_info(self, response):
        print('parse_company_info')
        print('==================')

        loader = ItemLoader(CompanyItem(), response=response)
        loader.add_xpath('company_name',
                         '//h1[contains(@class,"sp-company-name")]//span//text()')

        yield loader.load_item()

    def parse_reviews(self, response):
        print('parse_reviews')
        print('=============')

        # Scrapy selector
        selector = Selector(response)

        # Total review count displayed on the page (e.g. 49)
        search = '//span[contains(@itemprop,"reviewCount")]//text()'
        review_count = selector.xpath(search).get()
        review_count = int(float(review_count))

        # Number of review elements Scrapy's lxml can actually find (e.g. 0)
        search = '//div[@itemprop="review"]'
        review_element_count = len(selector.xpath(search))

        # Use Scrapy or Beautiful Soup?
        if review_count > review_element_count:

            # Try Beautiful Soup
            soup = BeautifulSoup(response.text, "lxml")
            root = soup.findAll("div", {"itemprop": "review"})
            for review in root:
                loader = ItemLoader(ReviewItem(), selector=review)
                review_text = review.find("span", {"itemprop": "reviewBody"}).text
                loader.add_value('review_text', review_text)
                author = review.find("span", {"itemprop": "author"}).text
                loader.add_value('author', author)

                yield loader.load_item()
        else:
            # Try Scrapy
            review_list_xpath = '//div[@itemprop="review"]'
            selector = Selector(response)
            for review in selector.xpath(review_list_xpath):
                loader = ItemLoader(ReviewItem(), selector=review)
                loader.add_xpath('review_text',
                                 './/span[@itemprop="reviewBody"]//text()')

                loader.add_xpath('author',
                                 './/span[@itemprop="author"]//text()')

                yield loader.load_item()

        yield from self.paginate_reviews(response)

    def paginate_reviews(self, response):
        print('paginate_reviews')
        print('================')

        # Try Scrapy
        selector = Selector(response)
        search = '''//span[contains(@class,"item-next")]
                    //a[@class="next"]/@href
                 '''
        next_reviews_link = selector.xpath(search).get()

        # Try Beautiful Soup
        if next_reviews_link is None:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                next_reviews_link = soup.find("a", {"class": "next"})['href']
            except Exception:
                # Beautiful Soup could not find a "next" link either
                pass

        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)
This is a limitation of Parsel, the XML/HTML scraping library that Scrapy uses.

However, you do not need to wait for such a feature to be implemented. You can fix the HTML code with BeautifulSoup and use Parsel on the fixed HTML:

from bs4 import BeautifulSoup
# …
response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))
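
For a concrete sense of why swapping the parser helps, here is a minimal, hypothetical illustration (the markup is invented for the example): Parsel's lxml backend parses a table exactly as written, while html5lib rebuilds the tree the way a browser would, inserting the implied tbody element:

from bs4 import BeautifulSoup
from parsel import Selector

html = "<table><tr><td>cell</td></tr></table>"

# lxml (Parsel's default) keeps the markup as written: no <tbody> is added.
print(Selector(text=html).xpath('//table/tbody/tr/td/text()').get())   # None

# html5lib repairs the tree the way a browser would, inserting <tbody>.
fixed = str(BeautifulSoup(html, "html5lib"))
print(Selector(text=fixed).xpath('//table/tbody/tr/td/text()').get())  # 'cell'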

I believe this has been asked before, and the answer was no: Scrapy can't do html5. If you need html5, you should try looking for something else.