Python 2.7: ContentDecodingError when fetching text with Beautiful Soup

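I have a list of URLs (from HuffPost UK) that I need to get the text from. I store them in a CSV file, but I have simply copied and pasted them below as a list. My code, which previously worked fine with other publishers, has two problems:

  • It stops at random with a ContentDecodingError
  • It fails at random to produce any text

I say "at random" because when I run it several times, it stops at a different URL each time. For the same URL it sometimes prints the text and sometimes prints an empty string. I have no idea what is going on. Can anyone tell me what is wrong? I would greatly appreciate your help.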

My code:

    import codecs
    import translitcodec
    import requests
    from bs4 import BeautifulSoup
    
    def get_text(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")    
        # delete unwanted tags:
        for s in soup(['h2', 'figure', 'script', 'style', 'table']):
            s.decompose()
        # use separator to separate paragraphs and subtitles!
        article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'content-list-component text'})]    
        text = ' '.join(article_soup)
        text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # transliterate, then replace anything still non-ASCII
        text = u"{}".format(text)  # convert the byte string back to unicode
        print text
        return text
    
    urls = ['http://www.huffingtonpost.co.uk/2017/06/21/damian-green-tories-housing-education_n_17244280.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/the-waugh-zone-thursday-june-22-2017_n_17253136.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/argos-toys-christmas-2017_n_17248026.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/ore-oduba-strictly-come-dancing-joanne-clifton_n_17253186.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/joanne-clifton-flashdance-strictly-come-dancing_n_17253268.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/grenfell-tower-cladding-may-have-released-hydrogen-cyanide_n_17252776.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/uk-will-have-to-trawl-through-19000-eu-laws-to-decide-which-ones-to-keep-after-brexit_n_17242732.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-theresa-may_n_17241446.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/piers-morgan-good-morning-britain-bbc-breakfast-dan-walker-ratings_n_17252222.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/worst-bridezilla-stories-ever-reddit_n_17253210.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/donald-trump-uk-state-visit-shelved-after-no-mention-in-queens-speech-2017_n_17239686.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/failure-may-state_n_17242710.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-13-things-missing-from-theresa-mays-first-one_n_17239692.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/heartbroken-best-man-gatecrashes-bride-and-grooms-wedding-photos-and-its-comedy-gold_n_17253104.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-mocks-theresa-mays-imploding-minority-government_n_17242692.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/22/asda-the-little-mermaid-swimsuit-topless_n_17253262.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/chaotic-brexit-theresa-may_n_17248024.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/the-waugh-zone-special-queens-speech-2017_n_17246444.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-residents-to-be-rehoused-in-luxury-kensington-row-flats_n_17242518.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/gin-does-not-help-relieve-hay-fever-experts-say_n_17243102.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/theresa-may-savoy_n_17227558.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/crewe-crane-collapse_n_17243884.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/rebecca-burger-french-fitness-blogger-killed-by-exploding-cream-dispenser_n_17253286.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/05/31/the-waugh-zone-may-31-201_0_n_16891450.html?ir=UK+Politics', 'http://www.huffingtonpost.co.uk/2017/06/22/theresa-may-reveals-tests-show-other-towers-combustible-following-grenfell-tower-fire_n_17253204.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/owen-jones-gleefully-brands-daily-mail-an-open-sewer_n_17253464.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/laura-kenny-interview-ambition-after-pregnancy_n_17252498.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/boris-johnson-radio-4-eddie-mair-two-ronnies_n_17245044.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-homes-theresa-may_n_17246764.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/dup-pushover-deal_n_17253218.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/khan-remain-rights_n_17243656.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/21/love-island-zara-holland-sex-miss-great-britain_n_17242768.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/man-sent-home-from-work-wearing-shorts_n_17243276.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/courteney-cox-fillers-surgery-face_n_17252410.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/jeremy-corbyn-observed-protocol-by-not-bowing-to-the-queen_n_17240658.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/alexandra-shulman-british-vogue-good-morning-britain-the-queen_n_17253200.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/teaching-excellence-framework-results-universities-gold-ranking_n_17253426.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/prince-harry-slams-decision-to-make-him-walk-behind-princess-dianas-coffin_n_17253188.html?utm_hp_ref=uk']
    for url in urls:
        print url
        text = get_text(url)
    
The error:

    ---------------------------------------------------------------------------
    ContentDecodingError                      Traceback (most recent call last)
    <ipython-input-12-54bdf2585415> in <module>()
         21 for url in urls:
         22     print url
    ---> 23     text = get_text(url)
    
    <ipython-input-12-54bdf2585415> in get_text(url)
          5 
          6 def get_text(url):
    ----> 7     r = requests.get(url)
          8     soup = BeautifulSoup(r.content, "lxml")
          9     # delete unwanted tags:
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/api.pyc in get(url, params, **kwargs)
         68 
         69     kwargs.setdefault('allow_redirects', True)
    ---> 70     return request('get', url, params=params, **kwargs)
         71 
         72 
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
         54     # cases, and look like a memory leak in others.
         55     with sessions.Session() as session:
    ---> 56         return session.request(method=method, url=url, **kwargs)
         57 
         58 
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
        486         }
        487         send_kwargs.update(settings)
    --> 488         resp = self.send(prep, **send_kwargs)
        489 
        490         return resp
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
        628 
        629         # Resolve redirects if allowed.
    --> 630         history = [resp for resp in gen] if allow_redirects else []
        631 
        632         # Shuffle things around if there's history.
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, **adapter_kwargs)
        188                 proxies=proxies,
        189                 allow_redirects=False,
    --> 190                 **adapter_kwargs
        191             )
        192 
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
        639 
        640         if not stream:
    --> 641             r.content
        642 
        643         return r
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/models.pyc in content(self)
        795                 self._content = None
        796             else:
    --> 797                 self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
        798 
        799         self._content_consumed = True
    
    /Applications/anaconda/lib/python2.7/site-packages/requests/models.pyc in generate()
        722                     raise ChunkedEncodingError(e)
        723                 except DecodeError as e:
    --> 724                     raise ContentDecodingError(e)
        725                 except ReadTimeoutError as e:
        726                     raise ConnectionError(e)
    
    ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing: incorrect header check',))
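
Since the error is intermittent, one possible stopgap (not the fix the answer ends up using) is to retry a failed download a few times, so that a single bad response does not stop the whole run. The following is a minimal sketch under that assumption; `fetch_with_retries` and its parameters are hypothetical names, and `fetch` stands for any callable that takes a URL, such as `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, retry_on=(Exception,), attempts=3, delay=1.0):
    """Call fetch(url), retrying when one of the retry_on exceptions is raised.

    Returns the result of fetch(url), or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            print("attempt {} of {} failed for {}: {!r}".format(
                attempt, attempts, url, exc))
            if attempt < attempts:
                time.sleep(delay)  # brief pause before trying again
    return None
```

With requests this might be called as `fetch_with_retries(requests.get, url, retry_on=(requests.exceptions.ContentDecodingError,))`; a `None` result marks a URL to report or revisit later.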
    
    
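I finally managed to solve this problem. To open each URL I needed to use Selenium with PhantomJS, which lets the page load properly.

Adding this piece of code before creating the soup solved the problem: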

    driver = webdriver.PhantomJS(executable_path='PATH TO phantomjs')
    driver.get(url) 
    waitForLoad(driver)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    
I also used the function waitForLoad(driver), as described in [reference missing].
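
The idea behind a wait helper like `waitForLoad` is plain polling: check a condition at a fixed interval and give up after a timeout. A minimal, generic sketch of that idea follows; `wait_until` and its parameters are illustrative names, not part of Selenium, and the condition to poll is left to the caller.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.time() + timeout
    while True:
        if condition():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

For instance, assuming the page reports its state through `document.readyState`, one could poll `lambda: driver.execute_script("return document.readyState") == "complete"`. Selenium also ships its own `WebDriverWait` class for this kind of wait.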

Here is the final working code:

    import codecs
    import translitcodec
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time
    from selenium.webdriver.remote.webelement import WebElement
    from selenium.common.exceptions import StaleElementReferenceException
    
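    def waitForLoad(driver):
        # Remember the <html> element of the page as it is right now.
        elem = driver.find_element_by_tag_name("html")
        count = 0
        while True:
            count += 1
            # 20 polls * 0.5 s each = give up after roughly 10 seconds.
            if count > 20:
                print("Timing out after 10 seconds and returning")
                return
            time.sleep(.5)
            try:
                # Comparing against the old element is meant to raise
                # StaleElementReferenceException once the page has been replaced.
                elem == driver.find_element_by_tag_name("html")
            except StaleElementReferenceException:
                return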
    
    def get_text(url):
        driver = webdriver.PhantomJS(executable_path='PATH TO phantomjs')
        driver.get(url) 
        waitForLoad(driver)
        html = driver.page_source
        soup = BeautifulSoup(html, "lxml") 
        # delete unwanted tags:
        for s in soup(['h2', 'figure', 'script', 'style', 'table']):
            s.decompose()
        # use separator to separate paragraphs and subtitles!
        article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'content-list-component text'})]    
        text = ' '.join(article_soup)
        text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # transliterate, then replace anything still non-ASCII
        text = u"{}".format(text)  # convert the byte string back to unicode
        print text
        return text
    
    urls = ['http://www.huffingtonpost.co.uk/2017/06/21/damian-green-tories-housing-education_n_17244280.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/the-waugh-zone-thursday-june-22-2017_n_17253136.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/argos-toys-christmas-2017_n_17248026.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/ore-oduba-strictly-come-dancing-joanne-clifton_n_17253186.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/joanne-clifton-flashdance-strictly-come-dancing_n_17253268.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/grenfell-tower-cladding-may-have-released-hydrogen-cyanide_n_17252776.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/uk-will-have-to-trawl-through-19000-eu-laws-to-decide-which-ones-to-keep-after-brexit_n_17242732.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-theresa-may_n_17241446.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/piers-morgan-good-morning-britain-bbc-breakfast-dan-walker-ratings_n_17252222.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/worst-bridezilla-stories-ever-reddit_n_17253210.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/donald-trump-uk-state-visit-shelved-after-no-mention-in-queens-speech-2017_n_17239686.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/failure-may-state_n_17242710.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-13-things-missing-from-theresa-mays-first-one_n_17239692.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/heartbroken-best-man-gatecrashes-bride-and-grooms-wedding-photos-and-its-comedy-gold_n_17253104.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/queens-speech-2017-jeremy-corbyn-mocks-theresa-mays-imploding-minority-government_n_17242692.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/22/asda-the-little-mermaid-swimsuit-topless_n_17253262.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/chaotic-brexit-theresa-may_n_17248024.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/the-waugh-zone-special-queens-speech-2017_n_17246444.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-residents-to-be-rehoused-in-luxury-kensington-row-flats_n_17242518.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/gin-does-not-help-relieve-hay-fever-experts-say_n_17243102.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/20/theresa-may-savoy_n_17227558.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/crewe-crane-collapse_n_17243884.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/rebecca-burger-french-fitness-blogger-killed-by-exploding-cream-dispenser_n_17253286.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/05/31/the-waugh-zone-may-31-201_0_n_16891450.html?ir=UK+Politics', 'http://www.huffingtonpost.co.uk/2017/06/22/theresa-may-reveals-tests-show-other-towers-combustible-following-grenfell-tower-fire_n_17253204.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/owen-jones-gleefully-brands-daily-mail-an-open-sewer_n_17253464.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/laura-kenny-interview-ambition-after-pregnancy_n_17252498.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/boris-johnson-radio-4-eddie-mair-two-ronnies_n_17245044.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/grenfell-tower-homes-theresa-may_n_17246764.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/dup-pushover-deal_n_17253218.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/khan-remain-rights_n_17243656.html?utm_hp_ref=uk', 
'http://www.huffingtonpost.co.uk/2017/06/21/love-island-zara-holland-sex-miss-great-britain_n_17242768.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/man-sent-home-from-work-wearing-shorts_n_17243276.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/courteney-cox-fillers-surgery-face_n_17252410.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/21/jeremy-corbyn-observed-protocol-by-not-bowing-to-the-queen_n_17240658.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/alexandra-shulman-british-vogue-good-morning-britain-the-queen_n_17253200.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/teaching-excellence-framework-results-universities-gold-ranking_n_17253426.html?utm_hp_ref=uk', 'http://www.huffingtonpost.co.uk/2017/06/22/prince-harry-slams-decision-to-make-him-walk-behind-princess-dianas-coffin_n_17253188.html?utm_hp_ref=uk']
    for url in urls:
        print url
        text = get_text(url)
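
One thing the code above leaves open is cleanup: it starts a new PhantomJS process for every URL and never closes it. A possible variation, sketched here under the assumption that reusing one browser for all URLs is acceptable, creates the driver once and always calls `quit()` at the end. `fetch_pages`, `make_driver` and `wait` are hypothetical names; `make_driver` is any zero-argument callable that returns a Selenium driver.

```python
def fetch_pages(urls, make_driver, wait=None):
    """Load each URL in a single browser session and return {url: html}."""
    driver = make_driver()
    pages = {}
    try:
        for url in urls:
            driver.get(url)
            if wait is not None:
                wait(driver)  # e.g. the waitForLoad helper
            pages[url] = driver.page_source
    finally:
        driver.quit()  # shut the browser down even if a page fails
    return pages
```

It might be called as `fetch_pages(urls, lambda: webdriver.PhantomJS(executable_path='PATH TO phantomjs'), wait=waitForLoad)`, with each returned HTML string then passed to BeautifulSoup as before.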