Python requests raises "Connection broken": ChunkedEncodingError with http.client.IncompleteRead when trying to download a file


I am trying to download a PDF file using the requests module, with the following code:

import requests

url = "<url of the pdf>"
r = requests.get(url, stream=True, timeout=(60, 120), headers={'Connection': 'keep-alive','User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136'})

print(r.headers)
print(r.status_code)

try:
    with open('blah.pdf', 'wb') as f:
        for chunk in r:
            # print(chunk)
            f.write(chunk)
except Exception as e:
    print(e)
Here is the full stack trace:

Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 755, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 709, in _handle_chunk
    self._fp._safe_read(2)  # Toss the CRLF at the end of the chunk.
  File "/storage/anaconda3/lib/python3.7/http/client.py", line 612, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 560, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/storage/anaconda3/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    for chunk in r:
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
The file downloaded with wget is corrupted as well.

Another thing I tried was inspecting the transfer with a combination of mitm and chromedriver + Selenium.

The automated Chrome browser fails to load the PDF and shows this error:

502 Bad Gateway
HttpSyntaxException('Malformed chunked body',)

How can I download this PDF using the requests module? Any help would be appreciated.

I had the same problem as you, and I don't know why it happens. I solved it with urllib:

import urllib.request
urllib.request.urlretrieve(url, 'foo_file.txt', data=your_queries)
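What urlretrieve does is fetch the data from the link and copy it to the file name and path given as the second argument. You can also change the extension to .pdf, .json, or another format.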
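If the server still cuts the body short, urllib can raise http.client.IncompleteRead as well. The following is a minimal sketch, not a tested fix for this particular server: it reads the body with urllib.request.urlopen and keeps whatever bytes arrived before the break, using the exception's partial attribute. The names read_tolerating_incomplete and download are hypothetical helpers, and url and filename are placeholders.

```python
import http.client
import urllib.request


def read_tolerating_incomplete(response):
    """Read a response body; return (data, complete).

    If the server ends the body early, keep the bytes received so far.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes that arrived before the stream broke.
        return exc.partial, False


def download(url, filename):
    """Save the body of url to filename; return True if the read was complete."""
    with urllib.request.urlopen(url, timeout=120) as response:
        body, complete = read_tolerating_incomplete(response)
    with open(filename, "wb") as f:
        f.write(body)
    return complete
```

A False return value only says the transfer ended early. Whether the saved file is usable still has to be checked by opening it.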



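I solved this after a few days. The server does not close the connection properly, so the Python libraries raise IncompleteRead. I downloaded the file with the curl installed on my system, passing --compressed along with all the necessary headers: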

from subprocess import call

pdf_url = ""
pdf_filename = ""
call(["curl", pdf_url, 
    '-H', 'Connection: keep-alive', 
    '-H', 'Cache-Control: max-age=0', 
    '-H', 'Upgrade-Insecure-Requests: 1', 
    '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36', 
    '-H', 'Sec-Fetch-Mode: navigate', 
    '-H', 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3', 
    '-H', 'Sec-Fetch-Site: cross-site', 
    '-H', 'Accept-Encoding: gzip, deflate, br', 
    '-H', 'Accept-Language: en-US,en;q=0.9,bn;q=0.8', 
    '-H', 'Cookie: bbb=rd102o00000000000000000000ffff978432aao80', 
    '--compressed', '--output', pdf_filename])
This uses the subprocess module. Even so, curl shows an error message like this:

curl: (18) transfer closed with outstanding read data remaining

However, the downloaded PDF can be opened with any PDF viewer.
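Because curl exits with code 18 (partial file) in this situation, a script may want to tell that case apart from real failures. Here is a hedged sketch using subprocess.run; curl_download and its runner parameter are hypothetical names introduced for illustration, and treating exit code 18 as acceptable is an assumption based on the observation above that the PDF still opens.

```python
import subprocess

# curl's exit code for "transfer closed with outstanding read data remaining"
CURL_PARTIAL_FILE = 18


def curl_download(url, filename, headers=(), runner=subprocess.run):
    """Download url with curl; return True if curl reported a complete transfer.

    Returns False when curl exits with code 18 (the file was written, but the
    server ended the transfer early). Raises RuntimeError for any other failure.
    """
    cmd = ["curl", url, "--compressed", "--output", filename]
    for header in headers:
        cmd += ["-H", header]
    result = runner(cmd)
    if result.returncode not in (0, CURL_PARTIAL_FILE):
        raise RuntimeError(f"curl failed with exit code {result.returncode}")
    return result.returncode == 0
```

For example, curl_download(pdf_url, pdf_filename, headers=['Connection: keep-alive']) would return False for the server described here, which signals that the file should be checked before it is trusted.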

This is different from the error with requests, i.e.
http.client.IncompleteRead: IncompleteRead(2290 bytes read)
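For the original question of staying within requests, one possible approach is to stream the body to disk and treat ChunkedEncodingError as the end of the data instead of a fatal error. This is only a sketch under the assumption that the bytes received before the break form a usable PDF, as they did with curl. The helper names write_chunks, has_pdf_markers and download_pdf are hypothetical, and the marker check is a rough heuristic, not a PDF validator.

```python
import requests


def write_chunks(chunks, f):
    """Write each chunk to f; return False if the stream broke before finishing."""
    try:
        for chunk in chunks:
            f.write(chunk)
    except requests.exceptions.ChunkedEncodingError:
        # Everything written before the error stays in the file.
        return False
    return True


def has_pdf_markers(data):
    """Heuristic: data starts with a PDF header and has an EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def download_pdf(url, filename):
    """Stream url to filename; return (complete, looks_like_pdf)."""
    with requests.get(url, stream=True, timeout=(60, 120)) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            complete = write_chunks(r.iter_content(chunk_size=8192), f)
    with open(filename, "rb") as f:
        return complete, has_pdf_markers(f.read())
```

If complete is False but looks_like_pdf is True, the server most likely sent the whole document and only mangled the end of the chunked stream.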