使用Python';缺少标题信息;s请求库

使用Python';缺少标题信息;s请求库,python,python-requests,Python,Python Requests,我正在使用Python的(3.5.2)请求库(2.12.4)向网站发布查询。以下是我为此任务编写的脚本: #!/usr/bin/env python import requests # BaseURL being accessed url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi' # Dictionary of query parameters data = { 'INPUT_SEQUENC

我正在使用Python的(3.5.2)请求库(2.12.4)向网站发布查询。以下是我为此任务编写的脚本:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data)
    for key, value in poster.headers.items():
        print(key, ':', value)
我需要从响应的头信息中检索NCBI RCGI RetryURL字段。然而,我只能在使用Google Chrome中的HTTP跟踪扩展时看到这个字段。以下是使用Google Chrome发布和回复的完整跟踪:

POST https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi
Origin: https://www.ncbi.nlm.nih.gov
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
Cookie: sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA=

HTTP/1.1 200 OK
Date: Tue, 29 Aug 2017 13:38:27 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: origin-when-cross-origin
Content-Security-Policy: upgrade-insecure-requests
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate
Expires: 0
NCBI-PHID: 0C421A7A9A56E5310000000000000001.m_2
NCBI-RCGI-RetryURL: https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi?ctg_time=1504013907&job_key=aWO2H68Wor6FhLSBueGQs8P6gYHu6Zqc7w
NCBI-SID: 0751457F9A561D01_0000SID
Pragma: no-cache
Access-Control-Allow-Methods: POST, GET, PUT, OPTIONS, PATCH, DELETE
Access-Control-Allow-Origin: https://www.ncbi.nlm.nih.gov
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin,X-Accept-Charset,X-Accept,Content-Type,X-Requested-With,NCBI-SID,NCBI-PHID
Content-Type: text/html
Set-Cookie: ncbi_sid=0751457F9A561D01_0000SID; domain=.nih.gov; path=/; expires=Wed, 29 Aug 2018 13:38:27 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
Keep-Alive: timeout=1, max=9
Connection: Keep-Alive
Transfer-Encoding: chunked
下面是我从脚本中获得的所有标题信息:

Date : Tue, 29 Aug 2017 14:41:08 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2516
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html
NCBI RCGI RetryURL字段很重要,因为它包含我需要执行GET请求才能检索结果的URL

编辑:

根据Maurice Meyer的建议更新脚本:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Extra headers
headers = {
    'Origin' : 'https://www.ncbi.nlm.nih.gov',
    'Upgrade-Insecure-Requests' : '1',
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer' : 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset',
    'Accept-Encoding' : 'gzip, deflate, br',
    'Accept-Language' : 'en-US,en;q=0.8',
    'Cookie' : 'sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA='
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data, headers=headers)
    for key, value in poster.headers.items():
        print(key, ':', value)
更新的输出,仍然没有区别:

Date : Tue, 29 Aug 2017 15:05:27 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2517
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

两者之间的请求数据完全不同

特别是请求主体数据。因此,使用Python的请求库并没有丢失头信息,而是缺少POST请求到服务器中的信息

您不能简单地复制和粘贴标题

'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
或者只需将数据
INPUT\u SEQUENCE
ORGANISM
像这样发布即可-而且在任何情况下,您对ORGANISM的数据显然是错误的-粗略一看就会发现它是小肌肉(taxid:10090)而不是小肌肉

因此,您需要查看整个请求头和主体,然后创建一个包含服务器所需数据的请求。从这个角度看,您丢失了服务器需要响应的一批又一批数据

------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="INPUT_SEQUENCE"

TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_LEFT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_RIGHT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MIN"

70
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MAX"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_NUM_RETURN"

10
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MIN_TM"

57.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_OPT_TM"

60.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_TM"

63.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_DIFF_TM"

3
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_ON_SPLICE_SITE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_5END"

7
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_3END"

4
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MIN_INTRON_SIZE"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MAX_INTRON_SIZE"

1000000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCH_SPECIFIC_PRIMER"

on
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCHMODE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_SPECIFICITY_DATABASE"

refseq_mrna
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOM_DB"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOMSEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="ORGANISM"

Mus musculus (taxid:10090)
------WebKitFormBoundaryJVAJqDi2cI4BTfmc

etc...

您有办法更新
请求
库吗?我无法在v2.18上重现您的问题。4@Leva7使用Python(3.6.2)从GitHub(2.18.4)上的最新源代码安装。恐怕没有区别。发送与Chrome相同的标题(用户代理、内容类型、引用者等),然后您将收到与Chrome相同的标题。@MauriceMeyer尝试包含标题(请参阅原始问题中的编辑),但标题输出仍然没有区别。@MauriceMeyer对我来说,这使问题变得更糟。虽然我在默认设置下得到的标题比OP多得多,但添加Chrome标题会使我的输出减少到OP的输出