Web scraping with urlopen in Python


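I am trying to get data from the following website:

It seems that urlopen does not get the HTML code, and I don't understand why. My code goes like this: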

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
print (html)

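My code is right: I get the HTML source of other web pages with the same code, but it does not seem to recognize this address.

It prints: b''

Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the web page?

Help, thanks!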

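I have tested your URL with httplib2 and with curl in the terminal. Both work fine:

import httplib2

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
h = httplib2.Http()
# request() returns the response headers and the body
resp, content = h.request(URL, "GET")
print(content)

So, to me, either there is a bug in urllib.request, or some very odd client-server interaction is going on.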
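To narrow down which of the two it is, one option is to look at everything urlopen actually received rather than only the body. The following is a minimal sketch using only the standard library; the helper name describe_response and the keys of the dictionary it returns are illustrative assumptions, not part of the original answer.

```python
import urllib.request


def describe_response(url, timeout=30):
    """Fetch url and summarise what came back (status, headers, body size)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()  # read() returns bytes in Python 3
        return {
            # Plain HTTP responses carry a status code; other schemes may not.
            "status": getattr(resp, "status", None),
            "headers": dict(resp.headers.items()),
            "body_length": len(body),
            "body_start": body[:80],
        }
```

Calling describe_response with the URL from the question and printing the result would show whether the body is really empty, or whether the headers (for example Content-Length or Content-Encoding) tell a different story.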

Personally, I write:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content

And, in French: bonjour sur stackoverflow.com.

Update 1: In fact, I now prefer to use the following code, because it is faster:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

try:
    conn.request('GET',req)
except Exception:
    # report the failed connection attempt
    print 'connection failed'

content = conn.getresponse().read()

print content

Changing httplib to http.client in this code (and turning the print statements into print() calls) should be enough to adapt it to Python 3.
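A minimal sketch of what that Python 3 adaptation could look like, using only http.client from the standard library. The function name fetch and its (status, body) return value are assumptions made for illustration; the answer itself only gives the Python 2.7 version.

```python
import http.client


def fetch(host, path, timeout=30):
    """GET http://<host><path> and return (status, body); body is bytes."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        # Close the connection whether or not the request succeeded.
        conn.close()
```

With the values from the answer, the call would be fetch('www.boursorama.com', '/includes/cours/last_transactions.phtml?symbole=1xEURUS'). Connection problems surface as exceptions (OSError and its subclasses) instead of being printed and ignored.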

I confirm that, with both of these codes, I obtain the source code containing the data you are interested in:

        <td class="L20" width="33%" align="center">11:57:44</td>

        <td class="L20" width="33%" align="center">1.4486</td>

        <td class="L20" width="33%" align="center">0</td>

</tr>

                                        <tr>

        <td  width="33%" align="center">11:57:43</td>

        <td  width="33%" align="center">1.4486</td>

        <td  width="33%" align="center">0</td>

</tr>

Update 2: Adding the following snippet to the code above will let you extract the data you want:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)


Result (only the end):

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'

[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

I hope you are not planning to "play" at Forex trading: it is one of the best ways to lose money quickly.

Update 3: Sorry! I forgot that you are on Python 3. So I think you must define the regex like this:

regx = re.compile(b'\t\t\t\t...)
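As an illustration of matching against bytes in Python 3, here is a minimal sketch. Note that it deliberately swaps in a looser pattern than the one in Update 2: it matches any three consecutive td cells holding a time, a price and a volume, instead of relying on exact tabs and class names. The names ROW and extract_rows, and the sample markup in the usage note, are assumptions for illustration.

```python
import re

# Bytes pattern (rb'...'): three consecutive <td> cells holding
# a time (hh:mm:ss), a price and a volume, whatever their attributes.
ROW = re.compile(
    rb'<td[^>]*>\s*(\d\d:\d\d:\d\d)\s*</td>\s*'
    rb'<td[^>]*>\s*([\d.]+)\s*</td>\s*'
    rb'<td[^>]*>\s*(\d+)\s*</td>'
)


def extract_rows(content):
    """Return a list of (time, price, volume) string tuples from bytes."""
    return [
        tuple(field.decode("ascii") for field in match)
        for match in ROW.findall(content)
    ]
```

Called on the bytes returned by the download code, extract_rows(content) yields tuples shaped like the ones in the result above. A regular expression this loose can also match unrelated table cells on other pages, so it is only a starting point.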

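In other words, the pattern string needs a b prefix; otherwise Python 3 raises a TypeError, because a str pattern cannot be used on bytes.

I suspect that the server is sending compressed data without telling you that it is doing so. Python's standard HTTP library does not decompress responses for you.
I suggest using httplib2, which can handle compressed formats (and is generally much better than urllib):

import httplib2
folder = httplib2.Http('.cache')
response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")

print(response)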

shows us the server's response:
{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

While this does not confirm that the data was compressed (we are now telling the server that we can handle compression, after all), it does lend some support to the theory.
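The compression theory can also be checked with the standard library alone. The sketch below is an assumption-laden illustration, not something from the answer: it asks for gzip explicitly, then decompresses the body only if it starts with the gzip magic bytes. The helper names maybe_gunzip and fetch_html are hypothetical.

```python
import gzip
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def maybe_gunzip(raw):
    """Decompress raw if it looks like gzip data, else return it unchanged."""
    return gzip.decompress(raw) if raw[:2] == GZIP_MAGIC else raw


def fetch_html(url, timeout=30):
    """Return (Content-Encoding header, decoded body) for url."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.headers.get("Content-Encoding"), maybe_gunzip(resp.read())
```

If fetch_html returns a non-empty body where a plain urlopen call returned b'', that would point towards the compression explanation; if both are empty, the cause lies elsewhere.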

The actual content lives, you guessed it, in content. A brief look at it shows us that it is working (I'm only going to paste a small part):

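b'

That is a typo; my real code includes "" to make it a string, and it does not work. My question is a real question.

@jazz It may be that the server is sending compressed data, and urllib is a bit limited.

@Robert: unfortunately not; the response is empty, and it works fine in Python 2 as well. It must be specific to Python 3 and urllib.request.

@jazz But the Content-Length on response = urllib.request.urlopen(url) is 7787, which suggests that more than just headers is being sent. Looking at it with an online tool shows the same content length, but also the actual source code, implying that this is not an empty response. Yes, urllib.request is a bit odd.

Merci, I will only be able to test the code tonight. Are you sure you get the HTML source code?

Yes, and I'm on Python 3 ;-)

@kingpin Yes, I see, I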
