
Python 2: the magic of the urllib2 library's read method

python python-2.7


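My program is supposed to scrape a bunch of web pages. We have a constant string and a generated string, and the two are equal. But the page source fetched with each of them, which should be the same text, suddenly compares as not equal.

The code is as follows: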

import urllib2

def generate_list_of_public_urls():
    response = urllib2.urlopen("http://vk.com/wall-54530371_2")
    error = response.read()

    gen_str = "http://vk.com/wall-54530371_" + str(2)
    response = urllib2.urlopen(gen_str)
    html = response.read()
    print gen_str == "http://vk.com/wall-54530371_2"
    print error == html

generate_list_of_public_urls()

The output is:

True
False

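Even if the page layout has not changed, and even if the content looks the same, check the page source.

At the very least, there is a piece of JavaScript, used to serve ads, that carries a timestamp: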

<script type="text/javascript">
var vk = {
  ads_rotate_interval: 120000,
  al: parseInt('3') || 4,
  id: 0,
  intnat: '1' ? true : false,
  host: 'vk.com',
  ...
  ts: 1404931575,
  pd: 0,
  pads: 1,
  time: [2014, 7, 9, 22, 46, 15]
}
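
To see exactly which lines differ between two fetches, you can diff them instead of comparing with `==`. Here is a minimal sketch using the standard library's `difflib`. The helper name `changed_lines` is hypothetical, and the two sample strings are made-up stand-ins for the two `response.read()` results (the second timestamp is invented for illustration).

```python
import difflib


def changed_lines(first, second):
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(first.splitlines(), second.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Made-up stand-ins for two downloads of the same URL.
first_fetch = "<html>\n  ts: 1404931575,\n  host: 'vk.com'\n</html>"
second_fetch = "<html>\n  ts: 1404931699,\n  host: 'vk.com'\n</html>"

for line in changed_lines(first_fetch, second_fetch):
    print(line)
```

If the only lines reported are things like `ts` and `time`, the two downloads differ only in per-request data, not in the content you care about.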



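As @vaultah pointed out in his comment, the page content really does change. If you are trying to scrape data, you can use VK's API, or use a tool such as BeautifulSoup to parse the page and target a specific div rather than comparing the whole document.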
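
The idea of comparing only one div can be sketched as follows. BeautifulSoup would make this shorter. To stay dependency-free, this sketch swaps in the standard library's `HTMLParser` instead. The class name `wall_post_text`, the helper `div_text`, and both sample pages are assumptions made up for illustration, not VK's actual markup.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first-level <div> carrying a given class."""

    def __init__(self, css_class):
        HTMLParser.__init__(self)
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the target
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, css_class):
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up pages: same post text, different per-request timestamp.
template = ('<html><script>var vk = {ts: %d}</script>'
            '<div class="wall_post_text">Hello, wall!</div></html>')
page_a = template % 1404931575
page_b = template % 1404931699

print(page_a == page_b)
print(div_text(page_a, "wall_post_text") == div_text(page_b, "wall_post_text"))
```

Comparing the extracted text rather than the raw source ignores the script block that changes on every request.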

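You do realize that the page may change between requests? VK has an API, by the way.

@vaultah But this page did not change. Please check the output.

It did change.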