Python: web scraping with urllib2

I am trying to scrape all the titles from this RSS feed (the URL is in the code below).

Here is my code for it:

import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles =  re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]

However, instead of a list of titles, the output is a large chunk of markup from the RSS source. What am I doing wrong?

You should use the non-greedy marker (?) in the expression:

#allTitles =  re.compile('<title>(.*)</title>')
allTitles =  re.compile('<title>(.*?)</title>')

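Without the ?, the (.*) group is greedy: it captures all the text up to the last </title> it can reach, rather than stopping at the first one.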
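To make the difference concrete, here is a minimal sketch that runs both patterns on a made-up one-line string. The `sample` text is an assumption for illustration, not the real feed:

```python
import re

# Hypothetical one-line sample standing in for the downloaded feed text.
sample = '<title>First</title><link>x</link><title>Second</title>'

# Greedy: (.*) runs on to the LAST </title> in the string.
greedy = re.findall(r'<title>(.*)</title>', sample)

# Non-greedy: (.*?) stops at the FIRST </title> after each <title>.
lazy = re.findall(r'<title>(.*?)</title>', sample)

print(greedy)  # ['First</title><link>x</link><title>Second']
print(lazy)    # ['First', 'Second']
```

The greedy result is a single long match containing markup, which is the "bunch of code" symptom described in the question.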

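As already noted, your regexp is missing the non-greedy specifier, and adding it fixes the problem. But I strongly recommend switching from regular expressions to a tool better suited to XML parsing, or to a dedicated RSS parsing module.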

For example, here is how the task can be done with lxml:

>>> import lxml.etree
>>> rss = lxml.etree.fromstring(content)
>>> titles = rss.findall('.//title')
>>> print '\n'.join(title.text for title in titles[:2])
Questions About Python (programming language) on Quora
Could someone explain for me the following Python function that uses @wraps from functools?
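If lxml is not installed, the standard library's xml.etree.ElementTree offers a similar `findall`. This is a swapped-in alternative, not what the answer above uses, and the `sample_rss` document is a small made-up feed rather than the live Quora one:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature RSS document used in place of the downloaded feed.
sample_rss = """<rss version="2.0"><channel>
<title>Feed title</title>
<item><title>First question</title></item>
<item><title>Second question</title></item>
</channel></rss>"""

root = ET.fromstring(sample_rss)

# './/title' selects every <title> element below the root, in document order.
titles = [element.text for element in root.findall('.//title')]

print('\n'.join(titles))
```

Note that the first entry is the feed's own title, just as in the lxml output above; the item titles follow it.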

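If I add the non-greedy marker to my code, I only get the first two titles from that link. How do I extract all of the text embedded in the <title> tags?

Oh, yes! My bad. Thank you for such an accurate and prompt answer :)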
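For anyone who hits the same "only two titles" symptom: the loop in the question runs over range(0, 2), so it prints just two entries however many the regex finds. A minimal sketch that prints every match, with a made-up `content` string assumed in place of the downloaded feed:

```python
import re

# Hypothetical stand-in for the text returned by urlopen(...).read().
content = '<title>A</title><title>B</title><title>C</title>'

titles = re.findall(r'<title>(.*?)</title>', content)

# Iterate over the matches themselves instead of a fixed range(0, 2).
for title in titles:
    print(title)
```

Naming the result `titles` rather than `list` also avoids shadowing the built-in `list` type.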