Not getting the expected result using findall in Python


python regex web-scraping


I am new to Python (using 2.7.3). I tried web scraping with Python, but I am not getting the expected result:

import urllib
import re
regex='<title>(.+?)<\title>'
pattern=re.compile(regex)
dummy="fsdfsdf<title>Test<\title>dsf"
html=urllib.urlopen('http://www.google.com')
text=html.read()
print pattern.findall(text)
print pattern.findall(dummy)


The second print statement works fine, but the first one, which should print Google, gives an empty list instead.

You have the wrong slash in the closing tag:

regex='<title>(.+?)<\title>'
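A small sketch of why the second print appeared to work while the first did not: in an ordinary (non-raw) string literal, `\t` is a tab character, so both the pattern and the `dummy` string contain `<`, a tab, then `itle>`. They match each other, but real HTML never contains that sequence. The `real_html` string below is a made-up stand-in for the downloaded page, so no network access is needed.

```python
import re

# The backslash-t in this literal is interpreted as a single TAB character.
broken = '<title>(.+?)<\title>'
print(repr(broken))
print('\t' in broken)

# dummy carries the same typo, so it also contains a TAB instead of "/t".
dummy = "fsdfsdf<title>Test<\title>dsf"

# Hypothetical stand-in for what html.read() returns from a real page.
real_html = "<html><head><title>Google</title></head></html>"

pattern = re.compile(broken)
print(pattern.findall(dummy))      # matches, because both sides contain the TAB
print(pattern.findall(real_html))  # no match, real HTML closes with </title>
```

Printing `repr()` of a pattern is a quick way to spot an escape sequence that was swallowed by the string literal.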

Try changing:

regex='<title>(.+?)<\title>'

to

regex='<title>(.+?)</title>'
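As a minimal sketch, the pattern with a forward slash in the closing tag can be checked against a sample string before pointing it at a live page. The raw-string prefix and the `re.IGNORECASE | re.DOTALL` flags are additions for illustration, not part of the original answer, and `sample_html` is a hypothetical stand-in for `html.read()`.

```python
import re

# Corrected pattern: forward slash in the closing tag.
# The r'' prefix is optional here but guards against stray escapes like \t.
# IGNORECASE accepts <TITLE>; DOTALL lets the title span several lines.
TITLE_RE = re.compile(r'<title>(.+?)</title>', re.IGNORECASE | re.DOTALL)

# Hypothetical stand-in for the text returned by html.read().
sample_html = "<html><head><TITLE>Google</TITLE></head><body></body></html>"

print(TITLE_RE.findall(sample_html))
```

Testing against a fixed string first separates regex mistakes from download or encoding problems.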

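Use an HTML parser rather than regex.

Why not use ... or ...?

It is also good practice to use more descriptive variable names when you are asking people to help you with your code.

Thanks, that was a really silly mistake. I wasted a lot of time reading the whole re package section of the Python documentation, but never checked my regular expression.

This is why you should use an HTML parser rather than regular expressions to parse HTML: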

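import urllib
from bs4 import BeautifulSoup

url = 'http://www.google.com'  # the page from the question
response = urllib.urlopen(url)
# Pass the charset from the HTTP headers so the page is decoded correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text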
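If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in alternative, not the answer's method, and it assumes Python 3 (the module is named differently in Python 2.7). `TitleParser`, `extract_title`, and `sample_html` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if no title text was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip() or None


# Hypothetical stand-in for a downloaded page.
sample_html = "<html><head><TITLE>Google</TITLE></head><body><p>hi</p></body></html>"
print(extract_title(sample_html))
```

Unlike the regex, this does not depend on the exact spelling or case of the tag text in the source.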

In short, the original line

regex='<title>(.+?)<\title>'

becomes

regex='<title>(.+?)</title>'