Why won't a Python regex work on a formatted HTML string?


python regex


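I can download and print the HTML just fine, but when I add the last two lines, the script always breaks.

This is the code, followed by the error I get: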

from bs4 import BeautifulSoup
import urllib
import re

soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)

Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\framework\scriptutils.py", line 323, in RunScript
    run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals, locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Thanks a lot.

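You probably need

re.search("\\$.", str(souped))

because souped is an object, and print merely converts it to text for display. To use it as text in any other context, as you are doing here, convert it first with str(souped), or with unicode(souped) if it holds a Unicode string.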

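You can pass a regular expression as a search criterion to find():

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen # from urllib.request import urlopen
>>> import re
>>> page = urlopen("http://atlanta.craigslist.org/cto/")
>>> soup = BeautifulSoup(page)
>>> soup.find('p', text=re.compile(r"\$."))
' -\n\t\t\t $7500'

soup.p returns a Tag object. You can convert it to a string with str():

>>> p = soup.p
>>> str(p)
'<p class="row">\n<span class="ih" id="images:5Nb5I85J83N73p33H6
c2pd3447d5bff6d1757.jpg">\xa0</span>\n<a href="http://atlanta.cr
aigslist.org/nat/cto/2870295634.html">2000 Lexus RX 300</a> -\n\
t\t\t $7500<font size="-1"> (Buford)</font> <span class="p"> pic
\xa0img</span><br class="c" />\n</p>'
>>> re.search(r"\$.", str(p)).group(0)
'$7'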
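The '$7' result above is what the pattern asks for: a dollar sign followed by exactly one character. If the goal is the whole price (an assumption, since the question does not say), a pattern such as \$\d+ may be closer to what is wanted. A minimal sketch, written for Python 3 and using a shortened copy of the HTML string shown above instead of a live download:

```python
import re

# Shortened copy of the str(p) output from the answer (image span omitted).
html = (
    '<p class="row">\n'
    '<a href="http://atlanta.craigslist.org/nat/cto/2870295634.html">'
    '2000 Lexus RX 300</a> -\n\t\t\t $7500'
    '<font size="-1"> (Buford)</font></p>'
)

# "\$." matches the dollar sign plus exactly one following character.
one_char = re.search(r"\$.", html).group(0)

# "\$\d+" matches the dollar sign plus every digit that follows it.
full_price = re.search(r"\$\d+", html).group(0)

# findall() collects every such price in the string, not just the first.
all_prices = re.findall(r"\$\d+", html)

print(one_char, full_price, all_prices)
```

The names html, one_char, full_price and all_prices are illustrative only. Prices written with thousands separators or cents would need a wider pattern.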

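To expand on this: BeautifulSoup objects have a __str__() method that converts them to strings, which is why they print fine (print calls it automatically). They are not actually strings, though, and re.search() requires a string, so you have to convert the HTML to a string explicitly before you can search it.

+1, but I would use unicode() rather than str() where possible, and add the re.U flag.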
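The difference between printing an object and searching it can be reproduced without BeautifulSoup. The sketch below is written for Python 3 and uses a hypothetical FakeTag class, which is not part of bs4 and only stands in for any object that defines __str__():

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed HTML element (not part of bs4)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # print() and str() both end up calling this method.
        return self.markup


tag = FakeTag('<p class="row">2000 Lexus RX 300 - $7500 (Buford)</p>')

# print() converts the object to text implicitly, so this works.
print(tag)

# re.search() performs no such conversion and rejects non-string input.
try:
    re.search(r"\$.", tag)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Converting explicitly gives re.search() the string it requires.
first_hit = re.search(r"\$.", str(tag)).group(0)
print(first_hit)
```

On Python 3 the wording of the TypeError differs from the Python 2 message in the question, but the cause is the same.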
