Python: &pound; showing up when scraping with urllib2 and Beautiful Soup


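I'm trying to write a small web scraper in Python, and I think I've run into an encoding problem. I'm trying to scrape a page (specifically, a table on the page). A row might look like this: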

    <tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>&pound;55.00</strong></p>
        </td>
       </tr>



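I'm basically trying to replace &pound;55.00 with £55.00, and to do the same for any other "non-text" nasties.
I've tried a few of the different encoding approaches you can use with beautifulsoup and urllib2, but to no avail; I think I'm just doing it all wrong.
Thanks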
I used requests for this, but hopefully you can achieve the same with urllib2. Here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code, using BeautifulSoup 3.

import requests
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(requests.get('your_url').text)
chart = soup.findAll(name='tr')  # every table row on the page
print str(chart).replace('&pound;', unichr(163))  # replace '&pound;' with '£'

Now you should get the expected output.
Sample output:
...
<strong>£71.50</strong></p>
...
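
For readers on Python 3, where urllib2 has become urllib.request, the fetch step might look roughly like the sketch below. The names fetch_html and decode_page are hypothetical helpers introduced for illustration, and falling back to UTF-8 when the server declares no charset is an assumption, not something the page in the question is known to need.

```python
from html import unescape
from urllib.request import urlopen


def decode_page(raw, charset=None):
    """Turn response bytes into text (hypothetical helper).

    Falls back to UTF-8 when no charset is given; that default is an assumption.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Download a page and decode it using the charset from the HTTP headers."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset()
        return decode_page(response.read(), charset)


# Offline demonstration on literal bytes, so no network access is needed.
page_text = unescape(decode_page(b"<strong>&pound;55.00</strong>"))
latin1_text = decode_page(b"\xa355.00", "latin-1")
```

Decoding the bytes and unescaping the entities are two separate steps: the first deals with the character encoding of the response, the second with HTML entities such as &pound;.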


Anyway, as for the parsing, you can do it in many ways. The interesting part here is print str(chart).replace('&pound;', unichr(163)), which was quite a challenge :)
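
To make the replace step concrete, here is a small Python 3 sketch of the same idea (chr takes the place of Python 2's unichr). The ENTITY_TO_CHAR table and replace_known_entities are hypothetical names, and the table is deliberately incomplete: it only covers the entities that appear in the question.

```python
# Python 3 sketch: chr() replaces Python 2's unichr().
row = ("<strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong> "
       "<strong>&pound;55.00</strong>")

# A single replace call handles exactly one entity and leaves the rest alone.
pound_only = row.replace("&pound;", chr(163))

# Every further entity needs its own entry (illustrative, not exhaustive).
ENTITY_TO_CHAR = {
    "&pound;": "\u00a3",  # pound sign
    "&ndash;": "\u2013",  # en dash
    "&nbsp;": "\u00a0",   # non-breaking space
}


def replace_known_entities(text, table=ENTITY_TO_CHAR):
    """Replace each entity listed in the table; unknown entities are untouched."""
    for entity, char in table.items():
        text = text.replace(entity, char)
    return text


cleaned = replace_known_entities(row)
```

The approach works for a known, short list of entities, but anything missing from the table passes through unchanged.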
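Update
If you want to unescape more than one character (or even just one), such as dashes, pound signs and so on, it is easier and more efficient to use a parser, as in Padraic's answer. As you will also read in the comments, parsers sometimes take care of other encoding issues as well.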
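
As a rough illustration of the parser-based route, the sketch below uses only the Python 3 standard library. CellTextParser is a hypothetical class written for this example (it is not taken from either answer), and the sample row is a trimmed version of the one in the question.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, with entities already decoded."""

    def __init__(self):
        # With convert_charrefs=True the parser hands handle_data() text in
        # which &pound;, &ndash;, &nbsp; etc. are already real characters.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._current = None  # text chunks collected while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            self.cells.append("".join(self._current).strip())
            self._current = None


row = """<tr>
  <td><p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p></td>
  <td><p><strong>&pound;55.00</strong></p></td>
</tr>"""

parser = CellTextParser()
parser.feed(row)
parser.close()
cells = parser.cells
```

Because the parser decodes every entity it knows about, no per-entity replace calls are needed.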
You want to unescape the HTML, which you can do in Python 3 with html.unescape:
In [14]: from html import unescape

In [15]: h = """<tr>
   ....:         <td style="width:64.9%;height:11px;">
   ....:          <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
   ....:         </td>
   ....:         <td style="width:13.1%;height:11px;">
   ....:          <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
   ....:         </td>
   ....:         <td style="width:15.42%;height:11px;">
   ....:          <p><strong>various</strong></p>
   ....:         </td>
   ....:         <td style="width:6.58%;height:11px;">
   ....:          <p><strong>&pound;55.00</strong></p>
   ....:         </td>
   ....:        </tr>"""

In [16]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>
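
One detail worth knowing about the output above: html.unescape turns &nbsp; into U+00A0 (a non-breaking space), which prints like a space but does not compare equal to one. If plain spaces are wanted, a minimal sketch of an optional clean-up step follows; whether to normalize at all is a judgment call that neither answer makes.

```python
from html import unescape

cell = "the great escape 2017&nbsp; local early bird tickets, selling fast"
text = unescape(cell)

# &nbsp; is now U+00A0, not an ordinary space.
has_nbsp = "\xa0" in text

# Optional: turn it into a plain space and collapse runs of whitespace.
normalized = " ".join(text.replace("\xa0", " ").split())
```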


For Python 2, use:
In [6]: from HTMLParser import HTMLParser

In [7]: unescape = HTMLParser().unescape  

In [8]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>


You can see that all of the entities are unescaped correctly, not just the pound sign.
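
If the same script has to run on both Python 2 and Python 3, one possible arrangement is an import with a fallback, sketched below. It assumes a stock Python 2 installation with no third-party package named html. Note also that HTMLParser().unescape was removed in Python 3.9, so html.unescape is the only option on current Python 3.

```python
try:
    from html import unescape          # Python 3.4 and later
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 standard library
    unescape = HTMLParser().unescape

price = unescape("&pound;55.00")
dates = unescape("18&ndash; 20 may")
```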
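Comments:
This is not how you want to unescape HTML: it means calling replace for every escaped entity on the page, and the initial str call could itself cause an encoding error. I would also discourage using BeautifulSoup3.
I respect your comment, but I would disagree with you. If you look here, you will see that those ready-made libs do the same thing I did in one line of code; the difference is that they hand you a ready-made result, which I personally do not prefer. In some cases they do the job, but this is a very specific and simple task. As for bs3 rather than bs4, it doesn't really matter for what the OP wants to do. But I still respect your opinion!
"I'm basically trying to replace &pound;55.00 with £55.00, and any other 'non-text' nasties." The other "non-text" nasties are escaped entities, which could be any one of numerous entities. Just as importantly, bs3 is broken and no longer maintained.
@PadraicCunningham, well, I guess you are right. I really didn't pay attention to "and any other 'non-text' nasties". If he needs to do this for many entities, I would use a parser too, as in your example. I will re-edit my answer. Thanks for your feedback/suggestions :)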