Python 3.x scraping a forum: can't scrape posts that contain tables


python-3.x web-scraping


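I have almost finished writing my first scraper.

However, I have hit a snag: I can't seem to get the contents of posts that contain a table (in other words, posts that quote another post).

This is the code that extracts the post contents from the soup object. It works well: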

    def getPost_contents(soup0bj):

        try:
            post_contents = []

            # recursive expects a boolean; the string 'True' only worked because it is truthy
            for content in soup0bj.findAll('', {'class': 'post_content'}, recursive=True):
                post_contents.append(content.text.strip())

        ...  # Error management

        return post_contents

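Below is an example of what I need to scrape (highlighted in yellow):

(URL, just in case:)

How do I get the content I highlighted? Why doesn't my current getPost_contents function work in this particular case? As far as I can see, the string is still under div class=post_content.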

EDIT

This is how I get my BeautifulSoup object:

    import time
    from urllib.request import urlopen

    from bs4 import BeautifulSoup as Soup

    def getHTMLsoup(url):

        try:
            html = urlopen(url)
        ...  # Error management

        try:
            # 'replace' silently swaps undecodable bytes for U+FFFD
            soup0bj = Soup(html.read().decode('utf-8', 'replace'))
            time.sleep(5)
        ...  # Error management

        return soup0bj

EDIT 2

These are the relevant parts of the scraper: (sorry for the dump!)

The problem is your decoding: the data is not valid utf-8. If you remove "replace", your code fails with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 253835: invalid continuation byte
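
To see why "replace" hides the problem instead of fixing it, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the forum page), and `decode_strict` is a hypothetical helper, not part of any library:

```python
# Hypothetical sample: French text encoded as Latin-1, so each accented
# letter is a single byte (0xe8, 0xe9) that is not valid UTF-8 here.
raw = "diabète isolé".encode("latin-1")


def decode_strict(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict_result = decode_strict(raw)              # strict decoding fails
lossy_result = raw.decode("utf-8", "replace")   # bad bytes become U+FFFD

print(strict_result)
print(lossy_result)
```

With "replace" every undecodable byte turns into the replacement character U+FFFD, so the original accented letters are gone for good; the error is silenced, not solved.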

The data appears to be latin-1 encoded. Decoding as latin-1 raises no error, but the output looks off in places. Using

 html  = urlopen(r).read().decode("latin-1")

will work, but as I mentioned, you get strange output such as:

"diabĂ¨te en cas d'accident de la route ou malaise isolĂŠ ou autre ???"

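One possible explanation for garbling of this shape, offered as an assumption rather than something the thread confirms: a two-character sequence in place of each accented letter is what you typically get when UTF-8 bytes are decoded with a single-byte codec. A small sketch with a made-up sample string:

```python
# Hypothetical sample text; the real page content is not reproduced here.
text = "diabète isolé"
utf8_bytes = text.encode("utf-8")  # "è" becomes the two bytes 0xc3 0xa8

# Decoding UTF-8 bytes with a single-byte codec never raises an error,
# but each accented letter turns into two unrelated characters.
as_latin1 = utf8_bytes.decode("latin-1")
as_latin2 = utf8_bytes.decode("iso-8859-2")

# If the only damage is a wrong single-byte decode, it is reversible:
repaired = as_latin1.encode("latin-1").decode("utf-8")

print(as_latin1)
print(as_latin2)
print(repaired)
```

The round trip at the end only works when the whole string was mis-decoded the same way; it is not a general repair tool.
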
Another option is to pass an Accept-Charset header:

from urllib.request import Request, urlopen
headers = {"accept-charset":"utf-8"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
html  =  urlopen(r).read()

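I get exactly the same encoding problem when I use requests and let it handle the encoding. It is as if the data has mixed encodings, some utf-8 and some latin-1. The headers returned by requests show the content encoding as gzip: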

 'Content-Encoding': 'gzip'

If we specify that we want gzip and decode it ourselves:

from urllib.request import Request, urlopen
headers = {"Accept-Encoding":"gzip"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
r = urlopen(r)

import gzip
gzipFile = gzip.GzipFile(fileobj=r)

print(gzipFile.read().decode("latin-1"))

We get the same error with utf-8 and the same strange output when decoding as latin-1. Interestingly, in Python 2 both requests and urllib work fine.
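
If the page really does mix utf-8 and latin-1, as suspected earlier in this answer, one workaround is to decode it piece by piece. The following is only a sketch under that assumption; `decode_mixed` is a hypothetical helper and the sample bytes are made up, not taken from the forum:

```python
def decode_mixed(data, primary="utf-8", fallback="latin-1"):
    """Decode bytes line by line, trying `primary` first and falling
    back to `fallback` for lines that are not valid in `primary`."""
    lines = []
    for line in data.split(b"\n"):
        try:
            lines.append(line.decode(primary))
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so this cannot fail
            lines.append(line.decode(fallback))
    return "\n".join(lines)


# Hypothetical input: first line is UTF-8, second line is Latin-1.
sample = "diabète".encode("utf-8") + b"\n" + "isolé".encode("latin-1")
decoded = decode_mixed(sample)
print(decoded)
```

This helps only when each line uses a single encoding. A line that mixes both will fall back to latin-1 as a whole and its utf-8 parts will still come out garbled.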

Using chardet:

r = urlopen(r)
import chardet
print(chardet.detect(r.read()))

chardet is about 71% confident that the encoding is ISO-8859-2, but decoding with that gives the same bad output:

{'confidence': 0.711104254322944, 'encoding': 'ISO-8859-2'}

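Comments:

What is your current output? I see this post extracted just like the others.

It works for me too, although it still gets stuck on the earlier text.

Hey, thanks for asking! Right now it outputs all the posts except those that have a table inside div class=post_content. Any guess what my problem is?

@Gabriel, what version of BeautifulSoup? How are you getting the html?

Thanks! Could you clarify where "headers" and urlopen(r).read() go in my code? I am not (yet) familiar with this approach.

@Gabriel, use them the way I do in the answer.

@Gabriel, no worries. Are you seeing the strange output too?

Yes, but only when I ask it to encode as utf-8. I wonder whether being in a different locale has anything to do with it.

@Gabriel, the only reason you don't see it when decoding as utf-8 with replace is probably that you are losing those characters. Do you see output like diabĂĂte?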
