Trying to get the encoding from a web page with Python and BeautifulSoup


python character-encoding


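I'm trying to retrieve the charset from a web page (it changes all the time). At the moment I parse the page with BeautifulSoup and then extract the charset from the header. That worked until I ran into a page that had: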

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

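Does anyone know how to add to my code so that it retrieves the charset from the example above? Would tokenizing the value and pulling the charset out that way be a good idea? How would you do it without changing the whole function? Right now my code returns "text/html; charset=utf-8", which causes a LookupError because that is not a known encoding.

Thanks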

The code I ended up using is:

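    import re
    import chardet

    def get_encoding(soup):
        # soup.meta is only the first <meta> tag in the document.
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                # content may be absent, so default to an empty string
                content = soup.meta.get('content') or ''
                match = re.search('charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    # Last resort: guess. chardet.detect() expects bytes;
                    # the raw response bytes would be a better input.
                    dic_of_possible_encodings = chardet.detect(soup.encode())
                    encod = dic_of_possible_encodings['encoding']
        return encod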
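The LookupError from the question comes from handing the whole `content` value to Python as if it were a codec name. As a minimal standard-library sketch of separating the two, the block below parses the value as a Content-Type header and then checks that Python knows the resulting name. The function name `charset_from_content` is made up for illustration and is not part of BeautifulSoup.

```python
import codecs
from email.message import Message


def charset_from_content(content):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset is None:
        return None
    try:
        codecs.lookup(charset)  # raises LookupError for unknown names
    except LookupError:
        return None
    return charset


print(charset_from_content("text/html; charset=utf-8"))
```

Returning None for an unknown name leaves the decision about a fallback (chardet, a default, or an error) to the caller.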

In my case, `soup.meta` only returned the first `meta` tag found in the soup. Below is an extension of @Fruit's answer that finds the `charset` in any `meta` tag of the given `html`:

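    from bs4 import BeautifulSoup
    import re

    def get_encoding(soup):
        """Return the lower-cased charset declared in any <meta> tag, or None."""
        encoding = None
        if soup:
            for meta_tag in soup.find_all("meta"):
                # HTML5 form: <meta charset="utf-8">
                encoding = meta_tag.get('charset')
                if encoding:
                    break
                encoding = meta_tag.get('content-type')
                if encoding:
                    break
                # HTML4 form: <meta http-equiv="Content-Type" content="...; charset=...">
                content = meta_tag.get('content')
                if content:
                    match = re.search('charset=(.*)', content)
                    if match:
                        encoding = match.group(1)
                        break
        if encoding:
            # cast to str if type(encoding) == bs4.element.ContentMetaAttributeValue
            return str(encoding).lower()
        return None

    # Example input; in practice html is the downloaded page source.
    html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    soup = BeautifulSoup(html, 'html.parser')
    print(get_encoding(soup))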
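
Once a charset has been found, it still has to be a name Python recognises before it can be used to decode the page. The following is a small sketch of that last step, assuming the raw page bytes are at hand. `decode_page` and its `fallback` parameter are hypothetical names chosen for this example.

```python
import codecs


def decode_page(raw, declared=None, fallback="utf-8"):
    """Decode raw bytes with the declared encoding if Python knows it."""
    try:
        # codecs.lookup() normalises aliases such as "ISO-8859-1"
        codec = codecs.lookup(declared).name if declared else fallback
    except LookupError:
        # e.g. the whole "text/html; charset=utf-8" string was passed in
        codec = fallback
    # errors="replace" keeps going on bytes that do not fit the codec
    return raw.decode(codec, errors="replace")


page = "café".encode("latin-1")
print(decode_page(page, "iso-8859-1"))
```

Falling back to UTF-8 is only one possible choice. A stricter caller might prefer to raise instead of silently replacing undecodable bytes.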

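I've used chardet, but I want 100% accuracy, so I'd like to try getting the encoding from the page itself.

Awesome, thank you very much. I really need to learn some regex.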
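
On the regex itself: `charset=(.*)` is greedy, so it captures everything after `charset=` up to the end of the string, including any further parameters. A hedged sketch of a tighter pattern follows. `CHARSET_RE` and `extract_charset` are illustrative names, and the character class is an assumption about what encoding names look like, not an official grammar.

```python
import re

# Stop at the first character that cannot be part of an encoding name,
# and tolerate spaces, quotes and upper case around the "=".
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def extract_charset(content):
    """Return the lower-cased charset in a content value, or None."""
    match = CHARSET_RE.search(content or "")
    return match.group(1).lower() if match else None


value = "text/html; charset=ISO-8859-1; foo=bar"
greedy = re.search("charset=(.*)", value).group(1)
print(repr(greedy))            # includes the trailing "; foo=bar"
print(extract_charset(value))  # only the encoding name
```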

    import re

    def get_encoding(soup):
        if soup and soup.meta:
            encod = soup.meta.get('charset')
            if encod is None:
                encod = soup.meta.get('content-type')
                if encod is None:
                    # content may be missing entirely, so default to ''
                    content = soup.meta.get('content') or ''
                    match = re.search('charset=(.*)', content)
                    if match:
                        encod = match.group(1)
                    else:
                        raise ValueError('unable to find encoding')
        else:
            raise ValueError('unable to find encoding')
        return encod
