Python: How to get an accurate title from a web page without including site data


I found a post (and a few others) about reading HTML with BeautifulSoup. It mostly does what I want, which is to grab the title of a web page:

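import requests
from bs4 import BeautifulSoup


def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        # Name the parser explicitly so the result does not depend on what is installed
        contents = BeautifulSoup(html, "html.parser")
        # contents.title is None when the page has no <title> tag
        if contents.title is not None:
            return contents.title.string
    return None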

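The problem I'm running into is that the title sometimes comes back with metadata appended at the end, in the form " - some_data". A good example is an article from BBC Sport, whose title is returned as

Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport

I could do something simple, like cutting off everything after the last '-' character: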

title = title.rsplit(' - ', 1)[0]

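But this assumes that the metadata always follows a " - " separator. I don't want to assume that there will never be an article whose own title ends with " - part_of_title".

I also found another option, but it does far more than I need: all I need is to grab a title and make sure it matches what the user posted. A friend pointed me to Newspaper3k, but he also mentioned that it can be buggy and does not always find the title correctly, so I would lean toward using something else if possible.

My current thinking is to keep using BeautifulSoup and add something on top of it that would also help with minor spelling or punctuation differences. However, I would certainly prefer to start from an accurate title to compare against.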
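The comparison step described above could look roughly like the following sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever fuzzy-matching add-on is eventually chosen; the helper names and the `0.9` threshold are illustrative assumptions, not part of the question.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def title_similarity(posted, scraped):
    """Return a similarity score between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize(posted), normalize(scraped)).ratio()


def titles_match(posted, scraped, threshold=0.9):
    # threshold is an arbitrary starting point and would need tuning
    return title_similarity(posted, scraped) >= threshold
```

A score like this tolerates small punctuation or spelling differences, but an appended site name still lowers it, so it works best once the scraped title has already been cleaned up.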

Here is how reddit handles title data; the `extract_title` function is reproduced below.

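You already have the accurate title. That title was chosen by the site's creator. If you want only the relevant part of the title, you need to define what counts as relevant. Even a human cannot do that reliably for every page, let alone a computer. What you can do is detect that, across many pages, one part of the title (the start or the end) is always the same, and then remove that part.

It isn't consistent across sites, so I can't count on a single pattern. And while I may already have an "accurate" title, the title users see when they click the link does not include that extra part. What I want is the title a user sees when they open the link.

Thanks for the code; this is very helpful!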
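The suggestion in the first comment (find the part that many pages share, then remove it) can be sketched for the "end of the title" case as follows. This is a minimal sketch under the assumption that `<title>` strings from several other pages of the same site are already available; the function names and the `min_length` guard are hypothetical.

```python
import os.path


def common_suffix(titles):
    """Return the longest string that every title ends with."""
    # os.path.commonprefix compares character by character, so reversing
    # each title turns the prefix search into a suffix search.
    reversed_titles = [t[::-1] for t in titles]
    return os.path.commonprefix(reversed_titles)[::-1]


def strip_shared_suffix(title, other_titles, min_length=4):
    """Remove from `title` the suffix it shares with `other_titles`.

    `other_titles` are <title> strings from other pages of the same site.
    `min_length` is an arbitrary guard against trimming a coincidental
    overlap of only a few characters.
    """
    if not other_titles:
        return title
    suffix = common_suffix([title] + list(other_titles))
    if len(suffix) < min_length or len(suffix) >= len(title):
        return title
    return title[: -len(suffix)].rstrip()
```

Because the comparison is character based, it can also swallow letters that the sampled headlines happen to share just before the site name, so a larger and more varied sample of pages gives a cleaner result.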

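import re

# convertEntities and HTML_ENTITIES below are BeautifulSoup 3 (Python 2) API
from BeautifulSoup import BeautifulSoup


def extract_title(data):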
    """Try to extract the page title from a string of HTML.
    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        #             left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
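The snippet above relies on the BeautifulSoup 3 API (`convertEntities`, `BeautifulSoup.HTML_ENTITIES`), so it will not run unchanged against `bs4`. The trimming step itself depends only on `re`, and a Python 3 sketch of just that part might look like this. The name `trim_site_name` is an assumption for illustration; the delimiter set and the "less than half the title" guard are taken from the code above.

```python
import re

# Same delimiter set as the snippet above: |, -, en dash, em dash, and the
# left and right double angle quotation marks, with whitespace on both sides.
DELIMITER = re.compile(r"\s[\u00ab\u00bb\u2013\u2014|-]\s")


def trim_site_name(title):
    """Drop a trailing site name if it is shorter than half the title."""
    # Searching the reversed string makes the first hit the last delimiter.
    to_trim = DELIMITER.search(title[::-1])
    if to_trim and to_trim.end() < len(title) / 2:
        title = title[:-to_trim.end()]
    # Collapse runs of whitespace, as the original function does.
    return re.sub(r"\s+", " ", title).strip()


print(trim_site_name(
    "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"
))
```

The guard only prevents removing more than half of the title. A headline whose own last segment follows a delimiter and is short would still be trimmed, which is the case the question is worried about, so preferring `og:title` when it is present remains the safer first choice.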