How can I get the text of a Wikipedia article using Python 3 and BeautifulSoup?

python html web-scraping

I wrote this script in Python 3:

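from bs4 import BeautifulSoup

# simple_get is the asker's own helper (not shown here); it returns the raw HTML or None
url = "https://en.wikipedia.org/wiki/Mathematics"
response = simple_get(url)
result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text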

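As you can see, I can get the title of the article, but I cannot figure out how to get the text from "Mathematics (from Greek μά..." up to the table of contents.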

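Select the <p> tags. There are 52 of them. I am not sure whether you want all of the text, but you can iterate over those tags to store it. I simply print them here to show the output.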

import bs4
import requests


response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

if response is not None:
    html = bs4.BeautifulSoup(response.text, 'html.parser')

    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join(para.text for para in paragraphs[0:5])
    print(intro)

Use the wikipedia library.

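There is a very, very simple way to get information from Wikipedia: the Wikipedia API.

There is a Python wrapper for it, wikipediaapi, that lets you do this in a few lines with zero HTML parsing: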

import wikipediaapi

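# Recent releases of Wikipedia-API expect a descriptive user agent in addition to the language.
# The user-agent string below is a placeholder; replace it with your own.
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyProjectName (me@example.com)', language='en')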

page = wiki_wiki.page('Mathematics')
print(page.summary)

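Prints:

Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change ... (intentionally omitted)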

In general, try to avoid screen scraping when a direct API is available.

Using the lxml library, you can get the desired output as follows:

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Mathematics"

res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)

Using BeautifulSoup:

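What you seem to want is the HTML content of the page without the surrounding navigation elements. As I have described elsewhere, there are at least two ways to get it:

In your case, the easiest way is probably to include the parameter action=render in the URL. This gives you only the content HTML and nothing else.

Alternatively, you can obtain the page content through the MediaWiki API.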
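As a rough sketch of the first option, the snippet below builds an action=render URL with the standard library and downloads the result. The helper names build_render_url and fetch_rendered_html, the index.php base URL, and the User-Agent value are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed entry point of the English Wikipedia; other wikis use a different host.
INDEX_PHP = "https://en.wikipedia.org/w/index.php"


def build_render_url(title, base=INDEX_PHP):
    """Return a URL that asks MediaWiki for the bare content HTML of a page."""
    return base + "?" + urlencode({"title": title, "action": "render"})


def fetch_rendered_html(title):
    """Download the content HTML, without navigation, sidebar, or footer."""
    # A descriptive User-Agent is good etiquette; this value is a placeholder.
    request = Request(build_render_url(title),
                      headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_rendered_html("Mathematics") should then return markup that can be handed straight to BeautifulSoup, with no need to filter out the page chrome first.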

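The advantage of using the API is that it can also give you other information about the page that you may find useful. For example, if you want the list of interlanguage links normally shown in the page sidebar, or the categories normally shown below the content area, you can get those from the API as well.

To get the page content in the same request, use prop=langlinks|categories|text.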
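To make the combined request concrete, here is a minimal sketch that builds an action=parse URL with prop=langlinks|categories|text and picks the three pieces out of the JSON reply. The function names and the User-Agent value are hypothetical, and the key layout assumes the API's default JSON format, in which titles and HTML sit under the "*" key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, props=("langlinks", "categories", "text")):
    """Build an action=parse request that returns several properties at once."""
    params = {
        "format": "json",
        "action": "parse",
        "prop": "|".join(props),  # the API expects a pipe-separated list
        "redirects": "true",
        "page": page,
    }
    return API + "?" + urlencode(params)


def summarize_parse_result(data):
    """Return (language codes, category names, content HTML) from the reply."""
    parsed = data["parse"]
    languages = [link["lang"] for link in parsed.get("langlinks", [])]
    categories = [cat["*"] for cat in parsed.get("categories", [])]
    html = parsed.get("text", {}).get("*", "")
    return languages, categories, html


def fetch_page_info(page):
    """Send the request and summarize the decoded JSON reply."""
    # The User-Agent value is a placeholder; use one that identifies your script.
    request = Request(build_parse_url(page),
                      headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as response:
        return summarize_parse_result(json.load(response))
```

A call such as fetch_page_info("Mathematics") returns the three values in one round trip, which avoids requesting the page once for its text and again for its metadata.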

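There are several libraries that can automate some of the basic details of using the API, although the feature sets they support vary. That said, using the API directly from your code, without a library in between, is also perfectly possible.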

For a cleaner approach wrapped in functions, you can use the JSON API that Wikipedia provides:

from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads


def getJSON(page):
    params = urlencode({
        'format': 'json',
        'action': 'parse',
        'prop': 'text',
        'redirects' : 'true',
        'page': page})
    API = "https://en.wikipedia.org/w/api.php"
    response = urlopen(API + "?" + params)
    return response.read().decode('utf-8')


def getRawPage(page):
    parsed = loads(getJSON(page))
    try:
        title = parsed['parse']['title']
        content = parsed['parse']['text']['*']
        return title, content
    except KeyError:
        # The page doesn't exist
        return None, None

title, content = getRawPage("Mathematics")

You can then parse it with any library to extract the content you need:
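As one possible illustration of that parsing step, the sketch below pulls the paragraph text out of an HTML string using only the standard library's html.parser. The ParagraphExtractor class and the extract_paragraphs helper are hypothetical names introduced here; BeautifulSoup or lxml would serve equally well.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element in an HTML fragment."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._chunks = None   # text pieces of the <p> being read, or None
        self._skip_depth = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # an unclosed <p> ends where the next begins
            self._chunks = []
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._chunks is not None and not self._skip_depth:
            self._chunks.append(data)

    def flush_paragraph(self):
        """Store the current paragraph if it contains any non-blank text."""
        if self._chunks is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.paragraphs.append(text)
        self._chunks = None


def extract_paragraphs(content):
    """Return the text of each non-empty paragraph, in document order."""
    parser = ParagraphExtractor()
    parser.feed(content)
    parser.close()
    parser.flush_paragraph()  # keep a trailing paragraph that was never closed
    return parser.paragraphs
```

With the content value returned by getRawPage("Mathematics"), extract_paragraphs(content) gives a list of strings whose first few entries correspond to the article's introduction.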

I use this: with "idx" I can decide which paragraph I want to read.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://de.wikipedia.org/wiki/Pferde")
soup = BeautifulSoup(res.text, 'html.parser')
# idx selects the paragraph to read: the loop stops at that index
for idx, item in enumerate(soup.find_all("p")):
    if idx == 1:
        break
print(item.text)

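"if response is not None" can be rewritten as "if response". Also, since the content may change in the future, I would suggest getting the whole div, reading only the p elements, and stopping when you reach the div with class toclimit-3.

@PinoSan I think there is nothing wrong with checking for None explicitly. An "is not None" check is not the same as a plain truthiness check. In this case, however, the None check is completely unnecessary, because response will always be a requests.models.Response object. If the request fails, an exception is raised.

@t.m.adam What you say is true, but as you said, response is not a string. So you only want to check that it is a valid object, rather than an empty string, None, or an empty dictionary... As for exceptions, I agree that we should catch exceptions for network errors, but we should also check that the status code is 200.

@PinoSan Sure, I also prefer the "if response" style, but you know. The problem with "if response" is that it can produce strange errors that are hard to debug. But yes, in most cases a simple boolean check is enough.

Just because you can scrape a page does not mean you should. There are Python packages for the Wikipedia API that give easy, direct access to articles without putting excess load on the site or creating extra work.

I would use wikipediaapi, since the wikipedia module does not seem to be maintained. Both will do the job in a similar way, though.

So the first paragraph of the page is the content of the article? I doubt it.

No, that is just the first paragraph. You can use idx == to decide which paragraph you want to read.

I know, but apart from the fact that it may change, pulling more or less random elements out of the document is not the best option.

@shaedrich Let me explain it again. The point is to read through the individual sections iteratively. For example, in an Alexa-style dialog: you ask what a horse is. In that case you receive a lot of text. In the conversation you can now say "read more", or skip the section. I hope you now understand how useful this kind of iteration is. Thanks.

Sorry, not really.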
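The first comment in the thread above suggests reading only the p elements of the content div and stopping at the table of contents. A minimal sketch of that idea with BeautifulSoup follows. The function name intro_paragraphs is hypothetical, and both the mw-parser-output content div and the toclimit-3 stop class are assumptions about Wikipedia's markup, which changes over time, so check them against the live page.

```python
from bs4 import BeautifulSoup


def intro_paragraphs(html, stop_class="toclimit-3"):
    """Return the text of each <p> that appears before the stop marker."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed content container; fall back to the whole document if absent.
    root = soup.select_one("div.mw-parser-output")
    if root is None:
        root = soup
    paragraphs = []
    # find_all walks the descendants in document order.
    for node in root.find_all(["p", "div"]):
        if node.name == "div":
            if stop_class in node.get("class", []):
                break  # reached the table of contents: stop reading
            continue
        text = node.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            paragraphs.append(text)
    return paragraphs
```

Compared with slicing a fixed number of paragraphs, stopping at a structural marker keeps working when the introduction grows or shrinks, although it still depends on the marker class staying in the page.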
