Python: BeautifulSoup and large HTML

I am trying to scrape some large Wikipedia pages. Unfortunately, BeautifulSoup does not seem to cope with content this large: it truncates the page.

I found a way around this that still uses BeautifulSoup, since I find it easier to work with than lxml.

You only need to install html5lib:

pip install html5lib

and pass it to BeautifulSoup as the parser argument:

soup = BeautifulSoup(htmlContent, 'html5lib')
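Which parsers are installed differs from machine to machine, so it can help to name the parser explicitly and fall back in a known order. The following is only a minimal sketch under that assumption: `make_soup` and the inline `sample` markup are hypothetical names made up for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("html5lib", "lxml", "html.parser")):
    """Parse `html` with the first parser in `parsers` that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # this parser is not installed here, try the next one
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Hypothetical stand-in for a downloaded page, with an element at the bottom.
sample = (
    "<html><body><p>top of page</p>"
    "<div id='catlinks'><a href='#'>Category</a></div>"
    "</body></html>"
)

soup, used = make_soup(sample)
link_texts = [a.text for a in soup.find("div", id="catlinks").find_all("a")]
print(used, link_texts)
```

Printing the parser name alongside the result makes it easier to compare runs on two machines that appear to behave differently.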

If you prefer, you can also use lxml, as follows:

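import requests
import lxml.html

# lxml cannot fetch https:// URLs by itself, so download the page first
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
doc = lxml.html.fromstring(r.content)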

I suggest fetching the HTML content yourself and then passing it to BS:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    # no parser is named here, so BeautifulSoup uses whichever one is installed
    soup = BeautifulSoup(r.content)
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print(a.text)
else:
    print(r.status_code)

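I did exactly the same thing, but without the 'html5lib' argument there is no id='catlinks' in the content!

In any case, it is there in my browser (…). Look at the page source in Firefox/Chrome, pick an element near the bottom of the page, and try to find/extract it with BS.

I did exactly that. Do you think it could be related to the Python version (2.7), the operating system (Win8), ...? I don't think it has anything to do with the browser, because I checked the actual HTML produced by BeautifulSoup.prettify and it is incomplete.

I can confirm that the code I wrote above does not work on Windows 7, but does work on Linux with Python 2.7. Strange.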
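The check suggested in these comments (find an element near the bottom of the page and see whether BS still has it) can be sketched roughly as below. This is only an illustration under stated assumptions: `locate_truncation` is a hypothetical helper, the byte search for `id="..."` is deliberately naive, and the inline byte strings stand in for a real download.

```python
from bs4 import BeautifulSoup


def locate_truncation(raw_html, element_id, parser="html.parser"):
    """Say where an element near the end of the page goes missing.

    Returns "download" if the raw bytes never contained the element,
    "parser" if the bytes had it but the parse tree does not, else "ok".
    """
    # Naive assumption: the attribute is written as id="..." with double quotes.
    needle = ('id="%s"' % element_id).encode("utf-8")
    in_raw = needle in raw_html

    soup = BeautifulSoup(raw_html, parser)
    in_tree = soup.find(id=element_id) is not None

    if not in_raw:
        return "download"
    if not in_tree:
        return "parser"
    return "ok"


# Hypothetical stand-ins for r.content from a real request.
complete_page = b'<html><body><p>top</p><div id="catlinks"><a>x</a></div></body></html>'
cut_off_page = b'<html><body><p>top</p><div'

print(locate_truncation(complete_page, "catlinks"))
print(locate_truncation(cut_off_page, "catlinks"))
```

Running the same check with different `parser` values on the same bytes shows whether the page is lost while downloading or while parsing.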