Web scraping with Python: Beautiful Soup and Selenium not working

python selenium web-scraping


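I'm working on a Python exercise that asks me to scrape the top headlines from the Google News site and print them to the console. At first I used only the Beautiful Soup library to retrieve the news. Here is my code: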

import bs4
from bs4 import BeautifulSoup
import urllib.request

news_url = "https://news.google.com/news/rss";
URLObject = urllib.request.urlopen(news_url);
xml_page = URLObject.read();
URLObject.close();

soup_page = BeautifulSoup(xml_page,"html.parser");
news_list = soup_page.findAll("item");

for news in news_list:
  print(news.title.text);
  print(news.link.text);
  print(news.pubDate.text);
  print("-"*60);

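But it does not print the link or the publication date, and it keeps throwing errors. After some research I found answers here on Stack Overflow saying that, because the site uses JavaScript, I should use the Selenium package in addition to Beautiful Soup. Although I don't really understand how Selenium works, I updated the code as follows: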

from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request

driver = webdriver.Chrome("C:/Users/mauricio/Downloads/chromedriver");
driver.maximize_window();
driver.get("https://news.google.com/news/rss");
content = driver.page_source.encode("utf-8").strip();
soup = BeautifulSoup(content, "html.parser");
news_list = soup.findAll("item");

print(news_list);

for news in news_list:
  print(news.title.text);
  print(news.link.text);
  print(news.pubDate.text);
  print("-"*60);

However, when I run it, a blank browser window opens and this is printed to the console:

 raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
  (Driver info: chromedriver=2.38.551601 (edb21f07fc70e9027c746edd3201443e011a61ed),platform=Windows NT 6.3.9600 x86_64)

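I just tried it, and the following code works for me.

The items= line is ugly, apologies in advance, but it works for now.

Edit: I just updated the snippet. You can use ElementTree.iter('tag') to iterate over all nodes with that tag:
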
import urllib.request
import xml.etree.ElementTree

news_url = "https://news.google.com/news/rss"
with urllib.request.urlopen(news_url) as page:
    xml_page = page.read()

# Parse XML page
e = xml.etree.ElementTree.fromstring(xml_page)

# Get the item list
for it in e.iter('item'):
    print(it.find('title').text)
    print(it.find('link').text)
    print(it.find('pubDate').text, '\n')
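
To try the same iter/find pattern without hitting the network, here is a minimal offline sketch. The SAMPLE_RSS string and the parse_items helper are made up for illustration and are not part of the answer above; the real feed is assumed to follow the same rss/channel/item layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like an RSS 2.0 feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, with its title, link and pubDate text."""
    root = ET.fromstring(xml_text)
    return [
        {
            # findtext() returns None if the child element is missing,
            # instead of raising AttributeError like .find(...).text would.
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        # iter("item") walks the whole tree, so the nesting depth of
        # <item> under <channel> does not matter.
        for item in root.iter("item")
    ]


headlines = parse_items(SAMPLE_RSS)
for entry in headlines:
    print(entry["title"])
    print(entry["link"])
    print(entry["pubDate"], "\n")
```

Because the lookups happen inside each item element, the channel-level title and link are not mixed into the results.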


EDIT2: A note on my personal preferences among scraping libraries

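Personally, for interactive or dynamic pages where I have to do things (click here, fill in a form, get the results, ...), I use selenium. I usually don't need bs4 there, since you can use selenium directly to find and parse the specific nodes of the page you are looking for.

I use bs4 together with requests (instead of urllib.request) to parse more static web pages in projects where I don't want a whole webdriver installed.

There is nothing wrong with using urllib.request, but requests (see here) is, in my opinion, one of the best Python packages out there and a great example of how to create a simple yet powerful API.
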
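I believe the link you have (the one with /rss) points to an XML file, so no JavaScript is involved there.

So how do I get both "news.link.text" and "news.pubDate.text" to appear in my output? When I print them with Beautiful Soup, "news.title.text" prints normally, the link prints only a newline, and the pub date raises an exception because it returns NoneType and I call ".text" on it.

This works for me now. I had not heard of "xml.etree.ElementTree" before. Is it a more reliable way of web scraping than Beautiful Soup alone, or Beautiful Soup + Selenium?

ElementTree (or cElementTree for Python 2) is usually a bit better at parsing XML than any other (Python) option. See the brief comparison of XML parsing in Python.

@MauriceFigueiredo I added a second edit with a short discussion of my personal preferences among scraping libraries.

That's perfect, WillMonge. Thank you.

Another option is to combine the two libraries imported below:
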
from bs4 import BeautifulSoup
import requests

r = requests.get('https://news.google.com/news/rss')
soup = BeautifulSoup(r.text, 'xml')
news_list = soup.find_all('item')

# do whatever you need with news_list
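
As for why the html.parser attempt in the question printed an empty link and failed on pubDate, the sketch below reproduces the symptoms on a made-up feed fragment (SAMPLE_RSS is hypothetical). As far as I understand Beautiful Soup's behaviour, html.parser treats link as an empty HTML element, so the URL ends up as the text node right after the tag, and it lowercases tag names, so pubDate has to be looked up as pubdate.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for the real feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item>
<title>Example headline</title>
<link>https://example.com/story</link>
<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
</item>
</channel></rss>"""

# Parse the XML with the HTML parser, as the question's code does.
soup = BeautifulSoup(SAMPLE_RSS, "html.parser")
item = soup.find("item")

title = item.title.text

# <link> is a void element in HTML, so the tag itself holds no text...
link_text = item.link.text
# ...and the URL is left behind as the sibling node after the tag.
link_url = str(item.link.next_sibling).strip()

# Tag names are lowercased, so the camelCase lookup finds nothing.
pub_camel = item.find("pubDate")
pub_lower = item.find("pubdate").text

print(repr(title))
print(repr(link_text), repr(link_url))
print(pub_camel, repr(pub_lower))
```

Passing 'xml' as the parser, as in the requests snippet above, avoids both quirks because the document is then handled as XML rather than HTML. Note that the 'xml' parser requires the lxml package to be installed.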