Python: Extracting images from HTML with newspaper


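I can't download the article in the usual way to instantiate an Article object, as shown here:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()  # top_image is only populated after parsing
article.top_image
However, I can get the HTML with requests. Can I take this raw HTML and somehow pass it to newspaper to extract the image from it? (Below is an attempt, but it doesn't work.) Thanks.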


First, make sure you are using python3 and have already run pip3 install newspaper3k.

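Then, if the first version produces an SSL-related warning (shown below)

/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See:
  warnings.warn(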

you can suppress it by adding

import urllib3
urllib3.disable_warnings()
This should work:

from newspaper import Article
import urllib3
urllib3.disable_warnings()


url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)
Run the script with python3.
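Calling urllib3.disable_warnings() with no arguments silences every urllib3 HTTP warning. If only the unverified-HTTPS message should be hidden, a narrower filter can be set up with the standard warnings module. The following is a minimal sketch rather than part of the original answer: the helper names silence_unverified_https_warning and demo are made up for illustration, and the filter assumes the warning text starts with "Unverified HTTPS request", as it does in the message quoted above.

```python
import warnings


def silence_unverified_https_warning():
    """Ignore only warnings whose message starts with 'Unverified HTTPS request'."""
    # The message argument is a regular expression matched against the
    # beginning of the warning text.
    warnings.filterwarnings("ignore", message="Unverified HTTPS request")


def demo():
    """Emit two warnings and return the messages that were not filtered out."""
    # catch_warnings restores the previous filters on exit, so this demo
    # does not change the warning configuration of the calling program.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        silence_unverified_https_warning()
        warnings.warn("Unverified HTTPS request is being made to host 'example.invalid'.")
        warnings.warn("some other warning")
    return [str(w.message) for w in caught]


print(demo())
```

In a real script, silence_unverified_https_warning() would be called once before the first request is made.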


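Setting the HTML on the Article yourself won't help you much, because the other fields will then stay empty. Let me know whether this solves the problem or whether any other errors come up!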

The Python module newspaper allows the use of proxies, but this feature is not listed in the module's documentation.


Proxy with newspaper
Proxy with requests and newspaper
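The reason I can't download with newspaper is that I'm behind a corporate proxy. I have tried several ways of injecting the SSL certificate. At the moment the only way I can get through is to use verify=False with requests, which obviously needs to change. I can run newspaper's summary on the raw HTML, so my gut tells me I should also be able to get the image from the raw HTML.

Ah, that complicates things. Can you use fulltext? from newspaper import fulltext; html = requests.get(…).text; text = fulltext(html)

Yes, I can do that.

If you run the second piece of code, you should be able to test which Article functions work on the raw HTML and which don't. Another option might be to add a custom version of Article (see the last code block on …).

Why doesn't it work? Which error did you get?

I can't inject my company's internal SSL certificate key into my requests. I'm looking into that. The only workaround is to make the request manually and pass verify=False, which gives me the raw HTML.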
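For the certificate problem raised in the comments, requests accepts a path to a CA bundle as the verify argument and also honours the REQUESTS_CA_BUNDLE environment variable. The sketch below shows one way to pick that value. It is only an illustration under assumptions: the helper name tls_verify_setting and the file path in the usage comment are hypothetical, and whether a given corporate certificate works this way depends on the environment.

```python
import os


def tls_verify_setting(ca_bundle_path=None):
    """Return a value suitable for the verify argument of requests.get().

    Prefers an explicit CA bundle path, then the REQUESTS_CA_BUNDLE
    environment variable, and otherwise leaves normal verification on.
    """
    candidate = ca_bundle_path or os.environ.get("REQUESTS_CA_BUNDLE")
    if candidate and os.path.isfile(candidate):
        return candidate  # requests treats a string as a CA bundle path
    return True  # default: verify against the standard trust store


# Hypothetical usage (path is a placeholder, requests must be installed):
# response = requests.get(url, verify=tls_verify_setting("/path/to/corporate-ca.pem"))
```

Unlike verify=False, this keeps certificate checking enabled when no bundle is found, so a failure shows up as an SSL error instead of passing silently.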
from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number"
}

# pass the proxy settings to newspaper through its Configuration object
config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url, config=config)
article.download()
article.parse()
print(article.top_image)

Output:
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
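If the proxy is already configured through environment variables such as http_proxy and https_proxy, the dictionary does not have to be written by hand. The sketch below reads those variables with the standard library. It is an assumption that the corporate setup exposes the proxy this way, and proxies_from_environment is a made-up helper name, not part of newspaper.

```python
import os
import urllib.request


def proxies_from_environment():
    """Collect http/https proxy URLs that the standard library detects."""
    detected = urllib.request.getproxies()
    # Keep only the two schemes used in the PROXIES dictionary above.
    return {scheme: proxy_url
            for scheme, proxy_url in detected.items()
            if scheme in ("http", "https")}


print(proxies_from_environment())
```

The resulting dictionary has the same shape as PROXIES, so it could be assigned to config.proxies in the same way.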
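import requests
from newspaper import Article

# placeholder proxy settings; replace with your own
PROXIES = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number"
}

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
# verify=False skips certificate checks; treat it as a temporary workaround
raw_html = requests.get(url, verify=False, proxies=PROXIES)
article = Article('')  # no URL needed because the HTML is supplied directly
article.download(input_html=raw_html.content)
article.parse()
print(article.top_image)

Output:
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg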
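When article.top_image comes back empty, it can help to check whether the raw HTML declares a lead image at all. Many news pages do so in an og:image meta tag. The sketch below looks for that tag using only the standard library. It is not newspaper's implementation, just a rough cross-check; the class and function names and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Record the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Stop looking once an image URL has been found.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(raw_html):
    """Return the og:image URL from an HTML string, or None if absent."""
    parser = OpenGraphImageParser()
    parser.feed(raw_html)
    return parser.image_url


# Invented sample document; in practice pass the text fetched with requests.
sample_html = """<html><head>
<meta property="og:title" content="Example">
<meta property="og:image" content="https://example.invalid/lead.jpg">
</head><body><p>Text</p></body></html>"""

print(find_og_image(sample_html))
```

With the requests example above, the call would be find_og_image(raw_html.text).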