Warning: file_get_contents(/data/phpspider/zhask/data//catemap/2/python/279.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 从文件中打开链接时,Beauty soup无法从页面中提取HTML_Python_Html_Web Scraping_Beautifulsoup_Web Crawler - Fatal编程技术网

Python 从文件中打开链接时,Beauty soup无法从页面中提取HTML

Python 从文件中打开链接时,Beauty soup无法从页面中提取HTML,python,html,web-scraping,beautifulsoup,web-crawler,Python,Html,Web Scraping,Beautifulsoup,Web Crawler,我在一个名为article_links.txt的文件中有一些web链接,我想逐个打开,提取它们的文本,然后打印出来。我的代码是: import requests from inscriptis import get_text from bs4 import BeautifulSoup links = open(r'C:\Users\h473\Documents\Crawling\article_links.txt', "r") for a in links: print(a)

我在一个名为
article_links.txt
的文件中有一些web链接,我想逐个打开,提取它们的文本,然后打印出来。我的代码是:

import requests
from inscriptis import get_text
from bs4 import BeautifulSoup

links = open(r'C:\Users\h473\Documents\Crawling\article_links.txt', "r")

for a in links:
    print(a)
    page = requests.get(a)
    soup = BeautifulSoup(page.text, 'lxml')
    html = soup.find(class_='article-wrap')
    if html==None:
        html = soup.find(class_='mag-article-wrap')

    text = get_text(html.text)

    print(text)
但是我得到一个错误,说,
-->text=get_text(html.text)

AttributeError:“非类型”对象没有属性“文本”

因此,当我打印出
soup
变量以查看ts内容是什么时。以下是我对每个链接的发现:

http://www3.asiainsurancereview.com//Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>
而且它工作得很好!因此,我尝试以列表/数组形式提供链接,并尝试从每个链接中提取文本:

import requests
from inscriptis import get_text
from bs4 import BeautifulSoup

links = ['http://www3.asiainsurancereview.com//Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law',
'http://www3.asiainsurancereview.com//Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector',
'http://www3.asiainsurancereview.com//Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018',
'http://www3.asiainsurancereview.com//Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age',
'http://www3.asiainsurancereview.com//Magazine/ReadMagazineArticle/aid/40847/Creating-a-growth-environment-for-health-insurance-in-Asia']

#open(r'C:\Users\h473\Documents\Crawling\article_links.txt', "r")

for a in links:
    print(a)
    page = requests.get(a)
    soup = BeautifulSoup(page.text, 'lxml')
    html = soup.find(class_='article-wrap')
    if html==None:
        html = soup.find(class_='mag-article-wrap')

    text = get_text(html.text)

    print(text)

这也非常有效!那么,从文本文件中提取链接时出现了什么问题?如何修复它?

我不知道你的文件里有什么。但在我看来,您的文件中可能有一个新的空行导致
NoneType
object

问题在于您的URL无效,因为它们都以一个新行结尾。你可以看到类似这样的事情:

>>> page = requests.get('http://www3.asiainsurancereview.com//Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law\n')
>>> page
<Response [400]>
>>> page.text
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request - Invalid URL</h2>
<hr><p>HTTP Error 400. The request URL is invalid.</p>
</BODY></HTML>
尝试:


这是如何解决问题的?你甚至没有正确格式化代码。啊!谢谢你的解释。
>>> page = requests.get('http://www3.asiainsurancereview.com//Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law\n')
>>> page
<Response [400]>
>>> page.text
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request - Invalid URL</h2>
<hr><p>HTTP Error 400. The request URL is invalid.</p>
</BODY></HTML>
for a in links:
    a = a.rstrip()
    # rest of your code
with f open("sample.txt"):
    for line in f:
        print(line)