Python: reading multiple URLs from a text file and processing the web pages
The input to the script is a text file containing multiple URLs. The intended steps in the script are as follows:
- Read a URL from the text file
- Strip the URL so it can be used as the name of the output file (fname)
- Clean the content of the URL/web page with "clean_me"
- Write the content to the file (fname)
- Repeat for each URL in the input file
Here is the content of the input file urloutshort.txt:
Here is the script:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = url.replace('http://', ' ')
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")
Here is the error message:
python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
File "webpage_A.py", line 43, in <module>
with open (fname +'.txt', 'w') as outfile:
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'
The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and just hard-code one of the URLs, the script processes the web page and saves the result to a .txt file whose name is extracted from the URL. I have searched this topic but have not found a solution.

Any help on this issue would be greatly appreciated.

The problem is in the following code:
with open(fname + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
fname contains a trailing "\n", which is not valid in a file name to open. All you need to do is change it to this:
with open(fname.rstrip() + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
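rstrip() removes the trailing newline. If some URLs ever contain other characters that Windows forbids in file names (such as ?, *, or :), a broader sanitizer along these lines might help; this is just a sketch, and url_to_fname is a hypothetical helper, not part of the script above:

```python
import re

def url_to_fname(url):
    # Drop the scheme, then collapse any character Windows rejects
    # in file names (plus whitespace) into single spaces.
    name = re.sub(r'^https?://', '', url.strip())
    return re.sub(r'[\\/:*?"<>|\s]+', ' ', name).strip()

print(url_to_fname('http://example.com/some/page\n'))
# → example.com some page
```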
Here is the complete code with the fix included:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = url.replace('http://', '')
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
Hope this helps.
I made the suggested changes to the script. It processes the first URL as expected, but it does not process the subsequent URLs in urloutshort.txt. I changed the order of the URLs in the file, but that did not change the result; the first URL is processed and the subsequent ones are not.

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage.py
+ ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
  File "webpage.py", line 33, in <module>
    page = requests.get(url.strip())
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\sessions.py", line 494, in request
    prep = self.prepare_request(req)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\models.py", line 305, in prepare
    self.prepare_url(url, params)
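For what it's worth, this follow-up traceback ends inside requests' prepare_url, which is where requests raises an exception (e.g. MissingSchema) when a line is not a valid absolute URL, so a blank line or stray text between the URLs in urloutshort.txt would produce exactly this failure. A sketch of a guard that skips such lines; filter_urls is a hypothetical helper, not part of the script above:

```python
def filter_urls(lines):
    # Yield only lines that look like absolute HTTP(S) URLs;
    # blank lines and stray text are skipped instead of being
    # passed to requests.get(), which would raise in prepare_url.
    for line in lines:
        url = line.strip()
        if url.startswith(('http://', 'https://')):
            yield url

sample = ['http://example.com/a\n', '\n', 'notes\n', 'https://example.org/b\n']
print(list(filter_urls(sample)))
# → ['http://example.com/a', 'https://example.org/b']
```

In the script this would replace the bare `for url in filein:` loop with `for url in filter_urls(filein):`.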