Python: reading multiple URLs from a text file and processing the web pages
The input to the script is a text file containing multiple URLs. The intended steps in the script are as follows:
- Read a URL from the text file
- Strip the URL so it can be used as the name of the output file (fname)
- Clean the content of the URL/web page with "clean_me"
- Write the content to the file (fname)
- Repeat for each URL in the input file
Here is the content of the input file urloutshort.txt:
Here is the script:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = url.replace('http://', ' ')
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")
Here is the error message:
python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
File "webpage_A.py", line 43, in <module>
with open (fname +'.txt', 'w') as outfile:
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'
The problem seems to be related to reading the URLs from the text file, because if I bypass reading the input file and just hard-code one of the URLs, the script processes the web page and saves the result to a .txt file whose name is extracted from the URL. I have searched this topic but have not found a solution.

Any help on this issue would be greatly appreciated.

The problem is in the following code:
with open(fname + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
fname contains a trailing "\n", which is not valid in a file name to open. All you need to do is change it to this:
with open(fname.rstrip() + '.txt', 'w') as outfile:
    outfile.write(cln + "\n")
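rstrip() removes the trailing newline. If some URLs ever contain other characters that Windows forbids in file names (such as ?, *, or :), a broader sanitizer along these lines might help; this is just a sketch, and url_to_fname is a hypothetical helper, not part of the script above:

```python
import re

def url_to_fname(url):
    # Drop the scheme, then collapse any character Windows rejects
    # in file names (plus whitespace) into single spaces.
    name = re.sub(r'^https?://', '', url.strip())
    return re.sub(r'[\\/:*?"<>|\s]+', ' ', name).strip()

print(url_to_fname('http://example.com/some/page\n'))
# → example.com some page
```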
Here is the complete code with the fix included:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = url.replace('http://', '')
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
Hope this helps.
I made the suggested changes to the script. It processes the first URL as expected, but it does not process the subsequent URLs in urloutshort.txt. I changed the order of the URLs in the file, but that did not change the result; the first URL is processed and the subsequent ones are not.

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage.py
+ ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
  File "webpage.py", line 33, in <module>
    page = requests.get(url.strip())
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\sessions.py", line 494, in request
    prep = self.prepare_request(req)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks)
  File "C:\Users\rschafish\AppData\Local\Programs\Python35-32\lib\site-packages\requests\models.py", line 305, in prepare
    self.prepare_url(url, params)
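For what it's worth, this follow-up traceback ends inside requests' prepare_url, which is where requests raises an exception (e.g. MissingSchema) when a line is not a valid absolute URL, so a blank line or stray text between the URLs in urloutshort.txt would produce exactly this failure. A sketch of a guard that skips such lines; filter_urls is a hypothetical helper, not part of the script above:

```python
def filter_urls(lines):
    # Yield only lines that look like absolute HTTP(S) URLs;
    # blank lines and stray text are skipped instead of being
    # passed to requests.get(), which would raise in prepare_url.
    for line in lines:
        url = line.strip()
        if url.startswith(('http://', 'https://')):
            yield url

sample = ['http://example.com/a\n', '\n', 'notes\n', 'https://example.org/b\n']
print(list(filter_urls(sample)))
# → ['http://example.com/a', 'https://example.org/b']
```

In the script this would replace the bare `for url in filein:` loop with `for url in filter_urls(filein):`.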