Python: How to use multiple URLs from a txt file with BeautifulSoup


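I'm a beginner. My code runs successfully when the .txt file contains only one URL, but it throws an error as soon as I add more URLs. I've tried several approaches I found on this site, but none of them worked. Any help would be appreciated.

My main goal is for it to process the first URL and, once that is finished, move on to the second URL, and so on through the list.

Here is what I have now: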

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

ua = UserAgent()
header = {'user-agent':ua.random}

with open('urls.txt','r') as file:
    for url in file.readlines():
        result = requests.get(url,headers=header,timeout=3)
        src = result.content
        soup = BeautifulSoup(src, 'lxml')

You need to loop over them. This code assumes the file has one URL per line:

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

ua = UserAgent()
header = {'user-agent': ua.random}

with open('urls.txt', 'r') as file:
    for line in file:
        url = line.strip()  # each line read from a file keeps its trailing newline
        if not url:
            continue  # skip blank lines
        result = requests.get(url, headers=header, timeout=3)
        src = result.content
        soup = BeautifulSoup(src, 'lxml')
        # work with soup here, inside the loop, so it runs once per URL
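One likely reason the script only works with a single URL: every line that comes out of a file still has its trailing newline, so that newline is passed along to requests.get as part of the URL. The last line of a file often has no newline, which would explain why a one-URL file behaves. Here is a rough sketch of a helper that cleans the lines first. The name read_urls is made up for illustration, and it assumes one URL per line as above.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, with surrounding whitespace removed."""
    urls = []
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            url = line.strip()  # removes the trailing "\n" and any stray spaces
            if url:             # ignore blank lines
                urls.append(url)
    return urls
```

You could then write "for url in read_urls('urls.txt'):" and make the request inside that loop.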

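There's too much going on in the code, and I'm not sure what the actual problem is. Can you read urls.txt? If so, what does it contain?

As a starting point, try splitting the code into functions.

For example: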

import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

def getReadMe():
    # return the entire contents of urls.txt as a single string
    with open('urls.txt','r') as file:
        return file.read()

def getHtml(readMe):
    # request the given URL with a random user agent and return the raw body
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() # throw error for 4xx & 5xx
    return response.content

readMe = getReadMe()
print(readMe) #TODO: does this output text? If so what is it?
html = getHtml(readMe)
soup = BeautifulSoup(html, 'lxml')  # parse the html returned above
# TODO: what is in the response html?
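When checking what the file contains, a plain print can hide the characters that matter, because a newline just looks like a line break. A small sketch that prints the repr() of each line instead, so you can see exactly what would be sent as the URL. The helper name show_lines and the sample text are placeholders, not part of the original code.

```python
def show_lines(text):
    """Return repr() of each line so hidden characters such as newlines are visible."""
    return [repr(line) for line in text.splitlines(keepends=True)]

# Stand-in for the contents of urls.txt
sample = "https://example.com/a\nhttps://example.com/b\n"
for shown in show_lines(sample):
    print(shown)
```

If the output shows more than one line, or a line ending in \n, then the string returned by getReadMe() is not a single clean URL and needs to be split and stripped before it is requested.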

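I just tried this. I no longer get an error, but now it only reads the last line of my .txt file.

readlines() will iterate over every line in the file. Print the URLs instead of requesting them and see what you get. There may be something wrong with your URLs.

It prints both URLs I put in the .txt file, but it only runs the scrape for the second URL... weird.

No, it isn't weird. You are probably storing the response to the first request in a variable and then overwriting that variable with the response to the second request. What you are seeing is the second response.

As I said: you leave the for loop, so the only data in the soup variable comes from the last iteration of the loop. Indent everything below the line where soup is assigned so that it sits inside the for loop.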
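To make that last point concrete, here is a minimal sketch of doing the work inside the loop and keeping one result per URL. The function scrape_all and its fetch and parse parameters are hypothetical names. They are passed in so the loop can be tried without touching the network.

```python
def scrape_all(urls, fetch, parse):
    """Fetch and parse every URL, keeping one result per URL."""
    results = {}
    for url in urls:
        html = fetch(url)           # rebound on every iteration
        results[url] = parse(html)  # used here, before the next fetch replaces it
    return results

# With the libraries from the question, the two callables could be, for example:
#   fetch = lambda url: requests.get(url, headers=header, timeout=3).content
#   parse = lambda html: BeautifulSoup(html, 'lxml')
```

Anything that only runs after the loop sees just the last html, which is the behaviour described above. Anything indented under the for line runs once for each URL.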