Python: "No schema supplied" when looping through a list of URLs read from a file
I'm working on a web scraping project with BeautifulSoup. In one step I need to compile a list of links from another list of links that has been saved to a file. The loop seems to run fine until it reaches the last line of the file, at which point it throws the error:

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

The full code and traceback are below.

Could this have to do with Python reading each line of my .txt file as a list? I also tried using just one for loop:
for link in season_links:
    response_loop = requests.get(link[0])
but it didn't resolve the error.

Here is my code:
Contents of file:
https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC
https://rugby.statbunker.com/competitions/LastMatches?comp_id=99&limit=10&offs=UTC
import codecs
import time

import requests
from bs4 import BeautifulSoup

# for reading season links from file
season_links = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links)
# handling for pagination within each season
for link in season_links:
    t0 = time.time()
    for item in link:  # for some reason it reads each row in my .txt as a list, so I have to loop over it again
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')
        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
    response_delay = time.time() - t0
    print('Loop duration: ' + str(response_delay))
    time.sleep(4*response_delay)
    print('Sleep: ' + str(response_delay*4) + '\n')
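Incidentally, the reason each row of the .txt reads back as a list is the call to str.split(): it always returns a list, even when the line holds a single token. A minimal sketch of the read loop above, using an in-memory stand-in for the file:

```python
# Stand-in for the lines read from season_links_unpag_tst2.txt.
lines = [
    "https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC\n",
    "https://rugby.statbunker.com/competitions/LastMatches?comp_id=99&limit=10&offs=UTC\n",
]

season_links = []
for line in lines:
    # str.split() returns a list even for a single token, which is why
    # every entry of season_links is a one-element list, not a string.
    season_links.append(line.strip().split())

print(season_links[0])     # a one-element list
print(season_links[0][0])  # the plain URL string
```

Dropping the split() entirely (the line is already one URL) would leave plain strings in the list.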
Traceback:
Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=1&limit=10&offs=UTC
Loop duration: 2.961906909942627
Sleep: 11.847627639770508
Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=103&limit=10&offs=UTC
Loop duration: 1.6234941482543945
Sleep: 6.493976593017578
Traceback (most recent call last):
File "/Users/claycrosby/Desktop/coding/projects/gambling/scraper/sb_compile_games.py", line 103, in <module>
response_loop = requests.get(item)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 516, in request
prep = self.prepare_request(req)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 449, in prepare_request
p.prepare(
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 314, in prepare
self.prepare_url(url, params)
File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 388, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
[Finished in 23.3s with exit code 1]
EDIT: I tried printing each item and found that the third item was h. There are no blank spaces or stray h characters in my file. The problem stemmed from appending to the original list from inside the loop that iterates over it: the appended hrefs are plain strings, so once the outer loop reached them, the inner loop iterated over their characters, starting with 'h'. Appending to a separate list instead processed without errors:
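The failure mode can be reproduced in isolation (the URLs below are placeholders, not from the real site): appending plain strings to the list being iterated means the outer loop eventually reaches them, and the inner loop then walks their characters.

```python
# One seed entry, shaped like the file-read output: a one-element list.
season_links = [["https://example.com/season1"]]  # hypothetical URL
visited = []
for link in season_links:
    for item in link:
        visited.append(item)
    if len(season_links) == 1:
        # simulate appending a scraped href: a string, not a list
        season_links.append("https://example.com/season2")

print(visited[0])  # https://example.com/season1 (the full URL)
print(visited[1])  # h -- the first *character* of the appended string
```

That stray 'h' is exactly what requests.get() received, hence MissingSchema: Invalid URL 'h'.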
# for reading season links from file
season_links_unpag = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links_unpag.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links_unpag)

# handling for pagination within each season
season_links = []
for link in season_links_unpag:
    t0 = time.time()
    for item in link:
        print(item)
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')
        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
    response_delay = time.time() - t0
    print('Loop duration: ' + str(response_delay))
    time.sleep(4*response_delay)
    print('Sleep: ' + str(response_delay*4) + '\n')
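For what it's worth, both issues go away if the file is read without split() and discovered links are collected in a separate list from the start. A minimal sketch, with io.StringIO standing in for the real file:

```python
import io

# Stand-in for season_links_unpag_tst2.txt: one URL per line.
fake_file = io.StringIO(
    "https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC\n"
    "https://rugby.statbunker.com/competitions/LastMatches?comp_id=99&limit=10&offs=UTC\n"
)

# strip() but no split(): each entry stays a plain string, so
# requests.get(url) can be called directly with no inner loop.
seed_links = [line.strip() for line in fake_file if line.strip()]

# Newly discovered pagination hrefs go into a separate list, so the
# list being iterated is never mutated mid-loop.
pagination_links = []
for url in seed_links:
    # response = requests.get(url)  # network call omitted in this sketch
    # ... parse response and pagination_links.append(href) ...
    pass

print(len(seed_links))  # 2
```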