Python: No schema supplied when looping through a URL list read from a file

Tags: python, beautifulsoup, python-requests

I'm working on a web scraping project with BeautifulSoup, and in one step I need to compile a list of links from another list of links that has been saved to a file. The loop seems to run fine until it reaches the last line of the file, at which point it throws the error
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
The full code and traceback are below.
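For reference, requests raises MissingSchema whenever the URL string lacks an http:// or https:// scheme, so a single stray character reproduces the error exactly:

import requests

requests.get('h')  # raises requests.exceptions.MissingSchema: Invalid URL 'h'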

Could this be related to Python reading each line of my .txt file as a list? I also tried using just one for loop:

for link in season_links:
    response_loop = requests.get(link[0]) 
but it didn't resolve the error.
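For context, str.split() always returns a list, even when the line holds a single URL, which is why each row reads back as a one-item list and link[0] works:

line = 'https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC\n'
print(line.strip().split())  # ['https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC']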

Here is my code:

Contents of file:
https://rugby.statbunker.com/competitions/LastMatches?comp_id=98&limit=10&offs=UTC
https://rugby.statbunker.com/competitions/LastMatches?comp_id=99&limit=10&offs=UTC

import codecs
import time

import requests
from bs4 import BeautifulSoup

# for reading season links from file
season_links = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links)

# handling for pagination within each season
for link in season_links:
    t0 = time.time()
    for item in link: # for some reason it reads each row in my .txt as a list, so I have to loop over it again
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')

        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
        response_delay = time.time() - t0
        print('Loop duration: ' + str(response_delay))
        time.sleep(4*response_delay)
        print('Sleep: ' + str(response_delay*4) + '\n')
Traceback:

Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=1&limit=10&offs=UTC
Loop duration: 2.961906909942627
Sleep: 11.847627639770508

Season link: https://rugby.statbunker.com/competitions/LastMatches?comp_id=103&limit=10&offs=UTC
Loop duration: 1.6234941482543945
Sleep: 6.493976593017578

Traceback (most recent call last):
  File "/Users/claycrosby/Desktop/coding/projects/gambling/scraper/sb_compile_games.py", line 103, in <module>
    response_loop = requests.get(item)
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/sessions.py", line 449, in prepare_request
    p.prepare(
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "/opt/miniconda3/envs/ds383/lib/python3.8/site-packages/requests/models.py", line 388, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
[Finished in 23.3s with exit code 1]

Edit: I tried printing each item and found that the third item was h. There are no blank lines or stray h characters in my file; the problem stems from appending to the original list from inside the loop that iterates over it. The appended hrefs are plain strings rather than one-item lists, so once the outer loop reaches them, the inner for item in link iterates the string character by character, and requests.get receives 'h'. I switched to a separate list and everything processed without errors.
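A minimal sketch of the pitfall (hypothetical URLs, no network access) shows how an appended string ends up iterated character by character:

# each line read from the file becomes a one-item list
links = [['https://example.com/a']]
for link in links:
    for item in link:
        print(item)  # first pass prints the URL; second pass prints 'h', 't', 't', 'p', ...
    links.append('https://example.com/b')  # appending a plain string while iterating
    if len(links) > 2:  # stop the demo before the list grows further
        break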

# for reading season links from file
season_links_unpag = []
season_links_file = codecs.open('season_links_unpag_tst2.txt', 'r')
for line in season_links_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    season_links_unpag.append(line_list)
season_links_file.close()
print('Season links file read complete' + '\n')
print(season_links_unpag)

# handling for pagination within each season
season_links = []
for link in season_links_unpag:
    t0 = time.time()
    for item in link:
        print(item)
        response_loop = requests.get(item)
        html_loop = response_loop.content
        soup_loop = BeautifulSoup(html_loop, 'html.parser')

        for p in soup_loop.find_all('p', text='›'):
            season_links.append(p.find_parent('a').get('href'))
        print('Season link: ' + item)
        response_delay = time.time() - t0
        print('Loop duration: ' + str(response_delay))
        time.sleep(4*response_delay)
        print('Sleep: ' + str(response_delay*4) + '\n')
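
As a side note, a sketch of a further simplification, assuming one URL per line in the same file: reading each stripped line as a plain string removes the need for split() and for the nested loop entirely.

import requests
from bs4 import BeautifulSoup

# read one URL per line as a plain string
with open('season_links_unpag_tst2.txt') as f:
    season_links_unpag = [line.strip() for line in f if line.strip()]

season_links = []
for url in season_links_unpag:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # collect the '›' pagination links into a separate list
    for p in soup.find_all('p', text='›'):
        season_links.append(p.find_parent('a').get('href'))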