Python: How do I use multiprocessing correctly? Invalid URL 'h': No schema supplied

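I am trying to fetch information from a large number of links: first the team links (20), then the player links (550). I am trying to speed this up with multiprocessing, but I have no experience with it, and I get the following error when I run my code: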

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "scrape.py", line 50, in playerlinks
    squadPage = requests.get(teamLinks[i])
  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
"""

The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "scrape.py", line 94, in <module>
        records = p.map(playerlinks, team)
      File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
        raise self._value
    requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
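By calling p.map(playerlinks, team), Python applies the function playerlinks to each element of team, one element per call.

However, according to your function definition, playerlinks is designed to operate on the whole list at once. Do you see the problem?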

This is what your team variable contains -

['http://www.premierleague.com//clubs/1/Arsenal/squad',
 'http://www.premierleague.com//clubs/2/Aston-Villa/squad',
 'http://www.premierleague.com//clubs/127/Bournemouth/squad',
 'http://www.premierleague.com//clubs/131/Brighton-and-Hove-Albion/squad',
 'http://www.premierleague.com//clubs/43/Burnley/squad',
 'http://www.premierleague.com//clubs/4/Chelsea/squad',
 'http://www.premierleague.com//clubs/6/Crystal-Palace/squad',
 'http://www.premierleague.com//clubs/7/Everton/squad',
 'http://www.premierleague.com//clubs/26/Leicester-City/squad',
 'http://www.premierleague.com//clubs/10/Liverpool/squad',
 'http://www.premierleague.com//clubs/11/Manchester-City/squad',
 'http://www.premierleague.com//clubs/12/Manchester-United/squad',
 'http://www.premierleague.com//clubs/23/Newcastle-United/squad',
 'http://www.premierleague.com//clubs/14/Norwich-City/squad',
 'http://www.premierleague.com//clubs/18/Sheffield-United/squad',
 'http://www.premierleague.com//clubs/20/Southampton/squad',
 'http://www.premierleague.com//clubs/21/Tottenham-Hotspur/squad',
 'http://www.premierleague.com//clubs/33/Watford/squad',
 'http://www.premierleague.com//clubs/25/West-Ham-United/squad',
 'http://www.premierleague.com//clubs/38/Wolverhampton-Wanderers/squad'] 
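The multiprocessing library will try to schedule

playerlinks('http://www.premierleague.com//clubs/1/Arsenal/squad')
playerlinks('http://www.premierleague.com//clubs/2/Aston-Villa/squad')....

across n cores.

Each call receives a single URL string rather than a list, so teamLinks[i] inside playerlinks picks out one character of that string ('h', the first character of 'http://...'). Passing that character to requests.get is what raises the error.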
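To see where the 'h' comes from, here is a minimal sketch using the built-in map, which hands the function one element at a time in the same way Pool.map does. The name first_item and the example.com URLs are made up for illustration and are not part of the original code.

```python
def first_item(team_links):
    # Mimics the indexing in the question: team_links[i] with i == 0.
    return team_links[0]


# Hypothetical URLs, used only for this demonstration.
urls = ["http://example.com/a", "http://example.com/b"]

# Called once with the whole list: indexing returns a full URL.
whole_list_result = first_item(urls)

# Called through map: each call gets one string, so indexing returns
# a single character of that string.
per_element_result = list(map(first_item, urls))

print(whole_list_result)
print(per_element_result)
```

The second result is a list of single characters, which matches the URL that requests complains about in the traceback.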

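Modify the playerlinks function so that it operates on a single element of the team variable, and you will see this problem go away.

Try something like this -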

import requests
from lxml import html  # assumed from the html.fromstring / cssselect calls below

def playerlinks_atomic(teamLinks):
    # teamLinks is a single team URL here, not a list.
    # The result lists are created locally so each call is self-contained.
    playerLink1 = []
    playerLink2 = []

    squadPage = requests.get(teamLinks)
    squadTree = html.fromstring(squadPage.content)

    #...Extract the player links...
    playerLocation = squadTree.cssselect('.playerOverviewCard')

    #...For each player link within the team page...
    for i in range(len(playerLocation)):

        #...Save the link, complete with domain...
        playerLink1.append("http://www.premierleague.com/" + playerLocation[i].attrib['href'])

        #...For the second link, change the page from player overview to stats
        playerLink2.append(playerLink1[i].replace("overview", "stats"))
    return playerLink1, playerLink2
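Once the function works on one URL, map returns one result per team, so the per-team lists need to be combined afterwards. The following is only a sketch under some assumptions: links_for_team is a hypothetical stand-in for playerlinks_atomic that does no network access, the example.com URLs are invented, and ThreadPool is used because it exposes the same map interface as multiprocessing.Pool.

```python
from multiprocessing.pool import ThreadPool


def links_for_team(team_url):
    # Hypothetical stand-in for playerlinks_atomic: it makes up two
    # player URLs per team instead of scraping a real page.
    overview = [f"{team_url}/player{n}/overview" for n in (1, 2)]
    stats = [link.replace("overview", "stats") for link in overview]
    # Returning team_url keeps every result tied to the input that produced it.
    return team_url, overview, stats


# Hypothetical team URLs, used only for this demonstration.
team = ["http://example.com/clubs/1", "http://example.com/clubs/2"]

with ThreadPool(2) as pool:
    # One call per element of team; results come back in input order.
    records = pool.map(links_for_team, team)

# Flatten the per-team lists into two flat lists of links.
all_overview = [link for _, overview, _ in records for link in overview]
all_stats = [link for _, _, stats in records for link in stats]
```

Each entry of records is a (team_url, overview_links, stats_links) tuple, so it stays clear which team each player link came from.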

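What does an example of the href attribute look like?

I sort of understand what you mean. Does this mean I have to rewrite the function so that it handles only one URL before I can run it with multiprocessing?

The for loop implies no. You may also want to return the teamLinks argument, because from the returned data structure alone you cannot tell which input produced which player links.

@MisterButter - For network-bound operations (like the one you posted), you are better off using multithreading. Also, explicitly set the number of cores/threads to use in p=Pool(#cores/threads).

By default, a Pool uses as many processes as the CPU has cores. That is a reasonable default for the kind of work Pool is meant for. Having more processes in the pool makes the workers compete for CPU resources; having fewer means you are not using all the available CPU resources.

@RolandSmith - For I/O-bound and network-bound tasks, it may make sense to create considerably more pool processes than cores, since those processes will probably spend most of their time blocked. Is my understanding correct?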
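The thread-based approach suggested in the comments could look roughly like the sketch below, which sets the worker count explicitly. It is an illustration only: fake_fetch is a hypothetical placeholder that sleeps instead of calling requests.get, the example.com URLs are invented, and the worker count of 8 is an arbitrary value, not a recommendation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    # Hypothetical stand-in for a network request: the sleep simulates
    # time spent blocked waiting for a response.
    time.sleep(0.2)
    return len(url)


# Hypothetical URLs, used only for this demonstration.
urls = [f"http://example.com/page/{n}" for n in range(8)]

start = time.perf_counter()
# The number of worker threads is set explicitly; 8 is arbitrary.
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map yields results in the same order as the inputs.
    sizes = list(executor.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(f"fetched {len(sizes)} pages in {elapsed:.2f}s")
```

Because the workers spend their time waiting rather than computing, the calls overlap, which is why the worker count for this kind of task can reasonably exceed the number of cores.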