Python: How do I do multiprocessing correctly? Invalid URL 'h': No schema supplied

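I am trying to scrape information from a large number of links: first the team links (20), then the player links (550). I am trying to speed this up with multiprocessing, but I have no experience with it, and I get the following error when I run my code: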

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "scrape.py", line 50, in playerlinks
    squadPage = requests.get(teamLinks[i])
  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "/anaconda3/lib/python3.6/site-packages/requests/models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
"""

The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "scrape.py", line 94, in <module>
        records = p.map(playerlinks, team)
      File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
        raise self._value
    requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

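By calling p.map(playerlinks, team), you ask Python to apply the function playerlinks to each element of team, one element per call. However, judging by your function definition, playerlinks is designed to operate on the entire list at once. Do you see the problem?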

This is what your team variable contains:

['http://www.premierleague.com//clubs/1/Arsenal/squad',
 'http://www.premierleague.com//clubs/2/Aston-Villa/squad',
 'http://www.premierleague.com//clubs/127/Bournemouth/squad',
 'http://www.premierleague.com//clubs/131/Brighton-and-Hove-Albion/squad',
 'http://www.premierleague.com//clubs/43/Burnley/squad',
 'http://www.premierleague.com//clubs/4/Chelsea/squad',
 'http://www.premierleague.com//clubs/6/Crystal-Palace/squad',
 'http://www.premierleague.com//clubs/7/Everton/squad',
 'http://www.premierleague.com//clubs/26/Leicester-City/squad',
 'http://www.premierleague.com//clubs/10/Liverpool/squad',
 'http://www.premierleague.com//clubs/11/Manchester-City/squad',
 'http://www.premierleague.com//clubs/12/Manchester-United/squad',
 'http://www.premierleague.com//clubs/23/Newcastle-United/squad',
 'http://www.premierleague.com//clubs/14/Norwich-City/squad',
 'http://www.premierleague.com//clubs/18/Sheffield-United/squad',
 'http://www.premierleague.com//clubs/20/Southampton/squad',
 'http://www.premierleague.com//clubs/21/Tottenham-Hotspur/squad',
 'http://www.premierleague.com//clubs/33/Watford/squad',
 'http://www.premierleague.com//clubs/25/West-Ham-United/squad',
 'http://www.premierleague.com//clubs/38/Wolverhampton-Wanderers/squad']

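The multiprocessing library will try to dispatch

playerlinks('http://www.premierleague.com//clubs/1/Arsenal/squad')
playerlinks('http://www.premierleague.com//clubs/2/Aston-Villa/squad') ...

across n cores. Each call receives a single URL string rather than a list, so teamLinks[i] inside the function picks out one character of that string. The first call,

playerlinks('http://www.premierleague.com//clubs/1/Arsenal/squad')

therefore requests the URL 'h', which is what raises the error.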
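To see where the single character 'h' comes from without touching the network, here is a minimal sketch. It assumes a hypothetical stand-in function, first_item, that indexes its argument the same way the traceback shows (teamLinks[i]) but makes no request, and it uses ThreadPool, which exposes the same map interface as Pool, so the snippet runs as a plain script.

```python
from multiprocessing.pool import ThreadPool


def first_item(teamLinks):
    # Hypothetical stand-in for playerlinks: index the argument, make no request.
    return teamLinks[0]


team = [
    "http://www.premierleague.com//clubs/1/Arsenal/squad",
    "http://www.premierleague.com//clubs/2/Aston-Villa/squad",
]

# map() hands each element of team to the function separately,
# so the function receives a string and indexes into that string.
with ThreadPool(2) as p:
    per_element = p.map(first_item, team)

# Calling the function directly with the whole list indexes into the list.
whole_list = first_item(team)

print(per_element)
print(whole_list)
```

The per-element calls index into a URL string, while the direct call indexes into the list, which is the difference between the code working sequentially and failing under map.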

Modify the playerlinks function so that it operates on a single element of the team variable, and the problem will go away.

Try something like this:

import requests
# html is assumed to come from lxml, as html.fromstring and cssselect suggest
from lxml import html

def playerlinks_atomic(teamLinks):
    # teamLinks is a single squad-page URL: p.map passes one element per call
    squadPage = requests.get(teamLinks)
    squadTree = html.fromstring(squadPage.content)

    # Use local lists so each call returns only its own team's links
    playerLink1 = []
    playerLink2 = []

    #...Extract the player links...
    playerLocation = squadTree.cssselect('.playerOverviewCard')

    #...For each player link within the team page...
    for i in range(len(playerLocation)):

        #...Save the link, complete with domain...
        playerLink1.append("http://www.premierleague.com/" + playerLocation[i].attrib['href'])

        #...For the second link, change the page from player overview to stats
        playerLink2.append(playerLink1[i].replace("overview", "stats"))
    return playerLink1, playerLink2
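Because playerlinks_atomic returns a (playerLink1, playerLink2) tuple, p.map(playerlinks_atomic, team) would give back a list with one such tuple per team. The sketch below shows one way to merge those into two flat lists. The helper name flatten_results is hypothetical, and the strings in results are placeholders standing in for real links, not scraped data.

```python
from itertools import chain


def flatten_results(results):
    """Merge per-team (overview_links, stats_links) tuples into two flat lists."""
    overview = list(chain.from_iterable(pair[0] for pair in results))
    stats = list(chain.from_iterable(pair[1] for pair in results))
    return overview, stats


# Placeholder data in the shape p.map(playerlinks_atomic, team) would return:
# one (playerLink1, playerLink2) tuple per team. These are not real links.
results = [
    (["team1/p1/overview", "team1/p2/overview"], ["team1/p1/stats", "team1/p2/stats"]),
    (["team2/p1/overview"], ["team2/p1/stats"]),
]

all_overview, all_stats = flatten_results(results)
print(all_overview)
print(all_stats)
```

map preserves input order, so the flattened lists keep the teams in the same order as the team variable.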

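What does an example of the href attribute look like?

I sort of understand what you mean. Does this mean I have to rewrite the function so that it handles only one URL in order to multiprocess it? Meaning no for loop?

You may also want to return the teamLinks argument, because from the returned data structure alone you cannot tell which input argument produced which player links.

@MisterButter: for network-bound operations (like the one you posted), you are better off using multithreading. Also, explicitly set the number of cores/threads to use in p = Pool(#cores/threads).

By default, Pool uses as many processes as the CPU has cores. That is a reasonable default for the kind of work Pool is meant for. Having more processes in the pool makes the pool workers compete for CPU resources. Using fewer processes means you are not using all the available CPU resources.

@RolandSmith: for I/O-bound and network-bound tasks, it can make sense to create considerably more pool processes than cores, since those processes will probably spend most of their time blocked. Is my understanding correct?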
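The two suggestions from the comments, using threads for network-bound work and returning the input URL along with the result, can be sketched together. This is only an illustration under stated assumptions: fake_fetch is a hypothetical stand-in that sleeps instead of downloading, the value it returns is arbitrary, and max_workers=8 is an arbitrary choice rather than a recommended setting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a network request: sleeps instead of downloading."""
    time.sleep(0.05)
    # Return the input together with the result so the caller can match them up.
    return url, len(url)


urls = [
    "http://www.premierleague.com//clubs/1/Arsenal/squad",
    "http://www.premierleague.com//clubs/2/Aston-Villa/squad",
    "http://www.premierleague.com//clubs/127/Bournemouth/squad",
]

# The worker count is set explicitly; threads that are blocked waiting
# do not compete for CPU, so it may exceed the number of cores.
with ThreadPoolExecutor(max_workers=8) as executor:
    by_url = dict(executor.map(fake_fetch, urls))

for url, value in by_url.items():
    print(url, value)
```

Keying the results by URL means each team's output can be looked up directly, regardless of how the work was scheduled.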