How to speed up a Python nested loop over millions of elements

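I am trying to pair up objects from two datasets (one with about 0.5 million elements, the other with about 2 million) that satisfy certain conditions, and then save the information of both objects to a file. Many of the variables are not involved in the pairing calculation, but they matter for my follow-up analysis, so I need to keep track of them and save them. If there is a way to vectorize the whole analysis, it would be much faster. Below I use random numbers as an example: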

import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord
from PyAstronomy import pyasl


RA1 = np.random.uniform(0,360,500000)
DEC1 = np.random.uniform(-90,90,500000)
d = np.random.uniform(55,2000,500000)
z = np.random.uniform(0.05,0.2,500000)
e = np.random.uniform(0.05,1.0,500000)
s = np.random.uniform(0.05,5.0,500000)
RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)
n = np.random.randint(10,10000,2000000)
m = np.random.randint(10,10000,2000000)

f = open('results.txt','a')
for i in range(len(RA1)):
    if i % 50000 == 0:
        print(i)
    ra1 = RA1[i]
    dec1 = DEC1[i]
    c1 = SkyCoord(ra=ra1*u.degree, dec=dec1*u.degree)
    for j in range(len(RA2)):
        ra2 = RA2[j]
        dec2 = DEC2[j]
        c2 = SkyCoord(ra=ra2*u.degree, dec=dec2*u.degree)

        ang = c1.separation(c2)
        sep = d[i] * ang.radian
        pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)

        if sep < 1.5:
            np.savetxt(f,np.c_[ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j]], fmt = '%1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %i   %i')

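Here is an implementation that uses an in-memory buffer to reduce I/O. Note: to be more compatible with Python 3, I prefer to use the io module for file input/output. I think that is best practice, and it will not hurt your performance.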

import io

# 'ab': the buffer holds bytes, so the file must be opened in binary mode
with io.open('results.txt', 'ab') as f:
    buf = io.BytesIO()
    for i in xrange(len(RA1)):  # Python 2; use range() on Python 3
        if i % 50000 == 0:
            print(i)
            # flush the buffered rows to disk, then reset the buffer
            f.write(buf.getvalue())
            buf.seek(0)
            buf.truncate(0)
        ra1 = RA1[i]
        dec1 = DEC1[i]
        c1 = SkyCoord(ra=ra1 * u.degree, dec=dec1 * u.degree)
        for j in xrange(len(RA2)):
            ra2 = RA2[j]
            dec2 = DEC2[j]
            c2 = SkyCoord(ra=ra2 * u.degree, dec=dec2 * u.degree)

            ang = c1.separation(c2)
            sep = d[i] * ang.radian
            pa = pyasl.positionAngle(ra1, dec1, ra2, dec2)

            if sep < 1.5:
                np.savetxt(buf, np.c_[ra1, dec1, sep, z[i], e[i], s[i], n[j], m[j]],
                           fmt='%1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %i   %i')
    # write whatever is still buffered after the last iteration
    f.write(buf.getvalue())

Note: in Python 2, I use xrange instead of range to reduce memory usage.

buf.truncate(0) can be replaced by a new instance, like this: buf = io.BytesIO(). That may be more efficient...
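One caveat when reusing the buffer on Python 3: truncate(0) discards the contents but does not move the stream position, so the next write lands past the end and the gap is filled with zero bytes. A minimal sketch of the difference, using only the standard io module (the variable names padded and clean are just for illustration):

```python
import io

# Reset with truncate(0) only: the position stays at 3.
buf = io.BytesIO()
buf.write(b"abc")
buf.truncate(0)
buf.write(b"x")
padded = buf.getvalue()   # the gap before "x" is zero-filled

# Reset with seek(0) followed by truncate(0): the buffer is really empty.
buf = io.BytesIO()
buf.write(b"abc")
buf.seek(0)
buf.truncate(0)
buf.write(b"x")
clean = buf.getvalue()

print(repr(padded))
print(repr(clean))
```

Creating a fresh io.BytesIO() instead sidesteps the issue entirely.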

Used this way, savetxt is basically:

astr = fmt % (ra1,dec1,sep,z[i],e[i],s[i],n[j],m[j])
astr += '\n'  # or include in fmt
f.write(astr)

In other words, it just writes a formatted line to the file.
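As a rough illustration of that equivalence, the sketch below formats one row with the % operator and writes it directly, with no savetxt call. The helper name write_row and the sample values are hypothetical; the format string is the one from the question with a newline appended.

```python
import io

# Same column layout as the question's fmt, plus the trailing newline
fmt = '%1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %i   %i\n'

def write_row(f, ra1, dec1, sep, z, e, s, n, m):
    """Format one matched pair and write it to an open text file object."""
    f.write(fmt % (ra1, dec1, sep, z, e, s, n, m))

# An in-memory text buffer stands in for the results file here
out = io.StringIO()
write_row(out, 10.0, -5.0, 0.8727, 0.1, 0.5, 2.0, 42, 7)
print(out.getvalue())
```

Skipping np.c_ and savetxt avoids building a small array for every matching pair.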

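A first way to speed this up: c2 = SkyCoord is computed len(RA1) times for every (ra2, dec2) pair. You can avoid that by building the SkyCoord objects once and keeping them in lists: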

f = open('results.txt','a')
C1 = [SkyCoord(ra=ra1*u.degree, dec=DEC1[i]*u.degree)
      for i, ra1 in enumerate(RA1)]
C2 = [SkyCoord(ra=ra2*u.degree, dec=DEC2[i]*u.degree)
      for i, ra2 in enumerate(RA2)]  # buffer coords

for i, c1 in enumerate(C1):  # we only need enumerate() to get i
    for j, c2 in enumerate(C2):
        ang = c1.separation(c2)  # note we don't have to calculate c2
        if d[i] < 1.5 / ang.radian:
            # equivalent to d[i] * ang.radian < 1.5; the right-hand side
            # depends on the pair (i, j), so it is recomputed every iteration

            # the next line is only executed if objects are close enough
            pa = pyasl.positionAngle(RA1[i], DEC1[i], RA2[j], DEC2[j])
            np.savetxt('...whatever')

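The basic question you need to ask yourself is: can you reduce your dataset?

If not, I have some bad news: 500000 * 2000000 is 1e12. That means you are trying to do a trillion operations.

The angular separation involves some trigonometric functions (I think cos, sin and sqrt are involved here), so each operation takes roughly hundreds of nanoseconds to microseconds. Assuming each operation takes 1 us, you would still need 12 days to complete this. That assumes you have no Python loop or IO overhead, and I think 1 us is a reasonable estimate for this kind of operation.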
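The estimate is simple arithmetic and can be checked in a few lines. The helper name days_needed is hypothetical, and the 1 us per-operation cost is the assumption stated above, not a measurement:

```python
# Number of (object 1, object 2) pairs that the double loop visits
n_pairs = 500000 * 2000000

def days_needed(n_ops, seconds_per_op):
    """Wall-clock days for n_ops operations at a fixed cost per operation."""
    return n_ops * seconds_per_op / 86400.0

print(n_pairs)                      # 1000000000000
print(days_needed(n_pairs, 1e-6))   # about 11.6 days at 1 us per pair
```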

But there are certainly ways to optimize it:

SkyCoord allows vectorization, but only in 1D:

# Create the SkyCoord for the longer array once
c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
# and calculate the separation from each coordinate of the shorter list
for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
    c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    # x will be the angular separation with a length of your RA2 and DEC2 arrays
    x = c1.separation(c2)

This yields a speedup of several orders of magnitude:

# note that I made these MUCH shorter
RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,10)
DEC2 = np.random.uniform(-90,90,10)

def test(RA1, DEC1, RA2, DEC2):
    """Version with vectorized inner loop."""
    c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
    for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        x = c1.separation(c2)

def test2(RA1, DEC1, RA2, DEC2):
    """Double loop."""
    for ra, dec in zip(RA1, DEC1):
        c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
        for ra, dec in zip(RA2, DEC2):
            c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
            x = c1.separation(c2)

%timeit test(RA1, DEC1, RA2, DEC2)  # 1 loop, best of 3: 225 ms per loop
%timeit test2(RA1, DEC1, RA2, DEC2) # 1 loop, best of 3: 2.71 s per loop

That is already 10 times faster, and it also scales much better:

RA1 = np.random.uniform(0,360,5)
DEC1 = np.random.uniform(-90,90,5)
RA2 = np.random.uniform(0,360,2000000)
DEC2 = np.random.uniform(-90,90,2000000)

%timeit test(RA1, DEC1, RA2, DEC2)  # 1 loop, best of 3: 2.8 s per loop

# test2 scales so bad I only use 50 elements here
RA2 = np.random.uniform(0,360,50)
DEC2 = np.random.uniform(-90,90,50)
%timeit test2(RA1, DEC1, RA2, DEC2)  # 1 loop, best of 3: 11.4 s per loop

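Note that by vectorizing the inner loop I was able to process 40,000 times more elements in about 1/4 of the time. So vectorizing the inner loop should make this roughly 200,000 times faster (40,000 x 11.4 s / 2.8 s is about 160,000).

Here we computed 5 times 2 million separations in about 3 seconds, so each operation takes roughly 300 ns. At that rate the full task would take about 3 days.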
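The vectorized inner loop can be combined with the sep < 1.5 cut from the question by selecting matches with a boolean mask and writing all matches for one outer element in a single savetxt call. The following is only a sketch under stated assumptions: it uses plain NumPy with the haversine formula as a stand-in for SkyCoord.separation (so it has no astropy dependency and may differ from astropy in the last digits), it leaves out the position angle, and the names angular_separation and write_matches are hypothetical.

```python
import io
import numpy as np

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = np.sin((dec2 - dec1) / 2.0)
    sin_dra = np.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + np.cos(dec1) * np.cos(dec2) * sin_dra**2
    return 2.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def write_matches(out, RA1, DEC1, d, z, e, s, RA2, DEC2, n, m, max_sep=1.5):
    """Write one row per pair with d[i] * angle < max_sep; return the row count."""
    fmt = '%1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %1.4f   %i   %i'
    total = 0
    for i in range(len(RA1)):
        # one vectorized call covers every element of the second dataset
        sep = d[i] * angular_separation(RA1[i], DEC1[i], RA2, DEC2)
        hit = np.flatnonzero(sep < max_sep)
        if hit.size == 0:
            continue
        k = hit.size
        rows = np.column_stack([
            np.full(k, RA1[i]), np.full(k, DEC1[i]), sep[hit],
            np.full(k, z[i]), np.full(k, e[i]), np.full(k, s[i]),
            n[hit], m[hit],
        ])
        np.savetxt(out, rows, fmt=fmt)  # one call per outer element
        total += k
    return total

# Tiny usage example with an in-memory buffer instead of results.txt
out = io.BytesIO()
count = write_matches(
    out,
    np.array([10.0]), np.array([0.0]),                  # RA1, DEC1
    np.array([100.0]),                                  # d
    np.array([0.5]), np.array([0.3]), np.array([2.0]),  # z, e, s
    np.array([10.0, 10.5, 50.0]), np.zeros(3),          # RA2, DEC2
    np.array([1, 2, 3]), np.array([4, 5, 6]),           # n, m
)
print(count)
print(out.getvalue().decode())
```

This keeps the outer Python loop, so the total still scales with the trillion pair evaluations; it only removes the per-pair Python and file-writing overhead.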

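Even though you could vectorize the remaining loop as well, I don't think it would give a big speedup, because at this level the loop overhead is much smaller than the computation time per iteration. The line profiler output supports this statement: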

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           def test(RA1, DEC1, RA2, DEC2):
    12         1       216723 216723.0      2.6      c2 = SkyCoord(ra=RA2*u.degree, dec=DEC2*u.degree)
    13         6          222     37.0      0.0      for idx, (ra, dec) in enumerate(zip(RA1, DEC1)):
    14         5       206796  41359.2      2.5          c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    15         5      7847321 1569464.2     94.9          x = c1.separation(c2)

In case it is not obvious from the Hits column, this is from the 5 x 2000000 run. For comparison, here is test2 on a 5 x 20 run:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    17                                           def test2(RA1, DEC1, RA2, DEC2):
    18         6           80     13.3      0.0      for ra, dec in zip(RA1, DEC1):
    19         5       195030  39006.0      0.6          c1 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    20       105         1737     16.5      0.0          for ra, dec in zip(RA2, DEC2):
    21       100      3871427  38714.3     11.8              c2 = SkyCoord(ra=ra*u.degree, dec=dec*u.degree)
    22       100     28870724 288707.2     87.6              x = c1.separation(c2)

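The reason test2 scales worse is that the c2 = SkyCoord part takes 12% of the total time instead of 2.5%, and that every call to separation carries significant overhead. So it is not the Python loop overhead that makes it slow, but the SkyCoord constructor and the static parts of separation.

Obviously you would also need to vectorize the pa calculation and the saving to file (I haven't worked with PyAstronomy and numpy.savetxt, so I can't advise there).

But the problem remains that doing a trillion trigonometric operations on a normal computer is simply not feasible.

Some other ideas for how to reduce the time: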

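Use multiprocessing so that every core of your computer works in parallel; in theory this speeds things up by the number of cores. In practice that is not achievable, and I would suggest ...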

aMask=(abs(RA1[:,None]-RA2[None,:])<2)&(abs(DEC1[:,None]-DEC2[None,:])<2)

locs=np.where(aMask)

(array([   0,    2,    4, ..., 4998, 4999, 4999], dtype=int32),
 array([3575, 1523, 1698, ..., 4869, 1801, 2792], dtype=int32))