Python: Encapsulating vectorised functions for use with Pandas DataFrames


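I've been refactoring some code, and using it to explore how to structure maintainable, flexible, concise code when using Pandas and Numpy. (Normally I only use them briefly; I'm now in a role where I'm expected to become an expert.)

One example I've come across is a function that is sometimes called on one column of values and sometimes on three columns of values. Vectorised code using Numpy encapsulates it wonderfully. But using it becomes a bit clunky.

How should I "better" write the following function?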

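import numpy as np


def project_unit_space_to_index_space(v, vertices_per_edge):
    # Maps values (assumed to lie in [-1, 1]) to integer indices in [0, vertices_per_edge - 1]
    return np.rint((v + 1) / 2 * (vertices_per_edge - 1)).astype(int)


# Pack the three columns into a single (3, n_rows) array.
# (Named `inputs` rather than `input` to avoid shadowing the built-in.)
inputs = np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0)

index_space = project_unit_space_to_index_space(inputs, 42)

# some_other_transformation_code, foo and bar are placeholders
magic_space = some_other_transformation_code(index_space, foo, bar)

df['x_'], df['y_'], df['z_'] = magic_space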

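As written, the function can accept one column of data or many columns of data, and still works correctly and quickly.

The return type is the right shape to pass directly into another similarly structured function, allowing me to chain functions neatly.

Even assigning the results back to new columns in the dataframe isn't "awful", though it is a little clunky.

But packing the inputs into a single np.ndarray is very clunky indeed.

I haven't found any style guides that cover this. There is material everywhere on iterrows, lambda expressions and so on, but I've found nothing on best practices for encapsulating this kind of logic.

So, how do you structure the code above?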

EDIT: Timings of various options for collating the inputs

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].unstack().to_numpy())                      
# 1.44 ms ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].to_numpy().T)                              
# 558 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(df[['x','y','z']].transpose().to_numpy())                    
# 817 µs ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit test = project_unit_sphere_to_unit_cube(np.concatenate(([df['x']], [df['y']], [df['z']]), axis=0))   
# 3.46 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You are making an (n, m) array from n columns of the dataframe:

In [103]: np.concatenate([[df[0]],[df[1]],[df[2]]],0)                           
Out[103]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

A more compact way is to transpose the array of those columns:

In [104]: df.to_numpy().T                                                       
Out[104]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])

The dataframe also has its own transpose:

In [109]: df.transpose().to_numpy()                                             
Out[109]: 
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
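
As a quick check, the three ways of collating the columns can be compared directly. This is a minimal sketch that assumes a small integer DataFrame built to match the outputs shown; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed sample frame: 4 rows, 3 columns labelled 0, 1, 2.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

via_concatenate = np.concatenate([[df[0]], [df[1]], [df[2]]], 0)
via_numpy_t = df.to_numpy().T
via_transpose = df.transpose().to_numpy()

# All three give the same (n_columns, n_rows) array.
assert np.array_equal(via_concatenate, via_numpy_t)
assert np.array_equal(via_numpy_t, via_transpose)
print(via_numpy_t.shape)  # (3, 4)
```

The options differ only in where the transpose happens, which is consistent with the timings in the question favouring `.to_numpy().T`.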

Your calculation also works with a dataframe directly, returning a dataframe with the same shape and index:

In [113]: np.rint((df+1)/2 *(42-1)).astype(int)                                 
Out[113]: 
     0    1    2
0   20   41   62
1   82  102  123
2  144  164  184
3  205  226  246

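Some numpy functions convert their inputs to numpy arrays and return an array. Others, by delegating the details to pandas methods, can work directly on a dataframe and return a dataframe.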
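
The difference between the two kinds of function can be seen in a short sketch. It reuses the same assumed 4x3 sample frame; `np.rint` stands for the functions that hand back a dataframe, and `np.concatenate` for those that return a plain array.

```python
import numpy as np
import pandas as pd

# Assumed sample frame, as in the outputs above.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

# np.rint is a ufunc: applied to a DataFrame it returns a DataFrame,
# so the index and column labels are preserved.
projected = np.rint((df + 1) / 2 * (42 - 1)).astype(int)
print(type(projected).__name__)  # DataFrame

# np.concatenate converts its inputs to arrays and returns an ndarray.
stacked = np.concatenate([[df[0]], [df[1]], [df[2]]], 0)
print(type(stacked).__name__)  # ndarray
```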

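I don't like accepting my own answers, so I won't change the accepted answer.

@hpaulj helped me explore this further by making me aware of additional functionality and opportunities. That helped me define my competing objectives more clearly, and start assigning priorities to them.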

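1. Code should be concise/compact and maintainable, not full of boilerplate, including:
   - calling the function
   - utilising the results
   - the function implementation itself

2. Function performance should not be compromised:
   - 5% slower but better in every other way is acceptable
   - 100% slower is probably never acceptable

3. The implementation should be as data-type agnostic as possible:
   - one function for scalars and another for vectors is less than ideal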

This has led me to my currently preferred implementation/style:

import numpy as np


def scale_unit_cube_to_unit_sphere(*values):
    """
    Scales all the inputs (on a row basis for array_like types) such that when
    treated as n-dimensional vectors, their scale is always 1.

    (Divides the vector represented by each row of inputs by that row's
     root-of-sum-of-squares, so as to normalise to a unit magnitude.)

    Examples - Scalar Inputs
    --------

    >>> scale_unit_cube_to_unit_sphere(1, 1, 1)
    [0.5773502691896258, 0.5773502691896258, 0.5773502691896258]

    Examples - Array Like Inputs
    --------

    >>> x = [ 1, 2, 3]
    >>> y = [ 1, 4, 3]
    >>> z = [ 1,-3,-1]
    >>> scale_unit_cube_to_unit_sphere(x, y, z)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> a = np.array([x, y, z])
    >>> scale_unit_cube_to_unit_sphere(*a)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> t = (x, y, z)
    >>> scale_unit_cube_to_unit_sphere(*t)
    [array([0.57735027, 0.37139068, 0.6882472 ]),
     array([0.57735027, 0.74278135, 0.6882472 ]),
     array([ 0.57735027, -0.55708601, -0.22941573])]

    >>> df = pd.DataFrame(data={'x':x,'y':y,'z':z})
    >>> scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
    [0    0.577350
     1    0.371391
     2    0.688247
     dtype: float64,
     0    0.577350
     1    0.742781
     2    0.688247
     dtype: float64,
     0    0.577350
     1   -0.557086
     2   -0.229416
     dtype: float64]

    For all array_like inputs, the results can then be utilised in similar
    ways, such as writing them to an existing DataFrame as follows:

    >>> transform = scale_unit_cube_to_unit_sphere(df['x'], df['y'], df['z'])
    >>> df['i'], df['j'], df['k'] = transform

    """
    # Scale the position in space to be a unit vector, as on the surface of a sphere
    ################################################################################

    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]

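As per the docstring, this works with scalars, arrays, Series, etc., regardless of whether it is given one scalar, three scalars, n scalars, n arrays, and so on.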

(I don't yet have a tidy way to pass in a single DataFrame rather than three separate Series, but that is low priority for now.)
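
One possible way to pass a whole DataFrame, sketched here on the assumption that each selected column is one vector component, is to unpack the columns into the `*values` signature. The helper name `apply_to_columns` is hypothetical; it is not part of pandas or of the code above.

```python
import numpy as np
import pandas as pd


def scale_unit_cube_to_unit_sphere(*values):
    # Same body as the function above, repeated so this sketch runs on its own.
    scaler = np.sqrt(sum([np.multiply(v, v) for v in values]))
    return [np.divide(v, scaler) for v in values]


def apply_to_columns(func, df, columns):
    """Hypothetical helper: pass the named columns to func as separate Series."""
    return func(*(df[col] for col in columns))


df = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 4, 3], 'z': [1, -3, -1]})

i, j, k = apply_to_columns(scale_unit_cube_to_unit_sphere, df, ['x', 'y', 'z'])
df['i'], df['j'], df['k'] = i, j, k

# Each row of (i, j, k) should now have unit length.
print(np.allclose(np.sqrt(df['i']**2 + df['j']**2 + df['k']**2), 1.0))  # True
```

Unpacking `df[['x', 'y', 'z']].to_numpy().T` with `*` would work in the same way, at the cost of returning arrays rather than Series.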

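They also work in "chains", as in the chained example further below (the implementation of the functions isn't relevant, just the pattern of chaining inputs to outputs).

Comments and critiques are welcome.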

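If you are looking to replace the np.concatenate part, you can make a list of the columns and do df[list_of_cols_to_pass].unstack().to_numpy().

@anky: "Fixing" the way the multiple inputs are collated is certainly one option, thanks. I'd post that as an answer ;) However, would you structure any of it differently if you were writing this?

This is good. hpaulj's alternative is better than unstack, since it uses pure numpy (and is therefore faster); I didn't think of that. It depends on some_other_transformation_code; if I were to restructure the code, the rest looks fine. It would help if you could explain which parts other than the input you don't like :)

@anky I'm mostly speaking from a position of ignorance. This is the best I could do, inventing a pattern myself from the functionality I could find. I'm curious whether other patterns exist, because encapsulating "operations on dataframes" seems poorly covered (if at all) in any free online material. I generally doubt that my best attempt is the best option ;)

+1: Certainly an improvement in conciseness, and faster too (.to_numpy().T is slightly faster). Do you have any other ways to structure code like this, where there are N input columns and M output columns (which may or may not be "added" to the dataframe)?

I tried your function on a sample dataframe, without the conversion. It works, returning a dataframe.

Good spot, thank you, I'm learning :) This led me down a rabbit hole of making my other functions work with both dataframes and numpy arrays. I managed that, but it moved the "mess" to writing the output back to the dataframe. df['x_'], df['y_'], df['z_'] = returned_df.to_numpy().T lets me give the columns any names I like, and avoids clunky renaming. I don't know whether there is a better way. I suppose that is another question :)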
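
The write-back pattern described in the last comment can be sketched as follows. `returned_df` stands in for whatever DataFrame a transformation returns; here it is produced by the projection formula from the question, on made-up sample data, purely as an illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data in the range [-1, 1].
df = pd.DataFrame({
    'x': [-1.0, 0.0, 1.0],
    'y': [0.5, -0.5, 0.0],
    'z': [1.0, 1.0, -1.0],
})

# Illustrative transformation that accepts and returns a DataFrame.
returned_df = np.rint((df[['x', 'y', 'z']] + 1) / 2 * (42 - 1)).astype(int)

# Transposing the underlying array yields one row per column, which unpacks
# into freshly named columns without a rename step.
df['x_'], df['y_'], df['z_'] = returned_df.to_numpy().T

print(df)
```

Note that the assignment is positional: the arrays carry no index, so this assumes `returned_df` has the same row order as `df`.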

cube, ix = generate_index_cube(vertices_per_edge)

df = pd.DataFrame(
         data  = {
             'x': cube[0],
             'y': cube[1],
             'z': cube[2],
         },
         index = ix,
     )

unit = scale_index_to_unit(vertices_per_edge, *cube)

distortion = scale_unit_to_distortion(distortion_factor, *unit)

df['a'], df['b'], df['c'] = distortion

sphere = scale_unit_cube_to_unit_sphere(*distortion)

df['i'], df['j'], df['k'] = sphere

recovered_distortion = scale_unit_sphere_to_unit_cube(*sphere)

df['a_'], df['b_'], df['c_'] = recovered_distortion

recovered_cube = scale_unit_to_index(
                     vertices_per_edge,
                     *scale_distortion_to_unit(
                         distortion_factor,
                         *recovered_distortion,
                     ),
                 )

df['x_'], df['y_'], df['z_'] = recovered_cube

print(len(df[np.logical_not(np.isclose(df['a'], df['a_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['b'], df['b_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['c'], df['c_']))]))  # No Differences

print(len(df[np.logical_not(np.isclose(df['x'], df['x_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['y'], df['y_']))]))  # No Differences
print(len(df[np.logical_not(np.isclose(df['z'], df['z_']))]))  # No Differences
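
The chain above relies on helper functions whose implementations are not shown. As a rough sketch of the pattern only, the pair below is a guess at what `scale_index_to_unit` and `scale_unit_to_index` might look like, derived from the projection formula in the question; the real implementations may well differ.

```python
import numpy as np


def scale_index_to_unit(vertices_per_edge, *values):
    """Assumed behaviour: map indices 0..vertices_per_edge-1 onto [-1, 1]."""
    return [np.divide(v, vertices_per_edge - 1) * 2 - 1 for v in values]


def scale_unit_to_index(vertices_per_edge, *values):
    """Assumed inverse: map [-1, 1] back to the nearest integer index."""
    return [
        np.rint(np.add(v, 1) / 2 * (vertices_per_edge - 1)).astype(int)
        for v in values
    ]


vertices_per_edge = 5
x = np.array([0, 1, 2, 3, 4])
y = np.array([4, 3, 2, 1, 0])

# Each function takes its parameters first and the components as *values,
# so the list returned by one call unpacks straight into the next.
unit = scale_index_to_unit(vertices_per_edge, x, y)
recovered = scale_unit_to_index(vertices_per_edge, *unit)

# The round trip should reproduce the original indices.
print(all(np.array_equal(a, b) for a, b in zip((x, y), recovered)))  # True
```

Putting the scalar parameters before `*values` is what lets every stage share the same calling convention, whether it is given scalars, arrays, or Series.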