Python: How to better perform this numpy calculation

python numpy

I have a text file like this:

0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1

Let's label these columns as:

s1 a s2 r t

I also have another array, V, with dummy values (for simplicity). In the examples below, V[0] is 10 and V[1] is 20.

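I want to perform a certain calculation on these numbers in a performant way. The calculation is: for each s1, I want the maximum, over every a, of the sum of t*(r+V[s1]). For example,

for s1=0, a=0 we get sum = 2*(1+10) + 1*(3+10) = 35
for s1=0, a=1 we get sum = 1*(4+10) + 3*(2+10) = 50

So the maximum is 50, which is the value I want as the output for s1=0. Also note that in the calculations above, 10 is V[s1].

If the last three lines were not in the file, then for s1=1 I would simply return 3*(5+20) = 75, where 20 is V[s1]. So the final desired result is [50, 75].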
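The arithmetic in this example can be checked with a short plain-Python sketch. It assumes the file has been truncated to its first five rows (the last three removed, as described above) and that V holds 10 and 20; the names `rows`, `sums`, and `best` are illustrative and not from the original post.

```python
from collections import defaultdict

# First five rows of the file: s1, a, s2, r, t
rows = [
    (0, 0, 0, 1, 2),
    (0, 0, 1, 3, 1),
    (0, 1, 0, 4, 1),
    (0, 1, 1, 2, 3),
    (1, 0, 0, 5, 3),
]
V = [10, 20]  # assumed dummy values, indexed by s1

# Sum t * (r + V[s1]) for every (s1, a) pair
sums = defaultdict(int)
for s1, a, s2, r, t in rows:
    sums[(s1, a)] += t * (r + V[s1])

# Keep the largest sum seen for each s1
best = {}
for (s1, a), total in sums.items():
    best[s1] = max(best.get(s1, total), total)

result = [best[s1] for s1 in sorted(best)]
print(result)
```

This is only a reference for the expected values; it does not use numpy and makes no attempt to be fast.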

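So I thought it would be best to load it into a numpy array, c1arr (for simplicity, considering only the values for s1=0).

Q1. I can't figure out how to modify the above to get the numpy array [45.0, 80.0], so that I can then apply numpy.max to it.

Q2. When I actually load the file, I can't load it as the c1arr described in the comments above. Instead, I get the following: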

>>> type(a) #a is populated by parsing file
<class 'list'>

>>> print(a)
[[[[0, -0.9, 0.3], [1, 0.9, 0.6]], [[0, -0.2, 0.6], [1, 0.7, 0.3]]], [[[1, 0.2, 1.0]], [[0, -0.8, 1.0]]]]

>>> np.array(a) #note that this is not same as c1arr above
<string>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
array([[list([[0, -0.9, 0.3], [1, 0.9, 0.6]]),
        list([[0, -0.2, 0.6], [1, 0.7, 0.3]])],
       [list([[1, 0.2, 1.0]]),
        list([[0, -0.8, 1.0]])]], dtype=object)


How can I fix this?

Q3. Is there a better overall approach, e.g. laying out the numpy array differently? (Note that I am not allowed to use pandas, only numpy.)

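In my opinion, the most intuitive and maintainable approach is to use pandas, where you can give the columns names. Another important factor is that grouping is much easier in pandas.

Since your input sample contains only integers, I defined V as an integer array too: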

V = np.array([10, 20])

I read your input file as follows:

df = pd.read_csv('Input.txt', sep=' ', names=['s1', 'a', 's2', 'r', 't'])

(print it to see what has been read).

Then, to get the result for each combination of s1 and a, you can run:

result = df.groupby(['s1', 'a']).apply(lambda grp:
    (grp.t * (grp.r + V[grp.s1])).sum())

Note how easy this code is to read, since it refers to named columns.

The result is:

s1  a
0   0     35
    1     50
1   0    138
    1    146
dtype: int64


Each result is an integer because V is also of int type. If you define V as in your post (an array of floats), the results will be floats as well (your choice).
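For reference, the whole pandas step can be run end to end without the input file. The sketch below is a variant rather than the exact code above: it reads the sample from an in-memory string (the `txt` variable is an assumption standing in for Input.txt) and computes the per-row term as a column first, so the grouped step does not depend on how apply passes the grouping columns to the function. The names `term` and `per_pair` are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the contents of Input.txt
txt = """0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1"""

V = np.array([10, 20])

df = pd.read_csv(io.StringIO(txt), sep=' ',
                 names=['s1', 'a', 's2', 'r', 't'])

# Per-row term t * (r + V[s1]); V is indexed by the s1 column
df['term'] = df['t'] * (df['r'] + V[df['s1'].to_numpy()])

# Sum of the terms for every (s1, a) combination
per_pair = df.groupby(['s1', 'a'])['term'].sum()
print(per_pair)
```

With the integer V defined above, the four sums come out as integers, matching the values listed in the result.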

To get the maximum result for each s1, take the maximum of the result above within each s1 group.

This time the result is:

s1
0     50
1    146
dtype: int64
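A minimal sketch of one way to take that per-s1 maximum, assuming the per-(s1, a) sums are held in a Series with a two-level index as shown earlier. The Series is rebuilt by hand here so the block runs on its own; `per_pair` and `best` are illustrative names, and this is not necessarily the command the answer originally used.

```python
import pandas as pd

# Per-(s1, a) sums from the previous step, rebuilt by hand
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1)], names=['s1', 'a'])
per_pair = pd.Series([35, 50, 138, 146], index=index)

# Maximum over "a" within each "s1"
best = per_pair.groupby(level='s1').max()
print(best)
```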

Numpy version

If you really are restricted to Numpy, there is also a solution, although it is harder to read and to update.

Read the input file:

data = np.genfromtxt('Input.txt')
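
Initially I tried the int type, as in the pandasonic solution, but one of your comments states that the two rightmost columns are float. Since a Numpy array must have a single type, the whole array has to be of float type.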
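The effect of that single-type rule can be seen in a small sketch. It assumes the sample is read from an in-memory string instead of Input.txt (the `txt` variable is a stand-in): `np.genfromtxt` with its default dtype returns a two-dimensional float array, so the s1 column has to be cast back to int before it can be used to index V.

```python
import io

import numpy as np

# Stand-in for the contents of Input.txt
txt = """0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1"""

data = np.genfromtxt(io.StringIO(txt))
print(data.dtype, data.shape)

# Column 0 is stored as float; cast it before using it as an index
s1 = data[:, 0].astype(int)
V = np.array([10, 20])
print(V[s1])
```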

Then run the following code:

res = []
# First level grouping - by "s1" (column 0)
for s1 in np.unique(data[:,0]).astype(int):
    dat1 = data[np.where(data[:,0] == s1)]
    res2 = []
    # Second level grouping - by "a" (column 1)
    for a in np.unique(dat1[:,1]):
        dat2 = dat1[np.where(dat1[:,1] == a)]
        # t - column 4, r - column 3
        res2.append((dat2[:,4] * (dat2[:,3] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)

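The result is a Numpy array: the left column contains the s1 values and the right column the maximum over the second-level groups.

Numpy version with a structured array

You can also use a Numpy structured array. The code is then at least more readable, because you refer to column names instead of column numbers.

Read the array, passing a dtype that gives the column names and types: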

data = np.genfromtxt(io.StringIO(txt), dtype=[('s1', '<i4'), ('a', '<i4'), ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')])
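The line above relies on `io` being imported and on `txt` holding the file contents, neither of which is shown. A self-contained sketch under that assumption, with `txt` filled from the sample data, shows what the structured array looks like and how fields are selected by name (`first_group` is an illustrative name):

```python
import io

import numpy as np

# Stand-in for the contents of Input.txt
txt = """0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1"""

dtype = [('s1', '<i4'), ('a', '<i4'), ('s2', '<i4'),
         ('r', '<f8'), ('t', '<f8')]
data = np.genfromtxt(io.StringIO(txt), dtype=dtype)

# One record per input row; columns are addressed by name
print(data.shape, data.dtype.names)

# Boolean masks work on a field just as on a column slice
first_group = data[data['s1'] == 0]
print(first_group['t'])
```

Because s1 is stored as an integer field here, it can index V directly without a cast.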

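Then run the same grouping, referring to the columns by name:

res = []
# First level grouping - by "s1"
for s1 in np.unique(data['s1']):
    dat1 = data[np.where(data['s1'] == s1)]
    res2 = []
    # Second level grouping - by "a"
    for a in np.unique(dat1['a']):
        dat2 = dat1[np.where(dat1['a'] == a)]
        res2.append((dat2['t'] * (dat2['r'] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)

Maybe Code Review would be a better fit for this?

@RandomDavis The first two questions are not on-topic on CR... I know I could do this better with pandas, but isn't there a way to do it using only numpy?

Yes, you can also do it with numpy alone. I added such a solution.

This looks great. I'm just curious, though: does this approach get any performance benefit from numpy vectorization? I ask because I thought we get that benefit when we avoid for loops.

Avoiding loops is usually a good idea, but unfortunately numpy has no native groupby feature like Pandas does, so I implemented the grouping with loops. Compare the execution time of my first (pandasonic) solution, which uses groupby, with the second (numpythonic) version on a larger data sample. The pandasonic solution, which has no explicit loops, may well run faster (contrary to the general rule that Numpy code runs faster). As for your question about building res as a Numpy array: try it on a larger data sample, but I don't think the difference will be big.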
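On the vectorization question raised in the comments, a loop-free variant is possible with numpy alone. The sketch below is one possibility rather than the answer's method. It assumes s1 and a are small non-negative integers and that every s1 value is a valid index into V. It relies on `np.unique(..., return_inverse=True)`, `np.bincount` with `weights`, and `np.maximum.at`; all variable names are illustrative.

```python
import numpy as np

# Sample data: columns s1, a, s2, r, t
data = np.array([
    [0, 0, 0, 1, 2],
    [0, 0, 1, 3, 1],
    [0, 1, 0, 4, 1],
    [0, 1, 1, 2, 3],
    [1, 0, 0, 5, 3],
    [1, 0, 1, 1, 3],
    [1, 1, 0, 4, 5],
    [1, 1, 1, 6, 1],
], dtype=float)
V = np.array([10.0, 20.0])

s1 = data[:, 0].astype(int)
a = data[:, 1].astype(int)
term = data[:, 4] * (data[:, 3] + V[s1])  # t * (r + V[s1]) per row

# Encode each (s1, a) pair as a single integer key
width = a.max() + 1
key = s1 * width + a
uniq_key, pair_id = np.unique(key, return_inverse=True)

# Sum of terms per (s1, a) pair
sums = np.bincount(pair_id, weights=term)

# Maximum of those sums per s1
pair_s1 = uniq_key // width
s1_vals, s1_id = np.unique(pair_s1, return_inverse=True)
best = np.full(len(s1_vals), -np.inf)
np.maximum.at(best, s1_id, sums)

result = np.column_stack([s1_vals, best])
print(result)
```

Whether this is actually faster than the looped version or the pandas groupby depends on the data size, so it is worth timing on a realistic sample before adopting it.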