Python pandas: compare each row with another row selected by a column value, and take the difference


I'm struggling with groupby in Python pandas. How should I do the following? For each fruit, I want to find the difference from that fruit's "Step 0" value.

df = pd.DataFrame({'Fruit' : ['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'], 'Step' : [0, 1, 2, 0, 1, 2], 'Value' : [100, 102, 105, 200, 210, 195] })

    Fruit  Step  Value     to-be
0   Apple     0    100  -->  0
1   Apple     1    102  -->  2
2   Apple     2    105  -->  5
3  Banana     0    200  -->  0
4  Banana     1    210  --> 10
5  Banana     2    195  --> -5

Thanks, everyone!

This should do it:

df.groupby('Fruit').apply(lambda g: g.Value - g[g.Step == 0].Value.values[0])

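First, we group by the column you care about (Fruit). Then we apply a function to each group (using lambda, which lets us define the function inline). For each group, we find the rows where g.Step == 0, take the Value entry from those rows, and use values[0] to get the first one (in case more than one row has g.Step == 0). Then we simply subtract that value from every row in the group and return the result.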
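To make the steps above concrete, here is a small sketch that does by hand, for a single group, what the lambda does for each one. The names g, base and diff are only for illustration, and the group is picked manually rather than through groupby.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'],
    'Step': [0, 1, 2, 0, 1, 2],
    'Value': [100, 102, 105, 200, 210, 195],
})

# Pick one group manually; groupby('Fruit') would hand the lambda a frame like this.
g = df[df['Fruit'] == 'Banana']

# Value on the first row where Step == 0 (values[0] guards against several such rows).
base = g[g['Step'] == 0]['Value'].values[0]

# Subtract that baseline from every row in the group.
diff = g['Value'] - base

print(base)           # 200
print(diff.tolist())  # [0, 10, -5]
```

Note that diff keeps the original row labels (3, 4, 5), which matters when the result is assigned back to the DataFrame.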

If you want to add it as a column to the DataFrame, you can drop the index:

res = df.groupby('Fruit').apply(lambda g: g.Value - g[g.Step == 0].Value.values[0])
df['Result'] = res.reset_index(drop=True)
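Dropping the index relies on the rows of res coming back in the same order as the rows of df. As a possible alternative (a sketch, not the approach from the answer above), the Step 0 value can be looked up per fruit and mapped back onto each row, so the new column stays aligned with df by its own index. The name base is illustrative, and the sketch assumes every fruit has at least one Step 0 row; a fruit without one would get NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'],
    'Step': [0, 1, 2, 0, 1, 2],
    'Value': [100, 102, 105, 200, 210, 195],
})

# One baseline per fruit: the Value on its first Step == 0 row, indexed by Fruit.
base = (
    df[df['Step'] == 0]
    .drop_duplicates('Fruit')      # keep only the first Step == 0 row per fruit
    .set_index('Fruit')['Value']
)

# Look up each row's baseline by its Fruit and subtract it.
df['Result'] = df['Value'] - df['Fruit'].map(base)

print(df)
```

Because map returns a Series with the same index as df, no reset_index is needed and the row order of df does not matter.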

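I think this does it. It just loops over the rows and, each time Step equals 0, stores a new "first" value. It then computes the difference from that first value.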

rows = range(df.shape[0])
df['count'] = 0
for r in rows:
    step = df.iloc[r,1]
    value = df.iloc[r,2]
    if step == 0:
        first = value
    df.iloc[r,3] = value - first

I'm new to pandas, but at least the code below works. The result:

    Fruit  Step  Value  to-be
0   Apple     0    100      0
1   Apple     1    102      2
2   Apple     2    105      5
3  Banana     0    200      0
4  Banana     1    210     10
5  Banana     2    195     -5

[6 rows x 4 columns]

The source code is as follows:

import pandas as pd

df = pd.DataFrame({'Fruit' : ['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'], 
                    'Step' : [0, 1, 2, 0, 1, 2], 
                    'Value' : [100, 102, 105, 200, 210, 195] })

list_groups = list()

# loop over dataframe groupby `Fruit`
for name, group in df.groupby('Fruit'):
    group = group.sort_values('Step', ascending=True)  # sort by `Step`; the sorted copy must be assigned back

    row_iterator = group.iterrows()

    # get the base value
    idx, first_row = next(row_iterator)
    base_value = first_row['Value']

    to_be = [0] # store the values of the column `to-be`
    for idx, row in row_iterator:
        to_be.append(row['Value'] - base_value)

    # add a column to group
    group['to-be'] = pd.Series(to_be, index=group.index)

    list_groups.append(group)


# Concatenate dataframes
result = pd.concat(list_groups)

print(result)
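The same idea (sort each fruit by Step, then treat the first row as the base) can probably be written without the explicit loop. The following is a sketch under the assumption that the lowest Step in each group is the baseline row; ordered is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana'],
    'Step': [0, 1, 2, 0, 1, 2],
    'Value': [100, 102, 105, 200, 210, 195],
})

# Sort so that the lowest Step comes first within each fruit.
ordered = df.sort_values(['Fruit', 'Step'])

# transform('first') repeats each group's first Value on every row of that group.
ordered['to-be'] = ordered['Value'] - ordered.groupby('Fruit')['Value'].transform('first')

print(ordered)
```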

@ASGM, I ran your code

res = df.groupby('Fruit').apply(lambda g: g.Value - g[g.Step == 0].Value.values[0])
df['Result'] = res.reset_index(drop=True)

but ran into this problem:

Traceback (most recent call last):
  File "***.py", line 9, in <module>
    df['Result'] = res.reset_index(drop=True)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1887, in __setitem__
    self._set_item(key, value)
  File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1968, in _set_item
    NDFrame._set_item(self, key, value)
  File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1068, in _set_item
    self._data.set(key, value)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3024, in set
    self.insert(len(self.items), item, value)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3039, in insert
    self._add_new_block(item, value, loc=loc)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3162, in _add_new_block
    self.items, fastpath=True)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 1993, in make_block
    placement=placement)
  File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 64, in __init__
    '%d' % (len(items), len(values)))
ValueError: Wrong number of items passed 1, indices imply 3
[Finished in 0.4s with exit code 1]


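This definitely works, but on larger DataFrames it will be slower than grouping. In general, iterating over individual rows doesn't take full advantage of what pandas offers. (I mention this not to attack your answer, but because someone gave me the same advice when I was starting to learn pandas, and it really helped.)

I completely agree, your answer is definitely more powerful here. However, I do sometimes think (especially since this seems to be aimed at a beginner) that something simple can be easier to work with and build on. When I was starting out, I found lambda hard to get my head around!

True! It's always useful to see more than one way of doing things (and I'm pretty sure there's a better way than the one I typed).

This line, df['Result'] = res.reset_index(drop=True), doesn't work for me: ValueError: Wrong number of items passed 1, indices imply 3

@sparkandshine That's odd. What version of Python are you using? It runs fine for me on 2.7.3.

My Python version is 2.7.6. I've pasted the traceback into my answer.

@sparkandshine Updated now, see what happens.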