Cross-referencing DataFrames in Python with apply (python, pandas, apply)

So I have some data, like:

a.csv:

id, ..., name
1234, ..., R
1235, ..., Python
1236, ..., Panda
... etc

b.csv:

id, ..., amount
1234, ..., 1
1234, ..., 1
1234, ..., 2
...
1236, ..., 1
1236, ..., 1

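I'm trying to cross-reference the IDs between a.csv and b.csv so that I can add a quantity column to the pandas dataframe for a.csv. For each row, the quantity is the sum of the amount values in b.csv whose ID matches that row's ID.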

I'm trying to use the apply function, like so:

  import pandas as pd
  def itemcounts(row):
      # ok this works?
      # return b[b['id'] == 1234]['amount'].sum()
      # each a['quantity'] gets set to 4 or whatever the sum for 1234 is.

      # and this does?
      # return row['id']
      # a['quantity'] get set to whatever row's 'id' is.

      # but this doesn't
      id = row['id']
      return b[b['id'] == id]['amount'].sum()
      # a['quantity'] is 0.

  a = pd.read_csv('a.csv')
  b = pd.read_csv('b.csv')
  a['quantity'] = a.apply(itemcounts, axis=1)

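But as noted in the code comments, I can't get apply to find the matching rows in b and return their sum. I think I'm missing something fundamental about Python or pandas here.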

I tried casting row['id'] to int inside itemcounts, but still no luck.

Try the following:

df = pd.DataFrame({'id' : [1234, 1235, 1236], 'name' : ['R', 'Python', 'Pandas']})

     id    name
0  1234       R
1  1235  Python
2  1236  Pandas

df1 = pd.DataFrame({'id' : [1234, 1234, 1234, 1234, 1234, 1235, 1235, 1236], 'amount' : [1, 1, 2, 1, 2, 2, 1, 1]})

   amount    id
0       1  1234
1       1  1234
2       2  1234
3       1  1234
4       2  1234
5       2  1235
6       1  1235
7       1  1236

df['quantity'] = df1.groupby('id').agg(sum).values

     id    name  quantity
0  1234       R         7
1  1235  Python         3
2  1236  Pandas         1
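
The assignment above lines the grouped sums up with df purely by position, so it relies on every id in df also appearing in df1, in the same sorted order. A minimal sketch of a lookup keyed on id instead, assuming that an id with no matching rows should get a quantity of 0. The extra 1237 / 'NumPy' row is made up for illustration and is not part of the original data:

```python
import pandas as pd

# Hypothetical sample data: 1235 and 1237 have no rows in df1.
df = pd.DataFrame({'id': [1234, 1235, 1236, 1237],
                   'name': ['R', 'Python', 'Pandas', 'NumPy']})
df1 = pd.DataFrame({'id': [1234, 1234, 1234, 1236],
                    'amount': [1, 1, 2, 1]})

# One total per id; the result is a Series indexed by id.
totals = df1.groupby('id')['amount'].sum()

# Look each id up in the totals; ids without a match become NaN, then 0.
df['quantity'] = df['id'].map(totals).fillna(0).astype(int)

print(df)
```

Because the lookup goes through the id index rather than row position, the two frames can have different lengths or orderings without raising a length error.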

This script worked for me:

import pandas as pd
a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')

a['Quantity'] = a['id'].apply(lambda x: b[b.id == x].amount.sum())

Using a lambda in apply passes each value of the column to the function as x.

Given a:

    id    name
0  1234       r        
1  1235  Python       
2  1236   Panda

and b:

     id  amount   
0  1234       1
1  1234       1
2  1234       2
3  1236       1
4  1236       1

it returns:

        id    name  Quantity
0     1234       r         4
1     1235  Python         0
2     1236   Panda         2
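
To make explicit what x is in the one-liner above, here is a small sketch that builds the sample frames in memory (instead of reading a.csv and b.csv) and compares the lambda with an equivalent named function. The name total_for is made up for this illustration:

```python
import pandas as pd

# In-memory stand-ins for a.csv and b.csv, using the sample values shown above.
a = pd.DataFrame({'id': [1234, 1235, 1236], 'name': ['r', 'Python', 'Panda']})
b = pd.DataFrame({'id': [1234, 1234, 1234, 1236, 1236],
                  'amount': [1, 1, 2, 1, 1]})

def total_for(x):
    # x is a single id value from a['id'], not a whole row.
    return b.loc[b['id'] == x, 'amount'].sum()

# Series.apply calls the function once per value in the column.
with_lambda = a['id'].apply(lambda x: b[b.id == x].amount.sum())
with_named = a['id'].apply(total_for)

print(with_lambda.tolist())
print(with_named.tolist())
```

An id with no rows in b (1235 here) yields 0, because the sum of an empty selection is 0.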

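Does it have to use apply? I have a solution that doesn't.

Thanks for the help on this. It turns out I was a victim of dirty data. Both files have roughly 1000 rows, but the IDs don't line up in all cases. That solution gave me "ValueError: Length of values does not match length of index" because, as noted, my data is dirty. That was very valuable for pinning down the problem.

Thanks, everyone; both my solution and yours work. My problem was that a.csv has 1000 rows and b.csv has 1000 rows, but each row of a.csv has a single ID, while each ID can have up to 200 rows in b.csv. So when I saw "quantity: 0" for all of the (visible) results, I naturally assumed I was being stupid rather than that my data was at fault.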
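
Since the root cause turned out to be ids that do not line up between the two files, a quick check like the one below may help surface that before computing any sums. This is only a sketch on made-up in-memory data (the 9999 row is hypothetical), not the poster's actual files:

```python
import pandas as pd

# Hypothetical data: 1235 exists only in a, 9999 exists only in b.
a = pd.DataFrame({'id': [1234, 1235, 1236], 'name': ['R', 'Python', 'Panda']})
b = pd.DataFrame({'id': [1234, 1234, 1234, 1236, 1236, 9999],
                  'amount': [1, 1, 2, 1, 1, 5]})

# ids in a that never occur in b: these rows will get a quantity of 0.
only_in_a = a.loc[~a['id'].isin(b['id']), 'id'].tolist()

# ids in b that never occur in a: these amounts are never counted.
only_in_b = b.loc[~b['id'].isin(a['id']), 'id'].unique().tolist()

# A dtype mismatch (for example int64 vs object) would also make ids fail to match.
print(a['id'].dtype, b['id'].dtype)
print('only in a:', only_in_a)
print('only in b:', only_in_b)
```

If only_in_a contains most of the ids, a column full of zeros is what the lookup should be expected to return.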