Python: How to identify string changes in a row compared with the previous row in a DataFrame?


I have a pandas DataFrame:

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}]
df = pd.DataFrame(inp)
print (df)
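If the string value of a row changes compared with the previous row, I want to flag it in a separate column, "Cng-Address"; if the numeric value of the row changes, I want to flag it in the "Cng-Year" column. If there is no change, the flag should be zero.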

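The index is "Name", which means the calculation above should be done over all rows associated with one person's name. When "Name" changes (i.e. John becomes Steve), the calculation of "Cng-Address" and "Cng-Year" should reset. The Year column is sorted in ascending order.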

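As a final report, I would like to get:

  • John changed year "1" time and changed location "2" times
  • Steve changed year "2" times and changed location "2" times
  • The total number of address changes in 2019 is "2"
Current output: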

+-------+------+---------------+
| Name  | Year | Address       |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2018 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Beverly hills |
+-------+------+---------------+
| John  | 2019 | Orange county |
+-------+------+---------------+
| John  | 2019 | New York      |
+-------+------+---------------+
| Steve | 2018 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2019 | Canada        |
+-------+------+---------------+
| Steve | 2020 | California    |
+-------+------+---------------+
| Steve | 2020 | Canada        |
+-------+------+---------------+
Desired output:

+-------+------+---------------+----------+-------------+
| Name  | Year | Address       | Cng-Year | Cng-Address |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2018 | Beverly hills | 0        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Beverly hills | 1        | 0           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | Orange county | 0        | 1           |
+-------+------+---------------+----------+-------------+
| John  | 2019 | New York      | 0        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2018 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 1        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada        | 0        | 0           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | California    | 1        | 1           |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | Canada        | 0        | 1           |
+-------+------+---------------+----------+-------------+

You can use rolling and check whether each value is equal to the one above it:

df['Cng-Year'] = df.groupby('Name')['Year'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
df['Cng-Address'] = df.groupby('Name')['Address'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
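Rolling windows in pandas are designed for numeric data, so the Address line above may raise an error on a string column. One possible workaround, offered only as a sketch and not as part of the original answer, is to map the strings to integer codes with pd.factorize and then take a per-name diff. The name address_codes is introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the question, written column-wise.
df = pd.DataFrame({
    'Name': ['John'] * 5 + ['Steve'] * 5,
    'Year': [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    'Address': ['Beverly hills', 'Beverly hills', 'Beverly hills',
                'Orange county', 'New York',
                'Canada', 'Canada', 'Canada', 'California', 'Canada'],
})

# Numeric column: a non-zero difference from the previous row
# within the same Name counts as a change.
df['Cng-Year'] = df.groupby('Name')['Year'].diff().fillna(0).ne(0).astype(int)

# String column: replace each distinct address with an integer code first.
# The codes only distinguish values; their order has no meaning.
address_codes = pd.Series(pd.factorize(df['Address'])[0], index=df.index)
df['Cng-Address'] = (
    address_codes.groupby(df['Name']).diff().fillna(0).ne(0).astype(int)
)

print(df)
```

The first row of each name has no previous row, so its diff is NaN and fillna(0) marks it as unchanged.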
You can use shift to compare each row with the previous row:

df["Cng-Year"] = ((df["Year"] != df["Year"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
df["Cng-Address"] = ((df["Address"] != df["Address"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
#df[['Cng-Year','Cng-Address']]=df[['Cng-Year','Cng-Address']].replace(True,1).replace(False,0) OR
#df[['Cng-Year','Cng-Address']] = np.where(df[['Cng-Year','Cng-Address']], 1,0)

You can use groupby:

groups = df.groupby('Name')

for col in ['Year', 'Address']:
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)
Output:

    Name  Year        Address  cng-Year  cng-Address
0   John  2018  Beverly hills         0            0
1   John  2018  Beverly hills         0            0
2   John  2019  Beverly hills         1            0
3   John  2019  Orange county         0            1
4   John  2019       New York         0            1
5  Steve  2018         Canada         0            0
6  Steve  2019         Canada         1            0
7  Steve  2019         Canada         0            0
8  Steve  2020     California         1            1
9  Steve  2020         Canada         0            1
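Once the flag columns exist, the final report requested in the question can be obtained by summing the flags. The following is a minimal sketch under that assumption; the names per_name and address_changes_per_year are introduced here for illustration and do not appear in the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John'] * 5 + ['Steve'] * 5,
    'Year': [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    'Address': ['Beverly hills', 'Beverly hills', 'Beverly hills',
                'Orange county', 'New York',
                'Canada', 'Canada', 'Canada', 'California', 'Canada'],
})

# Build the change flags with the groupby/shift approach.
for col in ['Year', 'Address']:
    previous = df.groupby('Name')[col].shift()
    df[f'cng-{col}'] = previous.fillna(df[col]).ne(df[col]).astype(int)

# Number of year changes and address changes per person.
per_name = df.groupby('Name')[['cng-Year', 'cng-Address']].sum()

# Number of address changes per calendar year, across all people.
address_changes_per_year = df.groupby('Year')['cng-Address'].sum()

print(per_name)
print(address_changes_per_year)
```

Because each flag is 0 or 1, a plain sum counts the changes. Looking up a single year, for example address_changes_per_year[2020], gives the total for that year.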

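Does this answer your question?
@AMC I noticed that before posting. The twist in my question is the grouping and re-indexing before the changes are identified.
I have tried this approach, but it is clearly not very efficient for big data! I have 300 million rows, and the code has been running for 6 hours without finishing.
This code works great for me! Could you comment on how to count the total number of address changes between "Canada" and "California" during "2020"?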
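groups[col].shift() shifts the corresponding column by 1 within each name. fillna(df[col]) fills the first row of each (shifted) group with the original value, indicating no change. Finally, ne(df[col]) compares the shifted values with the original values to detect changes.
I asked a new version of the question, including your answer as a partial solution. It would be great if you could take a look!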
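To make the shift, fillna, ne chain easier to follow, the sketch below splits it into its three intermediate steps for the Address column only. The variable names shifted, filled and changed are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John'] * 5 + ['Steve'] * 5,
    'Address': ['Beverly hills', 'Beverly hills', 'Beverly hills',
                'Orange county', 'New York',
                'Canada', 'Canada', 'Canada', 'California', 'Canada'],
})

# Step 1: previous row's value within each Name.
# The first row of each group has no predecessor, so it becomes missing.
shifted = df.groupby('Name')['Address'].shift()

# Step 2: fill those missing first rows with the row's own value,
# so that the first row of a group compares equal to itself.
filled = shifted.fillna(df['Address'])

# Step 3: flag rows whose previous value differs from the current value.
changed = filled.ne(df['Address']).astype(int)

print(pd.DataFrame({
    'Name': df['Name'],
    'Address': df['Address'],
    'shifted': shifted,
    'filled': filled,
    'changed': changed,
}))
```

Printing the steps side by side shows that the flag resets at the first Steve row, because the shift never crosses a group boundary.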