Python: How can I identify string changes in a row compared with the previous row in a DataFrame?
I have a pandas DataFrame:
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}]
df = pd.DataFrame(inp)
print(df)
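If the string value of a row changes compared with the previous row, I want to flag it in a separate column "Cng-Address"; if the numeric value changes, I want to flag it in a "Cng-Year" column. If there is no change, the flag should be zero.
The index is "Name", which means the calculation above should be applied to all rows belonging to the same person. If "Name" changes (i.e. John becomes Steve), the calculation of "Cng-Address" and "Cng-Year" should reset. The Year column is sorted in ascending order.
As a final report, I would like to get:
- John changed year 1 time and changed location 2 times
- Steve changed year 2 times and changed location 2 times
- The total number of address changes in 2019 is 2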
+-------+------+---------------+
| Name | Year | Address |
+-------+------+---------------+
| John | 2018 | Beverly hills |
+-------+------+---------------+
| John | 2018 | Beverly hills |
+-------+------+---------------+
| John | 2019 | Beverly hills |
+-------+------+---------------+
| John | 2019 | Orange county |
+-------+------+---------------+
| John | 2019 | New York |
+-------+------+---------------+
| Steve | 2018 | Canada |
+-------+------+---------------+
| Steve | 2019 | Canada |
+-------+------+---------------+
| Steve | 2019 | Canada |
+-------+------+---------------+
| Steve | 2020 | California |
+-------+------+---------------+
| Steve | 2020 | Canada |
+-------+------+---------------+
Desired output:
+-------+------+---------------+----------+-------------+
| Name | Year | Address | Cng-Year | Cng-Address |
+-------+------+---------------+----------+-------------+
| John | 2018 | Beverly hills | 0 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2018 | Beverly hills | 0 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2019 | Beverly hills | 1 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2019 | Orange county | 0 | 1 |
+-------+------+---------------+----------+-------------+
| John | 2019 | New York | 0 | 1 |
+-------+------+---------------+----------+-------------+
| Steve | 2018 | Canada | 0 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada | 1 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada | 0 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | California | 1 | 1 |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | Canada | 0 | 1 |
+-------+------+---------------+----------+-------------+
You can use rolling and check whether each value differs from the one above it:
df['Cng-Year'] = df.groupby('Name')['Year'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
df['Cng-Address'] = df.groupby('Name')['Address'].transform(lambda x: x.rolling(2).agg(lambda x: x.iloc[0]!=x.iloc[1]).fillna(0))
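Rolling windows are designed for numeric data, so the Address line above may raise an error on a string column in the pandas versions I am aware of. One possible workaround, shown here as a minimal sketch rather than the answer's own method, is to encode each column as integer codes with pd.factorize before rolling. The helper name rolling_change is hypothetical, and the sample data mirrors the question.

```python
import pandas as pd

# Sample data mirroring the question (rows grouped by Name, sorted by Year).
df = pd.DataFrame({
    "Name": ["John"] * 5 + ["Steve"] * 5,
    "Year": [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    "Address": ["Beverly hills", "Beverly hills", "Beverly hills",
                "Orange county", "New York",
                "Canada", "Canada", "Canada", "California", "Canada"],
})

def rolling_change(values: pd.Series, keys: pd.Series) -> pd.Series:
    """Hypothetical helper: flag rows that differ from the previous row of the same key."""
    # factorize maps every distinct value to an integer code, so strings become numeric.
    codes = pd.Series(pd.factorize(values)[0], index=values.index)
    flags = codes.groupby(keys).transform(
        # raw=True passes a NumPy array holding the two values in the window.
        lambda s: s.rolling(2).apply(lambda w: float(w[0] != w[1]), raw=True).fillna(0)
    )
    # The first row of each group has no predecessor and was filled with 0 ("no change").
    return flags.astype(int)

df["Cng-Year"] = rolling_change(df["Year"], df["Name"])
df["Cng-Address"] = rolling_change(df["Address"], df["Name"])
print(df)
```

This still calls a Python function once per window, so it is unlikely to scale well to very large frames.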
You can use shift to compare each row with the previous one:
df["Cng-Year"] = ((df["Year"] != df["Year"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
df["Cng-Address"] = ((df["Address"] != df["Address"].shift(1)) & (df["Name"] == df["Name"].shift())).astype(int)
#df[['Cng-Year','Cng-Address']]=df[['Cng-Year','Cng-Address']].replace(True,1).replace(False,0) OR
#df[['Cng-Year','Cng-Address']] = np.where(df[['Cng-Year','Cng-Address']], 1,0)
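The same shift comparison can be wrapped in a small reusable function. This is only a sketch: the name flag_changes and its parameters are hypothetical, and it assumes that all rows of a given name are contiguous, as in the question's data. A change is counted only when the previous row belongs to the same name, which is what resets the flags at the John/Steve boundary.

```python
import pandas as pd

def flag_changes(frame, key, columns, prefix="Cng-"):
    """Hypothetical helper: add one 0/1 change flag per column, reset whenever `key` changes."""
    out = frame.copy()
    # True only where the previous row has the same key (always False on the first row).
    same_key = out[key].eq(out[key].shift())
    for col in columns:
        changed = out[col].ne(out[col].shift())
        out[prefix + col] = (changed & same_key).astype(int)
    return out

# Sample data mirroring the question.
df = pd.DataFrame({
    "Name": ["John"] * 5 + ["Steve"] * 5,
    "Year": [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    "Address": ["Beverly hills", "Beverly hills", "Beverly hills",
                "Orange county", "New York",
                "Canada", "Canada", "Canada", "California", "Canada"],
})

result = flag_changes(df, key="Name", columns=["Year", "Address"])
print(result)
```

Because every step is a column-wise operation with no Python-level loop over rows, this form is usually a better fit for large frames than a per-window lambda.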
You can use groupby:
groups = df.groupby('Name')
for col in ['Year', 'Address']:
    # Shift within each name, treat each group's first row as unchanged, then compare.
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)
Output:
    Name  Year        Address  cng-Year  cng-Address
0   John  2018  Beverly hills         0            0
1   John  2018  Beverly hills         0            0
2   John  2019  Beverly hills         1            0
3   John  2019  Orange county         0            1
4   John  2019       New York         0            1
5  Steve  2018         Canada         0            0
6  Steve  2019         Canada         1            0
7  Steve  2019         Canada         0            0
8  Steve  2020     California         1            1
9  Steve  2020         Canada         0            1
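Once the flag columns exist, the final report requested in the question could be produced with ordinary aggregations. The sketch below rebuilds the output table above by hand so that it runs on its own; the variable names per_name and address_changes_2019 are illustrative.

```python
import pandas as pd

# The flagged frame from the output above, rebuilt by hand so the example is self-contained.
flags = pd.DataFrame({
    "Name": ["John"] * 5 + ["Steve"] * 5,
    "Year": [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    "cng-Year": [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    "cng-Address": [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})

# Number of year changes and address changes per person.
per_name = flags.groupby("Name")[["cng-Year", "cng-Address"]].sum()
print(per_name)

# Total number of address changes that happened in rows dated 2019.
address_changes_2019 = int(flags.loc[flags["Year"] == 2019, "cng-Address"].sum())
print(address_changes_2019)
```

Grouping by "Year" instead of filtering would give the same total for every year at once.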
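Does this answer your question?
@AMC I noticed that before posting. The twist in my question is the grouping and re-indexing that has to happen before the changes are identified.
I have tried this approach, but it is apparently not very efficient for large data! I have 300 million rows, and the code ran for 6 hours without finishing.
This code works great for me! Could you explain how to count the total number of address changes between "Canada" and "California" during "2020"?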
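groups[col].shift() shifts the corresponding column by 1 within each name. fillna(df[col]) fills the first row of each (shifted) group with the original value, which means "no change". Finally, ne(df[col]) compares the shifted values with the original values to detect changes.
I asked a new version of the question that includes your answer as a partial solution. It would be great if you could take a look!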
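The follow-up about counting changes between "Canada" and "California" in 2020 could be approached with the same per-name shift. The sketch below is one possible reading of that request and is not taken from the thread: it counts a change in either direction, and only when the row where the new address appears is dated 2020. The variable names are illustrative.

```python
import pandas as pd

# Sample data mirroring the question.
df = pd.DataFrame({
    "Name": ["John"] * 5 + ["Steve"] * 5,
    "Year": [2018, 2018, 2019, 2019, 2019, 2018, 2019, 2019, 2020, 2020],
    "Address": ["Beverly hills", "Beverly hills", "Beverly hills",
                "Orange county", "New York",
                "Canada", "Canada", "Canada", "California", "Canada"],
})

# Previous address within each name; NaN on the first row of every group.
prev_address = df.groupby("Name")["Address"].shift()

# Assumption: a move counts in either direction between these two places.
pair = ["Canada", "California"]
mask = (
    df["Year"].eq(2020)
    & df["Address"].isin(pair)
    & prev_address.isin(pair)        # NaN is not in the pair, so first rows drop out
    & df["Address"].ne(prev_address)
)

changes_2020 = int(mask.sum())
print(df[mask])
print(changes_2020)
```

For the sample data, both of Steve's 2020 rows match: Canada to California, then California back to Canada.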