Mutate with lag in a Python DataFrame
I have some R code that uses mutate and lag, and I would like to replicate it in pandas. Here is the data. Edited: to include the need for a group-by and the index.
Name Date_x
0 American 2009-10-31
1 American 2009-09-22
2 Zydaco 2009-09-26
3 American 2009-04-17
4 American 2009-02-18
5 American 2009-02-03
6 American 2009-01-16
7 Catalina 2009-09-02
8 Zydaco 2009-08-29
9 Zydaco 2009-08-15
10 Zydaco 2009-06-26
11 Zydaco 2009-10-27
12 Zydaco 2009-10-13
13 Zydaco 2009-04-04
Here is the R code:
test<- test %.% #need dplyr %.%
group_by(name) %.%
mutate(Date_y = lag(Date_x, 1))
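Here is an attempt to produce the output using .shift. This seems more efficient, but the output is incorrect.
Edit: demonstrating the problem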
test['Date_y'] = test.groupby(['Name'])['Date_x'].shift(-1)
test.sort(['Name', 'Date_x'], ascending=[1, 0])
Name Date_x Date_y
American 2009-10-31 2009-09-22
American 2009-09-22 2009-04-17
American 2009-04-17 2009-02-18
American 2009-02-18 2009-02-03
American 2009-02-03 2009-01-16
American 2009-01-16 NaN
Catalina 2009-09-02 NaN
Zydaco 2009-10-27 2009-10-13
Zydaco 2009-10-13 2009-04-04
Zydaco 2009-09-26 2009-08-29
Zydaco 2009-08-29 2009-08-15
Zydaco 2009-08-15 2009-06-26
Zydaco 2009-06-26 2009-10-27
Zydaco 2009-04-04 NaN
What is the best way to achieve this? I would like to use .shift if it works.
Or is there a better way?
This row is incorrect:
Zydaco 2009-06-26 2009-10-27
This reproduces the error:
import pandas as pd

# Build the example frame: company names in their original, unsorted order.
df = pd.Series(['American', 'American', 'Zydaco', 'American', 'American', 'American', 'American', 'Catalina',
                'Zydaco', 'Zydaco', 'Zydaco', 'Zydaco', 'Zydaco', 'Zydaco'])
df = pd.DataFrame(df)
df.columns = ['names']
df['date_x'] = pd.Series(['2009-10-31', '2009-09-22', '2009-09-26', '2009-04-17', '2009-02-18', '2009-02-03', '2009-01-16', '2009-09-02', '2009-08-29', '2009-08-15', '2009-06-26', '2009-10-27', '2009-10-13', '2009-04-04'])
# Shift within each name; rows are taken in their current order.
df['date_y'] = df.groupby(['names'])['date_x'].shift(-1)
# Keep only the Zydaco rows to show the problem.
mask = df['names'] == "Zydaco"
df = df[mask]
df['date_x'] = pd.to_datetime(df['date_x'])
# Display the rows ordered by date_x.
df.groupby('date_x').apply(lambda d: d.sort()).reset_index('date_x',drop=True)
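date_x runs from the earliest date to the most recent. It seems that shift does not use the date order; it shifts using the index order instead.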
names date_x date_y
13 Zydaco 2009-04-04 NaN
10 Zydaco 2009-06-26 2009-10-27
9 Zydaco 2009-08-15 2009-06-26
8 Zydaco 2009-08-29 2009-08-15
2 Zydaco 2009-09-26 2009-08-29
12 Zydaco 2009-10-13 2009-04-04
11 Zydaco 2009-10-27 2009-10-13
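The observation above can be checked with a small sketch. The frame `demo` and its values are made up for illustration and are not part of the original post. It suggests that a grouped shift follows the order in which rows are stored, not the order of the dates.

```python
import pandas as pd

# Illustrative data (not from the question): one group whose rows are
# deliberately stored out of date order.
demo = pd.DataFrame({
    "names": ["Zydaco", "Zydaco", "Zydaco"],
    "date_x": pd.to_datetime(["2009-09-26", "2009-04-04", "2009-10-27"]),
})

# shift(-1) takes the value from the next row as stored within each group;
# it does not inspect the dates to decide which row counts as "next".
demo["next_stored"] = demo.groupby("names")["date_x"].shift(-1)

print(demo)
```

The first row receives 2009-04-04 simply because that is the next stored row, even though it is not the neighbouring date.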
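Your data is not sorted to begin with, so it is shifted in that unsorted order. If you want to shift it in sorted order, sort it first, before the groupby. For example:
In [49]: test['Date_y'] = test.sort('Date_x', ascending=False).groupby(['Name'])['Date_x'].shift(-1)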
In [50]: test.sort(['Name', 'Date_x'], ascending=[1, 0])
Out[50]:
Name Date_x Date_y
i
0 American 2009-10-31 2009-09-22
1 American 2009-09-22 2009-04-17
3 American 2009-04-17 2009-02-18
4 American 2009-02-18 2009-02-03
5 American 2009-02-03 2009-01-16
6 American 2009-01-16 NaN
7 Catalina 2009-09-02 NaN
11 Zydaco 2009-10-27 2009-10-13
12 Zydaco 2009-10-13 2009-09-26
2 Zydaco 2009-09-26 2009-08-29
8 Zydaco 2009-08-29 2009-08-15
9 Zydaco 2009-08-15 2009-06-26
10 Zydaco 2009-06-26 2009-04-04
13 Zydaco 2009-04-04 NaN
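The same idea can be sketched as a standalone script. This is a hedged illustration, not the answerer's code: it assumes a newer pandas in which `DataFrame.sort_values` has replaced the `DataFrame.sort` call used in this thread, and it rebuilds the question's data by hand.

```python
import pandas as pd

# Data copied from the question; the default index matches the row numbers shown there.
test = pd.DataFrame({
    "Name": ["American", "American", "Zydaco", "American", "American",
             "American", "American", "Catalina", "Zydaco", "Zydaco",
             "Zydaco", "Zydaco", "Zydaco", "Zydaco"],
    "Date_x": pd.to_datetime([
        "2009-10-31", "2009-09-22", "2009-09-26", "2009-04-17", "2009-02-18",
        "2009-02-03", "2009-01-16", "2009-09-02", "2009-08-29", "2009-08-15",
        "2009-06-26", "2009-10-27", "2009-10-13", "2009-04-04"]),
})

# Sort newest-first, shift within each Name, then assign back.
# The assignment aligns on the index, so `test` keeps its original row order.
test["Date_y"] = (
    test.sort_values("Date_x", ascending=False)
        .groupby("Name")["Date_x"]
        .shift(-1)
)

# Display grouped by name, newest date first, as in the output above.
print(test.sort_values(["Name", "Date_x"], ascending=[True, False]))
```

Because the shifted Series keeps the original index labels, sorting only affects which neighbour each row sees, not where the result is stored.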
I'm not sure exactly how you got your result (a fully runnable example would help), but if I run something similar, I get:
In [26]: s="""Name Date_x Rank
....: American 2009-10-31 6
....: American 2009-09-22 5
....: American 2009-04-17 4
....: American 2009-02-18 3
....: American 2009-02-03 2
....: American 2009-01-16 1
....: Catalina 2009-09-02 1
....: Zydaco 2009-10-27 7
....: Zydaco 2009-10-13 6
....: Zydaco 2009-09-26 5
....: Zydaco 2009-08-29 4
....: Zydaco 2009-08-15 3
....: Zydaco 2009-06-26 2
....: Zydaco 2009-04-04 1"""
In [27]: test = pd.read_csv(StringIO(s), delim_whitespace=True)
In [29]: test['Date_y'] = test.groupby(['Name'])['Date_x'].shift(-1)
In [30]: test
Out[30]:
Name Date_x Rank Date_y
0 American 2009-10-31 6 2009-09-22
1 American 2009-09-22 5 2009-04-17
2 American 2009-04-17 4 2009-02-18
3 American 2009-02-18 3 2009-02-03
4 American 2009-02-03 2 2009-01-16
5 American 2009-01-16 1 NaN
6 Catalina 2009-09-02 1 NaN
7 Zydaco 2009-10-27 7 2009-10-13
8 Zydaco 2009-10-13 6 2009-09-26
9 Zydaco 2009-09-26 5 2009-08-29
10 Zydaco 2009-08-29 4 2009-08-15
11 Zydaco 2009-08-15 3 2009-06-26
12 Zydaco 2009-06-26 2 2009-04-04
13 Zydaco 2009-04-04 1 NaN
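Is this what you want? Or what is wrong with it?
Note that in this case you don't need the groupby, because there is only one name in the name column, but I guess that is because you simplified the example.
I edited the question to include other names, so the groupby is needed. I think the problem has been demonstrated.
Which version of pandas are you using? I updated the above with your new data. Is there something wrong with the output I get? (It seems correct to me.)
I see what you ran. Do you get something different?
My pandas is '0.14.1'. What you get is correct. When I run it, I get something else. See what I get at the end of the question. I added the index to the test data. That must be the problem.
If I run my code with 0.14.1, I get exactly the same result as posted above (which was run with 0.15.1). Please provide a fully reproducible example that shows the error.
Your dataframe has a Date_x column, but in the groupby you use Date. Is that a typo?
Yes, I fixed it. Thanks.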
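For readers comparing this with the R code, a note on direction: dplyr's `lag(x, 1)` returns the previous row's value, which corresponds to `shift(1)` in pandas, while `shift(-1)` corresponds to `lead`. The thread's `shift(-1)` on newest-first data and a `shift(1)` on oldest-first data should pick out the same date for each row, namely the one just before it chronologically. A minimal sketch, assuming a newer pandas with `sort_values` and using a hypothetical frame `demo` with no duplicate dates:

```python
import pandas as pd

# Illustrative data (not from the question), stored out of date order.
demo = pd.DataFrame({
    "Name": ["Zydaco", "American", "Zydaco", "American", "Zydaco"],
    "Date_x": pd.to_datetime(["2009-09-26", "2009-10-31", "2009-04-04",
                              "2009-09-22", "2009-10-27"]),
})

# Oldest-first, then shift(1): each row receives the previous (earlier) date.
lag_asc = demo.sort_values("Date_x").groupby("Name")["Date_x"].shift(1)

# Newest-first, then shift(-1): each row receives the next stored row,
# which is also the earlier date.
lead_desc = (demo.sort_values("Date_x", ascending=False)
                 .groupby("Name")["Date_x"].shift(-1))

# Both Series carry the original index labels, so they can be compared
# once they are put in the same index order.
print(lag_asc.sort_index().equals(lead_desc.sort_index()))
```

Either form works; which one reads better depends on whether the frame is naturally kept oldest-first or newest-first.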