Python: 3-year appreciation of a house price index by state and county
I have a dataset in which I want to find the three-year appreciation in hpi. Note that hpi is at tract level and the years range from 2012 to 2018. The dataset contains all states and counties and is much larger than the sample I showed. I would like to use some kind of group-by lambda function, like the one I used when I wanted the median of hpi by year, state, and county:
medians = (all_data.groupby(['Year', 'state', 'County_name'])['hpi']
.transform(lambda x: x.median() if x.notnull().any() else np.nan)
)
all_data['hpi'] = all_data['hpi'].fillna(medians)
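However, I have not been able to adapt the code above for this purpose. Any suggestions would be greatly appreciated.

I added a second county to your data and made up index values for Barbour County's HPI: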
import pandas as pd
state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}
df = pd.DataFrame(data)
# Sorting is necessary.
df = df.sort_values(['State', 'County', 'Year'])
print(df)
    Year    State          County    HPI
0   2012  Alabama   Baldin County  125.0
1   2013  Alabama   Baldin County  130.0
2   2014  Alabama   Baldin County  127.5
3   2015  Alabama   Baldin County  142.0
4   2016  Alabama   Baldin County  160.0
5   2017  Alabama   Baldin County  139.0
6   2012  Alabama  Barbour County   98.0
7   2013  Alabama  Barbour County  108.0
8   2014  Alabama  Barbour County  102.0
9   2015  Alabama  Barbour County  115.0
10  2016  Alabama  Barbour County  118.0
11  2017  Alabama  Barbour County  114.0
From there, we shift "HPI" by three rows and divide the current value by the shifted one, which gives the figure you are looking for:
df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)
print(df)
    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0                  NaN
1   2013  Alabama   Baldin County  130.0                  NaN
2   2014  Alabama   Baldin County  127.5                  NaN
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.690141
7   2013  Alabama  Barbour County  108.0             0.675000
8   2014  Alabama  Barbour County  102.0             0.733813
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647
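However, you now have NaN in the first three rows, and the first three years of every later county hold incorrect values, because shift(3) runs over the whole column and pulls them from the previous county. To correct this, we group by state and county, use head(3) to get the first three years of each group, take their index values, and set those rows to zero: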
df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
print(df)
    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0             0.000000
1   2013  Alabama   Baldin County  130.0             0.000000
2   2014  Alabama   Baldin County  127.5             0.000000
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.000000
7   2013  Alabama  Barbour County  108.0             0.000000
8   2014  Alabama  Barbour County  102.0             0.000000
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647
The full code is:
import pandas as pd
state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}
df = pd.DataFrame(data)
df = df.sort_values(['State', 'County', 'Year'])
df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)
df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
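As a variation on the full code above, here is a minimal sketch that shifts inside each (State, County) group, so a county never borrows values from the one before it and no correction step is needed. It assumes one row per county and year with no gaps. It also leaves the first three years of each county as NaN rather than 0, which is a choice made for this sketch and not part of the answer above. The column name `appreciation_3y` is made up for illustration.

```python
import pandas as pd

# Same made-up data as in the answer above.
df = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016, 2017] * 2,
    "State": ["Alabama"] * 12,
    "County": ["Baldin County"] * 6 + ["Barbour County"] * 6,
    "HPI": [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114],
})

# Rows must be in year order inside each county for shift(3) to mean "3 years ago".
df = df.sort_values(["State", "County", "Year"])

# shift(3) restarts in every group, so Barbour County never sees Baldin County values.
hpi_3y_ago = df.groupby(["State", "County"])["HPI"].shift(3)

# Ratio of current HPI to the HPI three rows earlier; NaN where no earlier row exists.
df["appreciation_3y"] = df["HPI"] / hpi_3y_ago
print(df)
```

Keeping NaN for the years without a three-year history means later aggregations such as mean() or median() skip those rows instead of treating them as a ratio of zero.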
Here is a minimal example:
import pandas as pd
# create data
data = {"Year": [2010, 2011, 2012, 2013, 2014]*2,
"State": ["Bama", "Bama", "Bama", "Bama", "Bama",
"NY", "NY", "NY", "NY", "NY"],
"hpi": [100, 105, 110, 115, 120]*2}
data = pd.DataFrame.from_dict(data)
# Create column with 3y shifted hpi
data["hpi_3y"] = data.groupby(["State"])["hpi"].shift(3)
# compute your appreciation value from the columns
data["3y_appreciation"] = 100 + ((data["hpi"] / data["hpi_3y"] - 1) * 100)
data
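Basically, you group by all relevant columns (excluding the year). Then you shift the values in the hpi column by 3 rows, i.e. 3 years. After that, every observation has its hpi and the corresponding hpi_3y in the same row, and the calculation can be done directly.
Output: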
| Year | State | hpi | hpi_3y | 3y_appreciation |
|-------:|:--------|------:|---------:|------------------:|
| 2010 | Bama | 100 | nan | nan |
| 2011 | Bama | 105 | nan | nan |
| 2012 | Bama | 110 | nan | nan |
| 2013 | Bama | 115 | 100 | 115 |
| 2014 | Bama | 120 | 105 | 114.286 |
| 2010 | NY | 100 | nan | nan |
| 2011 | NY | 105 | nan | nan |
| 2012 | NY | 110 | nan | nan |
| 2013 | NY | 115 | 100 | 115 |
| 2014 | NY | 120 | 105 | 114.286 |
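Since the HPI in the question is at tract level, the same idea can probably be adapted by adding the tract identifier to the group keys. The following is only a sketch under assumptions: the column name `tract` is hypothetical (the question does not show the real one), the values are made up, and the county-level median at the end is one possible roll-up rather than something the question asked for.

```python
import pandas as pd

# Made-up tract-level data: two tracts in one county, 2012-2016.
data = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016] * 2,
    "state": ["Alabama"] * 10,
    "County_name": ["Barbour County"] * 10,
    "tract": ["T1"] * 5 + ["T2"] * 5,  # hypothetical tract identifier
    "hpi": [100, 105, 110, 115, 120, 200, 210, 220, 230, 250],
})

# Every column that identifies one time series goes into the keys; Year does not.
keys = ["state", "County_name", "tract"]
data = data.sort_values(keys + ["Year"])

# HPI of the same tract three rows earlier (three years, if no year is missing).
data["hpi_3y"] = data.groupby(keys)["hpi"].shift(3)
data["3y_appreciation"] = 100 + (data["hpi"] / data["hpi_3y"] - 1) * 100

# Optional roll-up: median tract appreciation per county and year.
county_median = (
    data.groupby(["Year", "state", "County_name"])["3y_appreciation"]
        .median()
        .reset_index()
)
print(county_median)
```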
So when we sort, the other columns don't get misaligned, right?

Right. I sort in the order given in the brackets, which the solution requires: state first, county second, then year, and the HPI values follow along. Sorted this way, we can shift() and know we are getting the right values. Of course, if a year is missing, this breaks.

No years are missing, but when I apply the sorting step I don't get the data in the format we need. I still have 2012 instead of 2013, etc.

I think that, given the dataset I am working with, your solution won't work. That's why I think a lambda function is necessary.

Why don't we include the year in the groupby?

Because groupby gives you the unique combinations of the columns you specify. In my example you would get one row per year and state, which is not what you want. You want each unique state with all of its years, so that hpi_3y can be computed in a meaningful way.

Did you see my latest update about the data? I ask because my data is much more granular than the example I gave.

Well, your data is really hard to read. Instead, you should give us a chance. You can also print the pandas DataFrame in a more readable format. And if you have understood my simple example correctly, you should be able to adapt it to more complex cases. In most cases, adding the appropriate columns to the groupby function is enough.
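The comments point out that shift(3) breaks when a year is missing, because it counts rows rather than calendar years. One way around this, sketched here with made-up data, is to look up the earlier value by year with a self-merge instead of a shift. This technique is swapped in for illustration and is not taken from either answer. The name `prev` is illustrative only.

```python
import pandas as pd

# Made-up data: 2013 is deliberately missing for "Bama".
data = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2014, 2010, 2011, 2012, 2013, 2014],
    "State": ["Bama"] * 4 + ["NY"] * 5,
    "hpi": [100, 105, 110, 120, 100, 105, 110, 115, 120],
})

# Copy of the data relabelled so each row carries the year it is "3 years before".
prev = data[["Year", "State", "hpi"]].rename(columns={"hpi": "hpi_3y"})
prev["Year"] = prev["Year"] + 3

# The left merge matches on the calendar year, so a gap yields NaN, not a wrong row.
result = data.merge(prev, on=["State", "Year"], how="left")
result["3y_appreciation"] = 100 + (result["hpi"] / result["hpi_3y"] - 1) * 100
print(result)
```

With a plain groupby shift, the 2014 row for "Bama" would be paired with 2010. Here it is paired with 2011, the year that is actually three years earlier.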