Python: 3-year appreciation of the house price index (HPI) by state and county


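I have a dataset that looks like this:

I want to find the three-year appreciation in hpi. Note that hpi is at the tract level and the years range from 2012 to 2018.

The dataset contains all states and counties and is much larger than what I have shown. I would like to use some kind of groupby with a lambda function, as I did when I computed the median hpi by year, state, and county: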

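import numpy as np

# Per (Year, state, County_name) median of hpi, aligned to the original rows;
# groups with no non-null hpi yield NaN.
medians = (all_data.groupby(['Year', 'state', 'County_name'])['hpi']
             .transform(lambda x: x.median() if x.notnull().any() else np.nan)
          )
# Fill missing hpi values with their group's median.
all_data['hpi'] = all_data['hpi'].fillna(medians)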

But I have not been able to adapt the code above for this purpose. Any suggestions would be greatly appreciated.

I have added a second county to your data and made up HPI values for Barbour County:

import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)

# Sorting is necessary: shift() works on row position, so each county's
# rows must be contiguous and in year order.
df = df.sort_values(['State', 'County', 'Year'])
print(df)

    Year    State          County    HPI
0   2012  Alabama   Baldin County  125.0
1   2013  Alabama   Baldin County  130.0
2   2014  Alabama   Baldin County  127.5
3   2015  Alabama   Baldin County  142.0
4   2016  Alabama   Baldin County  160.0
5   2017  Alabama   Baldin County  139.0
6   2012  Alabama  Barbour County   98.0
7   2013  Alabama  Barbour County  108.0
8   2014  Alabama  Barbour County  102.0
9   2015  Alabama  Barbour County  115.0
10  2016  Alabama  Barbour County  118.0
11  2017  Alabama  Barbour County  114.0
From here, we shift the HPI column by three rows and divide to get the result you are looking for:

df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0                  NaN
1   2013  Alabama   Baldin County  130.0                  NaN
2   2014  Alabama   Baldin County  127.5                  NaN
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.690141
7   2013  Alabama  Barbour County  108.0             0.675000
8   2014  Alabama  Barbour County  102.0             0.733813
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647
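However, you now have NaNs at the start, and the first three years of each subsequent county hold incorrect values, because shift(3) reaches back into the previous county's rows (for example, Barbour County's 2012 value is 98 / 142 = 0.690141, which uses Baldin County's 2015 HPI). To fix this, we group by state and county, use head(3) to retrieve the first three years of each group, take their index values, and set those rows to zero: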

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0             0.000000
1   2013  Alabama   Baldin County  130.0             0.000000
2   2014  Alabama   Baldin County  127.5             0.000000
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.000000
7   2013  Alabama  Barbour County  108.0             0.000000
8   2014  Alabama  Barbour County  102.0             0.000000
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647
The full code is:

import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)
df = df.sort_values(['State', 'County', 'Year'])

df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
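
As a possible variation on the full code above (a sketch, not the answerer's method), the shift can be done inside each state/county group. A county then never reads another county's HPI, so no clean-up step is needed, and the first three years of each group stay NaN rather than being set to zero. The data is the same made-up data used above.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": list(range(2012, 2018)) * 2,
    "State": ["Alabama"] * 12,
    "County": ["Baldin County"] * 6 + ["Barbour County"] * 6,
    "HPI": [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114],
})

# Row order inside each group still matters, so sort by year within county.
df = df.sort_values(["State", "County", "Year"])

# Shift within each (State, County) group: the value three rows earlier
# in the same county, or NaN when the county has no such row.
hpi_3y_ago = df.groupby(["State", "County"])["HPI"].shift(3)

df["3 year appreciation"] = df["HPI"] / hpi_3y_ago
print(df)
```

Leaving the undefined years as NaN keeps them out of later means or medians, whereas a zero would be counted as a real ratio.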

Here is a minimal example:

import pandas as pd

# create data
data = {"Year": [2010, 2011, 2012, 2013, 2014]*2,
        "State": ["Bama", "Bama", "Bama", "Bama", "Bama",
                  "NY", "NY", "NY", "NY", "NY"],
        "hpi": [100, 105, 110, 115, 120]*2}
data = pd.DataFrame.from_dict(data)

# Create column with 3y shifted hpi
data["hpi_3y"] = data.groupby(["State"])["hpi"].shift(3)
# compute your appreciation value from the columns
data["3y_appreciation"] = 100 + ((data["hpi"] / data["hpi_3y"] - 1) * 100)
data
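Basically, you group by all relevant columns (excluding the year). Then you shift the values in the hpi column by 3 rows = 3 years. After that, every observation has the corresponding hpi and hpi_3y in the same row, and the calculation can be done directly.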

Output:

|   Year | State   |   hpi |   hpi_3y |   3y_appreciation |
|-------:|:--------|------:|---------:|------------------:|
|   2010 | Bama    |   100 |      nan |           nan     |
|   2011 | Bama    |   105 |      nan |           nan     |
|   2012 | Bama    |   110 |      nan |           nan     |
|   2013 | Bama    |   115 |      100 |           115     |
|   2014 | Bama    |   120 |      105 |           114.286 |
|   2010 | NY      |   100 |      nan |           nan     |
|   2011 | NY      |   105 |      nan |           nan     |
|   2012 | NY      |   110 |      nan |           nan     |
|   2013 | NY      |   115 |      100 |           115     |
|   2014 | NY      |   120 |      105 |           114.286 |
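
The question says hpi is recorded at tract level, so the same idea can be sketched with more grouping columns. This is only an illustration: the column name `tract`, the tract labels, and the numbers are all made up, and the county-level median at the end is one possible way to summarise the result, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical tract-level data: two tracts in one county, 2012-2016.
data = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016] * 2,
    "State": ["Alabama"] * 10,
    "County": ["Barbour County"] * 10,
    "tract": ["T1"] * 5 + ["T2"] * 5,  # assumed column name
    "hpi": [100, 105, 110, 115, 120, 200, 210, 220, 250, 252],
})

# Group by everything that identifies a series, i.e. all keys except Year.
keys = ["State", "County", "tract"]
data = data.sort_values(keys + ["Year"])

# hpi three rows (= three years) earlier within the same tract.
data["hpi_3y"] = data.groupby(keys)["hpi"].shift(3)
data["3y_appreciation"] = 100 + (data["hpi"] / data["hpi_3y"] - 1) * 100

# Optional summary: median tract-level appreciation per county and year.
county_median = (
    data.groupby(["Year", "State", "County"])["3y_appreciation"]
        .median()
        .dropna()
)
print(county_median)
```

Only the `keys` list changes compared with the minimal example; the shift and the division stay the same.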


So when we sort, the other columns won't get misaligned, right?

Right. I sort in the order given in the brackets, which the solution requires: state first, county second, then year, and the HPI values follow along. Sorting this way lets us shift() and know when we are getting the right value. Of course, if a year is missing there will be a problem.

No years are missing, but when I apply the sorting step I don't get the data in the format we need. I still have 2012 instead of 2013, etc.

I don't think your solution will work with the dataset I'm using. That's why I think a lambda function is necessary.

Why don't we include the year in the groupby?

Because groupby gives you the unique combinations of the columns you specify. In my example you would therefore get one row per year and state. That is not what you want. You want unique states with all of their years, so that hpi_3y can be computed in a meaningful way.

Did you see my latest update about the data? I'm asking because my data is much more granular than the example I gave.

Well, your data is really hard to read. Instead, you should give us a chance. You can also output pandas dataframes in a more readable format, such as […]. Besides, if you understand my simple example correctly, you should be able to adapt it to more complex cases. In most cases it is enough to add the appropriate columns to the groupby function.
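
The comments point out that shifting by row position breaks if a year is missing. One way around that, sketched here under the assumption that each state/county/year combination appears at most once, is to match on the year value itself with a self-merge instead of shift(). The data and the name `HPI_3y_ago` are made up for illustration.

```python
import pandas as pd

# Hypothetical data in which 2013 is missing for the county.
df = pd.DataFrame({
    "Year": [2012, 2014, 2015, 2016, 2017],
    "State": ["Alabama"] * 5,
    "County": ["Barbour County"] * 5,
    "HPI": [98, 102, 115, 118, 114],
})

keys = ["State", "County"]

# Copy of the data relabelled with the year for which it is the base:
# the 2012 HPI becomes the "three years ago" value for 2015, and so on.
base = df[keys + ["Year", "HPI"]].rename(columns={"HPI": "HPI_3y_ago"})
base["Year"] = base["Year"] + 3

# Left merge keeps every original row; rows with no base year get NaN.
out = df.merge(base, on=keys + ["Year"], how="left")
out["3 year appreciation"] = out["HPI"] / out["HPI_3y_ago"]
print(out)
```

Here 2016 gets NaN because 2013 is absent, whereas a positional shift(3) would have silently divided it by the 2012 value.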