Python: 3-year appreciation of a house price index (HPI) by state and county


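I have a dataset that looks like this:

I want to find the three-year appreciation in hpi. Note that hpi is at the tract level and the years range from 2012 to 2018.

The dataset contains all states and counties and is much larger than what I have shown here. I would like to use some kind of groupby with a lambda function, as I did when computing the median hpi by year, state, and county: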

medians = (all_data.groupby(['Year', 'state', 'County_name'])['hpi']
             .transform(lambda x: x.median() if x.notnull().any() else np.nan)
          )
all_data['hpi'] = all_data['hpi'].fillna(medians)

However, I have not been able to adapt the code above for this purpose. Any suggestions would be much appreciated.
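For reference, here is a minimal runnable sketch of the median-fill step from the question. The rows, the `tract` column, and the county name are made up for illustration; only the column names `Year`, `state`, `County_name`, and `hpi` come from the snippet above. It also assumes that passing the string `"median"` to `transform` is an acceptable stand-in for the lambda, since a group with no valid values already yields NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical tract-level rows; only the column names follow the question.
all_data = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2013, 2013],
    "state": ["Alabama"] * 5,
    "County_name": ["County A"] * 5,
    "tract": [1, 2, 3, 1, 2],
    "hpi": [100.0, np.nan, 120.0, np.nan, np.nan],
})

# Median hpi per (Year, state, County_name), broadcast back to every row.
# A group without any valid value gives NaN, so no explicit check is needed.
medians = (all_data.groupby(["Year", "state", "County_name"])["hpi"]
           .transform("median"))
all_data["hpi"] = all_data["hpi"].fillna(medians)
print(all_data)
```

The missing 2012 value is filled with that group's median, while the 2013 rows stay NaN because that group has no valid value to take a median from.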

I added a second county to your data and made up HPI values for Barbour County:

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)

# Sorting is necessary.
df = df.sort_values(['State', 'County', 'Year'])
print(df)

    Year    State          County    HPI
0   2012  Alabama   Baldin County  125.0
1   2013  Alabama   Baldin County  130.0
2   2014  Alabama   Baldin County  127.5
3   2015  Alabama   Baldin County  142.0
4   2016  Alabama   Baldin County  160.0
5   2017  Alabama   Baldin County  139.0
6   2012  Alabama  Barbour County   98.0
7   2013  Alabama  Barbour County  108.0
8   2014  Alabama  Barbour County  102.0
9   2015  Alabama  Barbour County  115.0
10  2016  Alabama  Barbour County  118.0
11  2017  Alabama  Barbour County  114.0

From here, we shift "HPI" by three rows and divide the current value by the shifted one, which gives the result you are looking for:

df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0                  NaN
1   2013  Alabama   Baldin County  130.0                  NaN
2   2014  Alabama   Baldin County  127.5                  NaN
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.690141
7   2013  Alabama  Barbour County  108.0             0.675000
8   2014  Alabama  Barbour County  102.0             0.733813
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647

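However, you now have NaNs at the start, and the first three years of every later county hold incorrect values, because the shift pulls in HPI values from the preceding county. To correct this, we group by state and county, use head(3) to retrieve the first three years of each group, take their index values, and set the appreciation for those rows to zero: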

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
print(df)

    Year    State          County    HPI  3 year appreciation
0   2012  Alabama   Baldin County  125.0             0.000000
1   2013  Alabama   Baldin County  130.0             0.000000
2   2014  Alabama   Baldin County  127.5             0.000000
3   2015  Alabama   Baldin County  142.0             1.136000
4   2016  Alabama   Baldin County  160.0             1.230769
5   2017  Alabama   Baldin County  139.0             1.090196
6   2012  Alabama  Barbour County   98.0             0.000000
7   2013  Alabama  Barbour County  108.0             0.000000
8   2014  Alabama  Barbour County  102.0             0.000000
9   2015  Alabama  Barbour County  115.0             1.173469
10  2016  Alabama  Barbour County  118.0             1.092593
11  2017  Alabama  Barbour County  114.0             1.117647

The full code is:

import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
data = {"Year": year, "State": state, "County": county, "HPI": hpi}

df = pd.DataFrame(data)
df = df.sort_values(['State', 'County', 'Year'])

df["3 year appreciation"] = df.HPI / df['HPI'].shift(3)

df.loc[df.groupby(["State", "County"]).head(3).index, "3 year appreciation"] = 0
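As a possible variation on the code above (not the answerer's exact method), the shift can be done inside each State/County group. That way values never cross county boundaries, and the first three years of each county come out as NaN rather than needing to be overwritten afterwards. The sketch below reuses the same made-up data; the variable name `prev` is introduced only for illustration.

```python
import pandas as pd

state = ["Alabama"] * 12
county = ["Baldin County"] * 6 + ["Barbour County"] * 6
year = [x for y in range(2) for x in range(2012, 2018)]
hpi = [125, 130, 127.5, 142, 160, 139, 98, 108, 102, 115, 118, 114]
df = pd.DataFrame({"Year": year, "State": state, "County": county, "HPI": hpi})

# Sorting by year within each county is still required for a row-based shift.
df = df.sort_values(["State", "County", "Year"])

# Shift within each (State, County) group, so Barbour County never
# picks up values from Baldin County.
prev = df.groupby(["State", "County"])["HPI"].shift(3)
df["3 year appreciation"] = df["HPI"] / prev
print(df)
```

Whether the undefined years should remain NaN or be set to 0 is a judgment call; NaN has the advantage that it cannot be mistaken for a real ratio.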

Here is a minimal example:

import pandas as pd

# create data
data = {"Year": [2010, 2011, 2012, 2013, 2014]*2,
        "State": ["Bama", "Bama", "Bama", "Bama", "Bama",
                  "NY", "NY", "NY", "NY", "NY"],
        "hpi": [100, 105, 110, 115, 120]*2}
data = pd.DataFrame.from_dict(data)

# Create column with 3y shifted hpi
data["hpi_3y"] = data.groupby(["State"])["hpi"].shift(3)
# compute your appreciation value from the columns
data["3y_appreciation"] = 100 + ((data["hpi"] / data["hpi_3y"] - 1) * 100)
data

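Basically, you group by all relevant columns (excluding the year). Then you shift the values in the hpi column by 3 rows, which equals 3 years as long as the rows are sorted by year and no year is missing. After that, every observation has its hpi and the matching hpi_3y in the same row, so the calculation can be done directly.

Output: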

|   Year | State   |   hpi |   hpi_3y |   3y_appreciation |
|-------:|:--------|------:|---------:|------------------:|
|   2010 | Bama    |   100 |      nan |           nan     |
|   2011 | Bama    |   105 |      nan |           nan     |
|   2012 | Bama    |   110 |      nan |           nan     |
|   2013 | Bama    |   115 |      100 |           115     |
|   2014 | Bama    |   120 |      105 |           114.286 |
|   2010 | NY      |   100 |      nan |           nan     |
|   2011 | NY      |   105 |      nan |           nan     |
|   2012 | NY      |   110 |      nan |           nan     |
|   2013 | NY      |   115 |      100 |           115     |
|   2014 | NY      |   120 |      105 |           114.286 |
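To illustrate how this might carry over to tract-level data like the question describes, the sketch below simply adds more columns to the groupby keys. The `County` and `Tract` columns, their values, and the name `3y_appreciation_pct` are assumptions made for the example, and the appreciation is expressed as a plain percentage change rather than the 100-based figure used above. The county-level median at the end is an optional roll-up, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical tract-level data: two tracts in one county, 2012-2016.
years = list(range(2012, 2017))
data = pd.DataFrame({
    "Year": years * 2,
    "State": ["Bama"] * 10,
    "County": ["County A"] * 10,
    "Tract": ["T1"] * 5 + ["T2"] * 5,
    "hpi": [100, 105, 110, 115, 120, 200, 210, 220, 230, 260],
})

keys = ["State", "County", "Tract"]  # every identifying column except Year
data = data.sort_values(keys + ["Year"])

# hpi from three rows (years) earlier within the same tract.
data["hpi_3y"] = data.groupby(keys)["hpi"].shift(3)
data["3y_appreciation_pct"] = (data["hpi"] / data["hpi_3y"] - 1) * 100

# Optional roll-up: median tract appreciation per county and year.
county_median = (data.groupby(["State", "County", "Year"])["3y_appreciation_pct"]
                 .median()
                 .reset_index())
print(county_median)
```

Years without a value three years earlier stay NaN at both the tract and the county level.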

So when we sort, the other columns do not get misaligned, right?

Right. I sort in the order given in the brackets, which the solution requires: state first, county second, then the date, and the HPI values follow along. Sorting this way lets us shift() and know that we are getting the right values. Of course, if a year is missing, there will be a problem.

No years are missing, but when I apply the sorting part, I do not get the data in the format we need. I still have 2012 instead of 2013, etc.

I think your solution will not work for the dataset I am using. That is why I think a lambda function is necessary.

Why don't we include the year in the groupby?

Because groupby gives you the unique combinations of the columns you specify. In my example you would get one group per year and state, which is not what you want. You want unique states with all of their years, so that hpi_3y can be computed in a meaningful way.

Did you see my latest update about the data? The reason I ask is that my data is much more granular than the example I gave.

Well, your data is really hard to read. Instead, you should give us a chance. You can also output pandas DataFrames in a more readable format. Also, if you understand my simple example correctly, you should be able to adapt it to more complex cases. In most cases, it is enough to add the appropriate columns to the groupby function.
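The missing-year caveat raised in the comments can be sidestepped by matching on the year itself instead of on row position. The following is only a sketch of that alternative, not a method proposed in the thread: it joins each row to the row of the same county dated three years earlier. The data is made up (Barbour County deliberately has no 2014 row), and the names `base`, `out`, and `HPI_3y_ago` are introduced for illustration.

```python
import pandas as pd

# Hypothetical data in which Barbour County has no 2014 row.
df = pd.DataFrame({
    "Year":   [2012, 2013, 2014, 2015, 2016, 2012, 2013, 2015, 2016],
    "State":  ["Alabama"] * 9,
    "County": ["Baldin County"] * 5 + ["Barbour County"] * 4,
    "HPI":    [125, 130, 127.5, 142, 160, 98, 108, 115, 118],
})

# Relabel every row with the year it serves as the base for (Year + 3),
# then left-join it back onto the original rows.
base = (df.rename(columns={"HPI": "HPI_3y_ago"})
          .assign(Year=lambda d: d["Year"] + 3))
out = df.merge(base, on=["State", "County", "Year"], how="left")
out["3 year appreciation"] = out["HPI"] / out["HPI_3y_ago"]
print(out)
```

With a row-based shift(3), Barbour County's 2016 value would be divided by its 2012 value because of the gap; the join compares it with 2013 instead, and rows without a matching base year remain NaN.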