Python: how to update columns in pandas using another column
I am trying to create a dataframe that tracks the number of public schools open in each year from 2010 to 2016.
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 0 0 2005 2015.0
2 Active Alameda 0 0 2006 NaN
3 Closed Alameda 0 0 2008 2015.0
4 Active Alameda 0 0 2011 NaN
5 Active Alameda 0 0 2011 NaN
6 Active Alameda 0 0 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 0 0 1980 NaN
9 Active Alameda 0 0 1980 NaN
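I want to update the 2010-2016 columns to track which schools were open in each year. For example, the first school in the dataframe opened in 2005 and closed in 2015. The iterator should check the "ClosedYear" column and add 1 to that row's value in every year column < 2015 (2010, 2011, ..., 2014). If the "ClosedYear" column shows "NaN", then, starting from the year in the "OpenYear" column, add 1 to that row's value in every year column >= "OpenYear" (e.g. school #4: columns [2011, 2012, ..., 2016] get +1 and column [2010] is unchanged).
I am considering using "apply" to apply a function to the dataframe, but that may not be the most efficient way to solve the problem. I need help figuring out how to make this work! Thanks.
Extra step:
Once the counts are done, I would like to group the year columns by county. I am leaning towards "groupby" with a sum function to total the number of open schools per county per year. If anyone could add this while answering the question above, that would be very helpful.
Expected output: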
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 1 0 2005 2015.0
2 Active Alameda 1 1 2006 NaN
3 Closed Alameda 1 0 2008 2015.0
4 Active Alameda 0 1 2011 NaN
5 Active Alameda 0 1 2011 NaN
6 Active Alameda 0 1 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 1 1 1980 NaN
9 Active Alameda 1 1 1980 NaN
I feel like there should be a way to do this without using a for loop, but I can't think of it atm, so my solution is:
# Read example data
from io import StringIO  # This only works on Python 3+

import pandas as pd

# sep=r"\s+" splits the columns on runs of whitespace
df = pd.read_csv(StringIO(
"""StatusType County OpenYear ClosedYear
Closed Alameda 2005 2015.0
Active Alameda 2006 NaN
Closed Alameda 2008 2015.0
Active Alameda 2011 NaN
Active Alameda 2011 NaN
Active Alameda 2012 NaN
Closed Alameda 1980 1989.0
Active Alameda 1980 NaN
Active Alameda 1980 NaN"""), sep=r"\s+")

# For each year
for year in range(2010, 2016 + 1):
    # Create a column of 0s
    df[str(year)] = 0
    # Where the year is between OpenYear and ClosedYear (or ClosedYear is NaN), set it to 1
    df.loc[(df['OpenYear'] <= year) & (pd.isna(df['ClosedYear']) | (df['ClosedYear'] >= year)), str(year)] = 1

print(df.to_string())
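The loop above can also be written without a for loop. What follows is only a minimal sketch, assuming the same sample rows and the same inclusive rule (a school still counts as open in its ClosedYear). It uses NumPy broadcasting to compare every row against every year at once. The names years, opened, still_open and flags are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample rows as in the answer above.
df = pd.DataFrame({
    "StatusType": ["Closed", "Active", "Closed", "Active", "Active",
                   "Active", "Closed", "Active", "Active"],
    "County": ["Alameda"] * 9,
    "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
    "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                   1989.0, np.nan, np.nan],
})

years = np.arange(2010, 2017)  # 2010..2016 inclusive

# A (n_rows, 1) column compared with a (n_years,) vector broadcasts
# to a (n_rows, n_years) boolean matrix.
opened = df["OpenYear"].to_numpy()[:, None] <= years

# Treat a missing ClosedYear as "never closed" by filling it with +inf.
closed_year = df["ClosedYear"].fillna(np.inf).to_numpy()[:, None]
still_open = closed_year >= years  # ">=" counts the closing year as open

flags = pd.DataFrame((opened & still_open).astype(int),
                     index=df.index,
                     columns=[str(y) for y in years])
df = df.join(flags)
print(df.to_string())
```

To follow the rule stated in the question instead (only years before ClosedYear count), the comparison in still_open would become a strict ">".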
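(Note: I'm not quite sure what you want to do with groupby.)
Unless you really need to create these intermediate columns, you can get the counts directly with groupby and .size.
Whether the closing year counts as open depends on the inequality applied to ClosedYear (the code here uses >=, which includes the closing year); change it if you want to exclude that year.
Can you add an example of your expected output?
I believe your groupby would count a school open from 2012 to NaN as open in every year.
No worries, it took me 30 tries to get them right.
df = df.groupby(['County']).sum()
pub_s.loc['OpenYear']=year]),year]+=int(1)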
StatusType County OpenYear ClosedYear 2010 2011 2012 2013 2014 2015 2016
0 Closed Alameda 2005 2015.0 1 1 1 1 1 1 0
1 Active Alameda 2006 NaN 1 1 1 1 1 1 1
2 Closed Alameda 2008 2015.0 1 1 1 1 1 1 0
3 Active Alameda 2011 NaN 0 1 1 1 1 1 1
4 Active Alameda 2011 NaN 0 1 1 1 1 1 1
5 Active Alameda 2012 NaN 0 0 1 1 1 1 1
6 Closed Alameda 1980 1989.0 0 0 0 0 0 0 0
7 Active Alameda 1980 NaN 1 1 1 1 1 1 1
8 Active Alameda 1980 NaN 1 1 1 1 1 1 1
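For the extra step in the question (totals per county per year), one possible sketch is to sum only the year columns after grouping by County. It assumes the 0/1 year columns have been built as in the table above. Selecting year_cols (an illustrative name) before calling sum keeps OpenYear, ClosedYear and the text columns out of the aggregation, where a total would not be meaningful.

```python
import numpy as np
import pandas as pd

# Same sample rows as in the answer above.
df = pd.DataFrame({
    "StatusType": ["Closed", "Active", "Closed", "Active", "Active",
                   "Active", "Closed", "Active", "Active"],
    "County": ["Alameda"] * 9,
    "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
    "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                   1989.0, np.nan, np.nan],
})

# Rebuild the 0/1 flags with the same inclusive rule as the loop answer.
year_cols = [str(y) for y in range(2010, 2017)]
for col in year_cols:
    year = int(col)
    is_open = (df["OpenYear"] <= year) & (
        df["ClosedYear"].isna() | (df["ClosedYear"] >= year)
    )
    df[col] = is_open.astype(int)

# One row per county, one column per year, each cell = number of open schools.
per_county = df.groupby("County")[year_cols].sum()
print(per_county)
```

With the sample data there is only one county, so per_county has a single Alameda row; with more counties each would get its own row.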
StatusType County OpenYear ClosedYear
1 Closed Alameda 2005 2015.0
2 Active Alameda 2006 NaN
3 Closed Alameda 2008 2015.0
4 Active Alameda 2011 NaN
5 Active Alameda 2011 NaN
6 Active Alameda 2012 NaN
7 Closed Alameda 1980 1989.0
8 Active Alameda 1980 NaN
9 Active Alameda 1980 NaN
import pandas as pd

year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016]
df_list = []
for year in year_list:
    # True for schools open in `year`: opened on or before it and not closed before it
    group = ((df.ClosedYear.isnull()) | (df.ClosedYear >= year)) & (df.OpenYear <= year)
    # Count rows per (open flag, county) and keep only the open (True) schools
    n_schools = df.groupby([group, df.County]).size()[True]
    df_list.append(pd.DataFrame({'n_schools': n_schools, 'year': year}))

ndf = pd.concat(df_list)
#          n_schools  year
# County
# Alameda          5  2010
# Alameda          7  2011
# Alameda          8  2012
# Alameda          8  2013
# Alameda          8  2014
# Alameda          8  2015
# Alameda          6  2016
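The counts above treat the closing year as open. The question instead asks to count only the years before ClosedYear, so here is a hedged variant with a strict ">" on ClosedYear. It is a sketch under that assumption, not the answerer's code. It also filters the open schools before grouping rather than looking up the True group, so a year with no open schools would simply contribute no rows. The names rows, is_open and counts are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample rows as in the question.
df = pd.DataFrame({
    "StatusType": ["Closed", "Active", "Closed", "Active", "Active",
                   "Active", "Closed", "Active", "Active"],
    "County": ["Alameda"] * 9,
    "OpenYear": [2005, 2006, 2008, 2011, 2011, 2012, 1980, 1980, 1980],
    "ClosedYear": [2015.0, np.nan, 2015.0, np.nan, np.nan, np.nan,
                   1989.0, np.nan, np.nan],
})

rows = []
for year in range(2010, 2017):
    # Strict ">": a school that closed in 2015 is not counted as open in 2015.
    is_open = (df["OpenYear"] <= year) & (
        df["ClosedYear"].isna() | (df["ClosedYear"] > year)
    )
    # Keep only open schools, then count them per county.
    counts = df.loc[is_open].groupby("County").size()
    rows.append(counts.rename("n_schools").reset_index().assign(year=year))

# Long format: one row per (County, year) pair.
ndf = pd.concat(rows, ignore_index=True)
print(ndf)
```

Compared with the inclusive version, only the 2015 count changes for this sample, because two schools closed in 2015.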