Python 3.x: create a new column based on data in existing columns
I have a three-level hierarchy: property -> prov -> co. Each property has a segment, i.e. Hotel/Home. I have written a query to get the counts of the following:
properties = spark.sql("""
    SELECT
        COUNT(ps.property_id) AS property_count,
        ps.prov_id,
        c.id AS co_id,
        ps.segment
    FROM
        schema.t1 ps
    INNER JOIN
        schema.t2 c
        ON c.id = ps.co_id  -- was `p.co_id`; `p` is not a defined alias, the table alias is `ps`
    GROUP BY
        2, 3, 4
""")
properties = properties.toPandas()
This gives me the total number of properties per segment, per province, per company. Based on the above `properties` df, I want to create a new df with the following columns:

- prov_id
- prov_segment
- co_id
- co_segment

If more than 50% of the properties within a prov_id belong to the Home segment, the prov_segment should be 'Home'; otherwise it should be 'Core'.
Likewise, the co_segment should be 'Home' if more than 50% of its prov_ids belong to the Home prov_segment; otherwise it should be 'Core'.
I know I can get the total number of properties by grouping the data:

prop_total_count = properties.groupby('prov_id')['property_count'].sum()

However, I am not sure how to use this to create the new DataFrame.
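For illustration, the >50% rule can be expressed directly on the grouped sums. This is only a sketch using made-up sample data that mirrors the expected output below, and it simplifies the decision to Home vs Core:

```python
import pandas as pd

# Hypothetical sample mirroring the question's expected output
properties = pd.DataFrame({
    'property_count': [10, 200, 300, 10, 100, 200],
    'prov_id': [1, 1, 9, 9, 131, 199],
    'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
    'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home'],
})

# Share of Home properties per prov_id
home = properties[properties.segment == 'Home'].groupby('prov_id')['property_count'].sum()
total = properties.groupby('prov_id')['property_count'].sum()
home_share = (home / total).fillna(0)

# 'Home' when the Home share exceeds 50%, else 'Core'
prov_segment = home_share.gt(0.5).map({True: 'Home', False: 'Core'})
```

The same two lines with `'co_id'` as the grouping column give the company-level label.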
Sample data:

properties.show(6):

Based on the above, I would expect the following output:
| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1 | Home | ABC | Core |
| 9 | Core | ABC | Core |
| 131 | Home | MNM | Home |
| 199 | Home | KJK | Home |
prov_id 1 gets a Home segment because it has 200 Home properties and only 10 Core properties. prov_id 9 gets a Core segment because it has 300 Core properties versus 10 Home properties.

co_id ABC gets a Core segment because that portfolio has 310 Core properties in total versus 210 Home properties.
prov_id 131 and 199 appear in only one segment, so that segment is kept.

OK, this can perhaps be solved in a "shorter" way, but the following should work. It relies on creating two additional DataFrames with the segment of each group (co_id or prov_id) and merging the DataFrames at the end.

For older pandas versions it is not possible to merge a Series like co_id['co_segment'] into a DataFrame, so for compatibility purposes I added the .to_frame() call. With pandas versions >= 0.25.1 this is allowed and the call is redundant.

NB: this code assumes the only segments are Home, Core and Managed.
import pandas as pd

properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200],
                                'prov_id': [1, 1, 9, 9, 131, 199],
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})

def get_segment(row):
    # A segment wins on a strict majority; anything else falls back to 'Managed'
    if row['home_perc'] > 0.5:
        return 'Home'
    elif row['core_perc'] > 0.5:
        return 'Core'
    else:
        return 'Managed'

def get_grouped_dataframe(properties_df, grouping_col):
    # Per-group totals and per-segment subtotals; `properties_df` is used
    # throughout (the original referenced the global `properties` instead),
    # and the frame is not named `id` to avoid shadowing the builtin
    grouped = pd.DataFrame()
    grouped['total'] = properties_df.groupby(grouping_col)['property_count'].sum()
    grouped['home'] = properties_df[properties_df.segment == 'Home'].groupby(grouping_col)['property_count'].sum()
    grouped['core'] = properties_df[properties_df.segment == 'Core'].groupby(grouping_col)['property_count'].sum()
    grouped['managed'] = properties_df[properties_df.segment == 'Managed'].groupby(grouping_col)['property_count'].sum()
    grouped['home_perc'] = (grouped['home'] / grouped['total']).fillna(0)
    grouped['core_perc'] = (grouped['core'] / grouped['total']).fillna(0)
    # The original divided 'core' by 'total' here again (copy-paste bug)
    grouped['managed_perc'] = (grouped['managed'] / grouped['total']).fillna(0)
    grouped['segment'] = grouped.apply(get_segment, axis=1)
    return grouped

prov_id = get_grouped_dataframe(properties, 'prov_id')
prov_id.rename(columns={'segment': 'prov_segment'}, inplace=True)
#          total  home   core  home_perc  core_perc prov_segment
# prov_id
# 1          210   200   10.0   0.952381   0.047619         Home
# 9          310    10  300.0   0.032258   0.967742         Core
# 131        100   100    NaN   1.000000   0.000000         Home
# 199        200   200    NaN   1.000000   0.000000         Home

co_id = get_grouped_dataframe(properties, 'co_id')
co_id.rename(columns={'segment': 'co_segment'}, inplace=True)
#        total  home   core  home_perc  core_perc co_segment
# co_id
# ABC      520   210  310.0   0.403846   0.596154       Core
# KJK      200   200    NaN   1.000000   0.000000       Home
# MNM      100   100    NaN   1.000000   0.000000       Home

property_segments = properties.drop(columns=['property_count', 'segment']).drop_duplicates()
property_segments = pd.merge(property_segments, prov_id['prov_segment'].to_frame(), on='prov_id')
property_segments = pd.merge(property_segments, co_id['co_segment'].to_frame(), on='co_id')
#    prov_id co_id co_segment prov_segment
# 0        1   ABC       Core         Home
# 1        9   ABC       Core         Core
# 2      131   MNM       Home         Home
# 3      199   KJK       Home         Home
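As a more compact alternative (my own sketch, not part of the answer above), `pivot_table` can produce the per-segment counts in one step. Note that `idxmax` picks the plurality segment, which coincides with the >50% rule only when some segment holds an outright majority:

```python
import pandas as pd

# Same sample data as the answer above
properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200],
                                'prov_id': [1, 1, 9, 9, 131, 199],
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'],
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})

def majority_segment(df, key):
    # One row per key value, one column per segment, summed property counts
    counts = df.pivot_table(index=key, columns='segment',
                            values='property_count', aggfunc='sum', fill_value=0)
    # Each segment's share of the group total
    shares = counts.div(counts.sum(axis=1), axis=0)
    # Segment with the largest share per group
    return shares.idxmax(axis=1)

prov_segment = majority_segment(properties, 'prov_id')
co_segment = majority_segment(properties, 'co_id')
```

This version also needs no hard-coded segment list, so a new segment in the data is handled automatically.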
Edit: moved the duplicated code into a function and added the Managed segment based on the comments. Added the extra .to_frame() for compatibility purposes.

There is actually an additional segment called Managed. When I run this, I get an error: cannot merge a DataFrame with an instance of type Series. Does the solution need a modification?

A quick Google search for that error leads to Stack Overflow. I updated the code based on the answer found there; now it should work even for your older pandas version!

Maybe a new question is better than a comment(?), but I wonder whether it is possible to modify the solution to work per day? That is, if I introduce a yyyy_mm_dd column into the example properties df. I have been trying to modify your solution, but with no luck so far.
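Regarding the per-day follow-up: the same approach should carry over by grouping on the date together with the id, so each (day, id) pair gets its own label. This is an untested sketch under the assumption that a `yyyy_mm_dd` column is added, simplified to Home vs Core:

```python
import pandas as pd

# Hypothetical per-day sample: the assumed yyyy_mm_dd column is not in the original data
daily_properties = pd.DataFrame({
    'yyyy_mm_dd': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'prov_id': [1, 1, 1, 1],
    'property_count': [10, 200, 300, 10],
    'segment': ['Core', 'Home', 'Core', 'Home'],
})

def get_daily_grouped(df, grouping_col):
    # Group on (date, id) pairs instead of the id alone
    keys = ['yyyy_mm_dd', grouping_col]
    out = pd.DataFrame()
    out['total'] = df.groupby(keys)['property_count'].sum()
    out['home'] = df[df.segment == 'Home'].groupby(keys)['property_count'].sum()
    out['home_perc'] = (out['home'] / out['total']).fillna(0)
    out['segment'] = out['home_perc'].gt(0.5).map({True: 'Home', False: 'Core'})
    return out

daily = get_daily_grouped(daily_properties, 'prov_id')
```

The result is indexed by a (yyyy_mm_dd, prov_id) MultiIndex, so the same prov_id can switch segments from day to day.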