
Python 3.x: Creating a new column based on data in existing columns

Tags: python-3.x, pandas, apache-spark-sql, pandas-groupby

I have a three-level hierarchy: property -> prov -> co. Each property belongs to a segment, i.e. Hotel/Home. I have written a query to get the counts per group:

properties = spark.sql("""
    SELECT
        COUNT(ps.property_id) as property_count,
        ps.prov_id,
        c.id as co_id,
        ps.segment
    FROM
        schema.t1 ps
    INNER JOIN
        schema.t2 c
        ON c.id = ps.co_id
    GROUP BY
        2,3,4
""")
properties = properties.toPandas()
This gives me the total number of properties per segment, per province, per company. Based on the above properties df, I want to create a new df with the following columns:

- prov_id,
- prov_segment,
- co_id,
- co_segment
If more than 50% of the properties in a prov_id belong to the Home segment, then prov_segment should be 'Home'; otherwise it should be 'Core'. Similarly, co_segment should be 'Home' if more than 50% of its prov_ids belong to the Home prov_segment; otherwise it should be 'Core'.
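
For reference, a minimal sketch of that >50% rule at the prov_id level, assuming the properties frame produced by the query above (the names home_share and prov_segment are illustrative, not from the question):

# share of Home properties per prov_id
home = properties[properties['segment'] == 'Home'].groupby('prov_id')['property_count'].sum()
total = properties.groupby('prov_id')['property_count'].sum()
home_share = (home / total).fillna(0)   # prov_ids with no Home rows count as 0

# 'Home' when more than half of the group's properties are Home, else 'Core'
prov_segment = home_share.gt(0.5).map({True: 'Home', False: 'Core'})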

I know I can get the total number of properties by grouping the data:

prop_total_count = properties.groupby('prov_id')['property_count'].sum()
However, I am not sure how to use this to create the new dataframe.

Sample data, properties.show(6):

| property_count | prov_id | co_id | segment |
|----------------|---------|-------|---------|
| 10             | 1       | ABC   | Core    |
| 200            | 1       | ABC   | Home    |
| 300            | 9       | ABC   | Core    |
| 10             | 9       | ABC   | Home    |
| 100            | 131     | MNM   | Home    |
| 200            | 199     | KJK   | Home    |

Based on the above, I would expect the following output:

| prov_id | prov_segment | co_id | co_segment |
|---------|--------------|-------|------------|
| 1       | Home         | ABC   | Core       |
| 9       | Core         | ABC   | Core       |
| 131     | Home         | MNM   | Home       |
| 199     | Home         | KJK   | Home       |
prov_id 1 gets a Home segment because it has 200 Home properties versus only 10 Core properties. prov_id 9 gets a Core segment because it has 300 Core properties versus 10 Home properties.

co_id ABC gets a Core segment because that portfolio has 310 Core properties in total versus 210 Home properties.

prov_id 131 and 199 only appear in one segment, so that segment is kept.
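
As a quick check of the ABC numbers, using the counts from the sample data above:

home_abc = 200 + 10    # Home properties under prov 1 and prov 9
core_abc = 10 + 300    # Core properties under prov 1 and prov 9
print(home_abc / (home_abc + core_abc))   # ~0.404, not above 0.5, so ABC is Core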

OK, this can perhaps be solved in a "shorter" way, but this should work. It relies on creating two more dataframes with the segment of each group (co_id and prov_id), and then merging the dataframes at the end.

With older pandas versions it is not possible to merge a Series like co_id['co_segment'] into a dataframe, so I added the .to_frame() call for compatibility purposes. With pandas version >= 0.25.1 the operation is allowed and the call is redundant.
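
To illustrate the compatibility point, a standalone toy sketch (the frames left and seg are made up, not part of the answer):

import pandas as pd

left = pd.DataFrame({'co_id': ['ABC', 'KJK']})
seg = pd.Series(['Core', 'Home'],
                index=pd.Index(['ABC', 'KJK'], name='co_id'),
                name='co_segment')

# on pandas < 0.25.1 merging the bare Series raises the
# "can not merge DataFrame with instance of type Series" error:
# pd.merge(left, seg, on='co_id')

# converting the Series to a one-column frame works on any version
print(pd.merge(left, seg.to_frame(), on='co_id'))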

NB: this code assumes the only segments are Home, Core and Managed.

import pandas as pd

properties = pd.DataFrame(data={'property_count': [10, 200, 300, 10, 100, 200], 
                                'prov_id': [1, 1, 9, 9, 131, 199], 
                                'co_id': ['ABC', 'ABC', 'ABC', 'ABC', 'MNM', 'KJK'], 
                                'segment': ['Core', 'Home', 'Core', 'Home', 'Home', 'Home']})


def get_segment(row):
    if row['home_perc'] > 0.5:
        return 'Home'
    elif row['core_perc'] > 0.5:
        return 'Core'
    else:
        return 'Managed'


def get_grouped_dataframe(properties_df, grouping_col):
    # use the properties_df argument rather than the global frame
    grouped = pd.DataFrame()
    grouped['total'] = properties_df.groupby(grouping_col)['property_count'].sum()
    grouped['home'] = properties_df[properties_df.segment == 'Home'].groupby(grouping_col)['property_count'].sum()
    grouped['core'] = properties_df[properties_df.segment == 'Core'].groupby(grouping_col)['property_count'].sum()
    grouped['managed'] = properties_df[properties_df.segment == 'Managed'].groupby(grouping_col)['property_count'].sum()
    # share of each segment in the group total; NaN (no rows in a segment) counts as 0
    grouped['home_perc'] = (grouped['home'] / grouped['total']).fillna(0)
    grouped['core_perc'] = (grouped['core'] / grouped['total']).fillna(0)
    grouped['managed_perc'] = (grouped['managed'] / grouped['total']).fillna(0)
    grouped['segment'] = grouped.apply(get_segment, axis=1)

    return grouped


prov_id = get_grouped_dataframe(properties, 'prov_id')
prov_id.rename(columns={'segment': 'prov_segment'}, inplace=True)

#          total  home   core  home_perc  core_perc prov_segment
# prov_id                                                  
# 1          210   200   10.0   0.952381   0.047619         Home
# 9          310    10  300.0   0.032258   0.967742         Core
# 131        100   100    NaN   1.000000   0.000000         Home
# 199        200   200    NaN   1.000000   0.000000         Home

co_id = get_grouped_dataframe(properties, 'co_id')
co_id.rename(columns={'segment': 'co_segment'}, inplace=True)

#        total  home   core  home_perc  core_perc co_segment
# co_id                                                  
# ABC      520   210  310.0   0.403846   0.596154       Core
# KJK      200   200    NaN   1.000000   0.000000       Home
# MNM      100   100    NaN   1.000000   0.000000       Home

property_segments = properties.drop(columns=['property_count', 'segment']).drop_duplicates()

property_segments = pd.merge(property_segments, prov_id['prov_segment'].to_frame(), on='prov_id')
property_segments = pd.merge(property_segments, co_id['co_segment'].to_frame(), on='co_id')

#    prov_id co_id co_segment prov_segment
# 0        1   ABC       Core         Home
# 1        9   ABC       Core         Core
# 2      131   MNM       Home         Home
# 3      199   KJK       Home         Home

EDIT: moved the repeated code into a function, added the Managed segment based on the comments, and added the extra .to_frame() for compatibility purposes.

There is actually an additional segment called Managed. When I run this I get an error: cannot merge a DataFrame with an instance of type Series. Does the solution need modifying?

A quick Google search for that error leads to Stack Overflow; I have updated the code based on the answer found there. It should now work even with your old pandas version!

Maybe a new question would be better than a comment(?), but I wonder whether the solution could be modified to work per day, i.e. if I introduce a yyyy_mm_dd column into the sample properties df. I have been trying to adapt your solution, but with no luck so far.
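
One possible direction for the per-day variant (a hedged sketch, not from the original thread: it assumes properties gains a yyyy_mm_dd column, and relies on pandas groupby accepting a list of keys, which the helper above passes through unchanged):

daily_props = properties.copy()
daily_props['yyyy_mm_dd'] = '2020-01-01'   # hypothetical date column for illustration

# grouping on (date, prov) yields one segment per prov per day
daily_prov = get_grouped_dataframe(daily_props, ['yyyy_mm_dd', 'prov_id'])
daily_prov.rename(columns={'segment': 'prov_segment'}, inplace=True)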