Python: comparing column values based on other column values in pandas


I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame([['M',2014,'Seth',5],
         ['M',2014,'Spencer',5],
         ['M',2014,'Tyce',5],
         ['F',2014,'Seth',25],
         ['F',2014,'Spencer',23]],columns =['sex','year','name','number'])

print(df)

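I want to find the most gender-ambiguous name in 2014. I have tried many approaches but have not had any luck so far.

I'm not sure what you mean by "most gender-ambiguous", but you can start with this: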

>>> dfy = (df.year == 2014)
>>> dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
>>> dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
>>> pd.merge(dfF, dfM, on=['name'])
      name  number_x  number_y
0     Seth        25         5
1  Spencer        23         5
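The default number_x and number_y column names do not say which side is which. A minimal sketch of the same inner merge with explicit suffixes follows; the names in_2014, females, males, and both are chosen here for illustration and are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [["M", 2014, "Seth", 5],
     ["M", 2014, "Spencer", 5],
     ["M", 2014, "Tyce", 5],
     ["F", 2014, "Seth", 25],
     ["F", 2014, "Spencer", 23]],
    columns=["sex", "year", "name", "number"],
)

in_2014 = df["year"] == 2014
females = df.loc[(df["sex"] == "F") & in_2014, ["name", "number"]]
males = df.loc[(df["sex"] == "M") & in_2014, ["name", "number"]]

# The inner join keeps only names present in both frames; the suffixes
# replace the default number_x / number_y column names.
both = pd.merge(females, males, on="name", suffixes=("_f", "_m"))
print(both)
```

Tyce drops out because the name has no female rows in 2014.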

If you only want the name with the highest total, then:

>>> dfT = pd.merge(dfF, dfM, on=['name'])
>>> dfT
      name  number_x  number_y
0     Seth        25         5
1  Spencer        23         5
>>> dfT['total'] = dfT['number_x'] + dfT['number_y']
>>> dfT.sort_values('total', ascending=False).head(1)
   name  number_x  number_y  total
0  Seth        25         5     30
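Note that sort_values(...).head(1) returns a single row even when several names share the highest total. If ties matter, one option is nlargest with keep="all". The sketch below rebuilds a merged frame by hand, and the "Sky" row is made up purely to produce a tie.

```python
import pandas as pd

# Merged frame in the same shape as dfT above; "Sky" is a hypothetical
# row added only to show how a tie behaves.
dfT = pd.DataFrame({
    "name": ["Seth", "Spencer", "Sky"],
    "number_x": [25, 23, 10],
    "number_y": [5, 5, 20],
})
dfT["total"] = dfT["number_x"] + dfT["number_y"]

# keep="all" returns every row tied for the largest total,
# whereas sort_values(...).head(1) would return only one of them.
top = dfT.nlargest(1, "total", keep="all")
print(top["name"].tolist())
```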

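Note: I did write a function at the end of this answer, but to make things easier to follow I run the code piece by piece first.

Getting the gender-ambiguous names

First, you need the list of gender-ambiguous names. I suggest using a set intersection: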

>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))

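Now you want to subset the data so that only the gender-ambiguous names in 2014 remain. You can use a membership test (isin) and chain the boolean conditions on one line: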

>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]
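One caveat worth checking on the full dataset: the set intersection above looks at all years, so a name could count as ambiguous because of a different year. The sketch below filters by year first and then counts distinct sexes per name. The 2013 row is hypothetical, added only to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    [["M", 2014, "Seth", 5],
     ["F", 2014, "Seth", 25],
     ["M", 2014, "Tyce", 5],
     ["F", 2013, "Tyce", 7]],  # hypothetical row: Tyce is female only in 2013
    columns=["sex", "year", "name", "number"],
)

# Intersection over all years, as in the step above.
all_years = sorted(set(df[df.sex == "M"].name) & set(df[df.sex == "F"].name))

# Restrict to 2014 first, then count distinct sexes per name.
df_2014 = df[df["year"] == 2014]
sexes_per_name = df_2014.groupby("name")["sex"].nunique()
ambiguous_2014 = sexes_per_name[sexes_per_name == 2].index.tolist()

print(all_years)       # includes Tyce
print(ambiguous_2014)  # Seth only
```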

Aggregating the data

You now have the subset as gender_ambiguous_data_2014:

>>> gender_ambiguous_data_2014

  sex  year     name  number
0   M  2014     Seth       5
1   M  2014  Spencer       5
3   F  2014     Seth      25
4   F  2014  Spencer      23

Then you simply sum the numbers per name:

>>> gender_ambiguous_data_2014.groupby('name').number.sum()

name
Seth       30
Spencer    28
Name: number, dtype: int64
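The per-name sum hides how the total splits between the sexes. As a rough sketch, pivot_table can put the male and female counts side by side; the variable name wide is an illustration and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [["M", 2014, "Seth", 5],
     ["M", 2014, "Spencer", 5],
     ["M", 2014, "Tyce", 5],
     ["F", 2014, "Seth", 25],
     ["F", 2014, "Spencer", 23]],
    columns=["sex", "year", "name", "number"],
)

# One row per name, one column per sex; a missing combination becomes 0.
wide = df[df["year"] == 2014].pivot_table(
    index="name", columns="sex", values="number", aggfunc="sum", fill_value=0
)
print(wide)
```

A name with a 0 in either column is not gender-ambiguous in that year.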

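Extracting the names

The last thing to do is get the name with the highest number. In real data, though, several gender-ambiguous names may tie on the same total. So we assign the previous result to a new variable, gender_ambiguous_numbers_2014, and work with that: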

>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]

Now you can see:

>>> gender_ambiguous_max_2014

name
Seth    30
Name: number, dtype: int64

Cool, let's extract the index names:

>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')

Wait, what type is this? (Hint: it is pandas.core.index.Index, i.e. a pandas Index object.)

No problem, just coerce it to a list:

>>> list(gender_ambiguous_max_2014.index)
['Seth']

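Let's put it in a function!

In this example our list has only one element. But we may want a function that returns a string when there is a single winner, or a list of strings when several gender-ambiguous names share the same total in that year.

In the wrapper function below I abbreviated gender_ambiguous as ga in the variable names to shorten the code. This assumes, of course, that the dataset has the same format as the one you showed and is named df. If yours is named differently, just change df accordingly.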

def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns:
        a string, or a list of strings

    Note:
        'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[
        ga_total_numbers == ga_max_number
    ]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if list only contains one element, return the only element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names

Calling the function is now straightforward:

>>> get_most_popular_gender_ambiguous_name(2014)  # assuming df is dataframe var name
'Seth'
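A possible variant, sketched under a few assumptions that differ from the function above: it takes the DataFrame as an argument instead of relying on a global df, it decides ambiguity within the requested year only, and it always returns a list (empty when nothing qualifies). The function name most_popular_ga_names is made up for this sketch.

```python
import pandas as pd

def most_popular_ga_names(data, year):
    """Return a list of the top gender-ambiguous names for `year` (may be empty)."""
    in_year = data[data["year"] == year]
    if in_year.empty:
        return []
    # Per name: how many distinct sexes appear, and the summed number.
    per_name = in_year.groupby("name").agg(
        sexes=("sex", "nunique"), total=("number", "sum")
    )
    ga = per_name[per_name["sexes"] == 2]
    if ga.empty:
        return []
    # Keep every name tied for the maximum total.
    return ga[ga["total"] == ga["total"].max()].index.tolist()

df = pd.DataFrame(
    [["M", 2014, "Seth", 5],
     ["M", 2014, "Spencer", 5],
     ["M", 2014, "Tyce", 5],
     ["F", 2014, "Seth", 25],
     ["F", 2014, "Spencer", 23]],
    columns=["sex", "year", "name", "number"],
)
print(most_popular_ga_names(df, 2014))
```

Returning a list in every case means callers do not have to check whether they received a string or a list.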

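I want to find the most gender-ambiguous name in 2014. The DataFrame above is only a small part of a very large DataFrame. The code above does not work, so let me explain the problem further: I need to find the name with the most males and females in 2014.

What does "the most males and females" mean? Do you mean the total of males and females? Then df[df.year==2014].groupby(['name'])['number'].sum() will do.

No, that is not what I want. For example, Pat is both a male and a female name. I want to find the name that has the most of both males and females.

OK, so what is your desired result if you use the DataFrame in your question as input?

OK @Fungie, you really need to define your problem: what is "the most of both males and females"? Say Pat has 10 males and 6 females, and Sam has 8 males and 8 females. Which one do you want to pick? Understanding a problem is half the answer.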
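To make the Pat and Sam example concrete, here is a small sketch comparing three candidate definitions. None of them comes from the thread; the column names total, smaller_side, and balance are assumptions chosen for illustration.

```python
import pandas as pd

# Counts from the Pat / Sam example in the comment above.
counts = pd.DataFrame({"M": [10, 8], "F": [6, 8]}, index=["Pat", "Sam"])

# Candidate 1: total across both sexes (Pat and Sam tie at 16).
counts["total"] = counts["M"] + counts["F"]
# Candidate 2: size of the smaller side, which rewards names common for both sexes.
counts["smaller_side"] = counts[["M", "F"]].min(axis=1)
# Candidate 3: ratio of smaller to larger side (1.0 means perfectly balanced).
counts["balance"] = counts["smaller_side"] / counts[["M", "F"]].max(axis=1)

print(counts)
print(counts["smaller_side"].idxmax())
```

Under the second and third definitions Sam ranks first, while the plain total cannot separate the two names, which is why the question needs a precise definition before any code can answer it.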