Python: Iterating operations over groups while considering multiple columns in a DataFrame

python pandas

I have a DataFrame:

import numpy as np
import pandas as pd

raw_data = {'cities': ['LA', 'LA', 'LA', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Boston', 'Boston', 'Boston', 'Boston', 'Boston'], 
        'location': ['pub', 'dive', 'club', 'disco', 'cinema', 'cafe', 'diner', 'bowling','supermarket', 'pizza', 'icecream', 'music'], 
        'distance': ['0', '50', '100', '5', '75', '300', '20', '40', '70', '400', '2000', '2'], 
        'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['cities', 'location', 'distance', 'score'])
df

Now I am trying to write a loop so that, for each city, the 'location' with the highest 'score' is returned within each successive 'distance' window.

That is, the highest-scoring location in each window of 100 distance units.

How can I write a loop to do this?

Desired output: a new DataFrame containing, for each city, the highest-scoring row in every 100-unit distance window.

I think what you want is:

df['distance'] = df['distance'].astype(int)

But I don't know how to keep the distance column matched to these city/score values and sort by those distances.

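You can create a dummy column that groups the distances into ranges of 100 units each. I first set any distance of 0 to 1, then divide by 100 and round up with numpy's ceil to get an integer range. For example, any distance between 0 and 100 km is grouped together (value 1 in the dummy column). I then group by city and the dummy column, take the index of the maximum score in each group, and look that index up in the original DataFrame. Finally, we don't want the dummy column in the final output, so I use iloc[:, :-1] to slice off the last column.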
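The steps described above could be sketched roughly as follows. This is only a reconstruction under assumptions: the helper column name window and the variable names best and result are made up for illustration, and the final lookup uses df.loc plus drop rather than the iloc slicing mentioned in the text.

```python
import numpy as np
import pandas as pd

raw_data = {
    'cities': ['LA'] * 3 + ['Chicago'] * 4 + ['Boston'] * 5,
    'location': ['pub', 'dive', 'club', 'disco', 'cinema', 'cafe', 'diner',
                 'bowling', 'supermarket', 'pizza', 'icecream', 'music'],
    'distance': ['0', '50', '100', '5', '75', '300', '20', '40', '70', '400', '2000', '2'],
    'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
df = pd.DataFrame(raw_data)
df['distance'] = pd.to_numeric(df['distance'])

# Hypothetical helper column: treat a distance of 0 as 1, then round
# distance / 100 up, so 1..100 -> 1, 101..200 -> 2, and so on.
df['window'] = np.ceil(df['distance'].replace(0, 1) / 100).astype(int)

# Index label of the highest score within each (city, window) group.
best = df.groupby(['cities', 'window'], sort=False)['score'].idxmax()

# Look the labels up in the original frame and drop the helper column.
result = df.loc[best].drop(columns='window').sort_index()
print(result)
```

Using drop(columns=...) instead of positional slicing keeps the sketch working even if the helper column is not the last one.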

Here is one way:

#df.distance=pd.to_numeric(df.distance)
df.sort_values('score').groupby([df.cities,pd.cut(df.distance,range(0,1000,100))]).tail(1).sort_index()
     cities  location  distance  score
1        LA      dive        50     94
5   Chicago      cafe       300     25
6   Chicago     diner        20     94
9    Boston     pizza       400     70
10   Boston  icecream      2000     62
11   Boston     music         2     70

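You can do it like this:

# Index of the top-scoring row per city (assumes df['distance'] is already numeric)
lS = df.groupby('cities')['score'].idxmax().tolist()
# Rows farther than 100 units; a plain mask stays aligned with df's index
lD = df['distance'] > 100
# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
df2 = pd.concat([df.loc[lS], df[lD]]).drop_duplicates().sort_values('cities', ascending=False).reset_index(drop=True)

Output:

    cities  location    distance    score
0   LA      dive        50          94
1   Chicago diner       20          94
2   Chicago cafe        300         25
3   Boston  pizza       400         70
4   Boston  icecream    2000        62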

Solution

This seems to work:

df['distance'] = pd.to_numeric(df['distance'])
df['bin100'] = pd.cut(df['distance'], np.arange(0, 2001, 100), include_lowest=True, labels=False)
df = df.iloc[df.groupby(['cities', 'bin100'], sort=False)['score'].idxmax(), :-1]

Thanks to @manwithfewneds for providing the logic applied here.
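Since the question asked for a loop, the same binning logic can also be written as an explicit iteration over the groups. This is a sketch rather than the accepted approach: the names best_labels and result are illustrative, and it collects index labels in the loop so the final lookup preserves the original column dtypes.

```python
import numpy as np
import pandas as pd

raw_data = {
    'cities': ['LA'] * 3 + ['Chicago'] * 4 + ['Boston'] * 5,
    'location': ['pub', 'dive', 'club', 'disco', 'cinema', 'cafe', 'diner',
                 'bowling', 'supermarket', 'pizza', 'icecream', 'music'],
    'distance': ['0', '50', '100', '5', '75', '300', '20', '40', '70', '400', '2000', '2'],
    'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
df = pd.DataFrame(raw_data)
df['distance'] = pd.to_numeric(df['distance'])

# Integer bin number for each 100-unit window, with 0 included in the first bin.
df['bin100'] = pd.cut(df['distance'], np.arange(0, 2001, 100),
                      include_lowest=True, labels=False)

best_labels = []
# Each iteration yields one (city, window) pair and the rows that belong to it.
for (city, window), group in df.groupby(['cities', 'bin100'], sort=False):
    # Remember the index label of the highest-scoring row in this group.
    best_labels.append(group['score'].idxmax())

# Build the new DataFrame from the collected labels, without the helper column.
result = df.loc[best_labels].drop(columns='bin100')
print(result)
```

The loop is easier to step through in a console, while the vectorised idxmax call on the whole groupby does the same work in one expression.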

For some DataFrames, df.loc may be needed to avoid an index-out-of-bounds error:

df = df.loc[df.groupby(['cities', 'bin100'], sort=False)['score'].idxmax()]
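A small illustration of why df.loc can matter here: idxmax returns index labels, which only coincide with integer positions when the DataFrame has the default 0..n-1 index. The sketch below assumes a frame whose index starts at 100 (an arbitrary choice for demonstration), so the labels are out of range as positions.

```python
import numpy as np
import pandas as pd

raw_data = {
    'cities': ['LA'] * 3 + ['Chicago'] * 4 + ['Boston'] * 5,
    'location': ['pub', 'dive', 'club', 'disco', 'cinema', 'cafe', 'diner',
                 'bowling', 'supermarket', 'pizza', 'icecream', 'music'],
    'distance': [0, 50, 100, 5, 75, 300, 20, 40, 70, 400, 2000, 2],
    'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
# Assumed for illustration: a non-default index running from 100 to 111.
df = pd.DataFrame(raw_data, index=range(100, 112))
df['bin100'] = pd.cut(df['distance'], np.arange(0, 2001, 100),
                      include_lowest=True, labels=False)

# idxmax returns index labels (101, 106, ...), not integer positions.
labels = df.groupby(['cities', 'bin100'], sort=False)['score'].idxmax()

iloc_failed = False
try:
    # Positions 101..111 do not exist in a 12-row frame.
    df.iloc[labels.to_numpy()]
except IndexError:
    iloc_failed = True

# Label-based lookup works regardless of how the index is numbered.
result = df.loc[labels].drop(columns='bin100')
print(iloc_failed)
print(result)
```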

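Shouldn't LA dive be selected, since it has a score of 94?

Yes, that was a typo; it will be fixed.

Aren't half of those distances greater than 100?

Yes, I meant windows of 100, i.e. the highest-scoring location per 100 units. So if the LA club's distance were 125, then dive and club would both be returned (assuming the 100-unit windows start at 0).

This is very close. How can I modify it to return a new df?

Like a new df for each one? Or all of the results in one df, with another column for the window?

A new df containing all the highest-scoring rows per window, just like the output, but not as print statements. The code under "Output:" above is exactly what I want, but as a new df.

You can obviously split the code into pieces if you like, but it works as one long one-liner. lol

I may be missing something, but I think the "new=" in your last edit now returns every row of the original df.

The output is correct! Would you mind explaining the approach a bit more? I'm having a hard time understanding those two statements.

I did my best to explain. If you like, you can dissect the code I wrote and see what happens piece by piece in the console.

Nicely done, good fix. I was going to type up a possible fix, but it looks like you got it.

iloc is purely integer-position based: you select rows and columns by their integer positions. loc is label-based, so you can specify column labels rather than integer positions. Both iloc and loc accept boolean indexing on rows; in other words, the rows for which the condition is true are returned as a slice of the DataFrame.

In my df, this line seems to be dropping every row with a distance value of 0.

This returns: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match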
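On the comment about rows with a distance of 0 disappearing: one possible cause, assuming the bins were built with pd.cut, is that the default intervals are open on the left, so a value equal to the first bin edge falls outside every bin unless include_lowest=True is passed. A minimal sketch with made-up distances:

```python
import numpy as np
import pandas as pd

# Made-up distances for illustration, including the 125 mentioned in the comments.
distances = pd.Series([0, 50, 100, 125])
bins = np.arange(0, 201, 100)  # bin edges 0, 100, 200

# Default: intervals are (0, 100] and (100, 200], so 0 gets no bin (NaN).
without_lowest = pd.cut(distances, bins, labels=False)

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(distances, bins, include_lowest=True, labels=False)

print(without_lowest.tolist())
print(with_lowest.tolist())
```

Rows whose bin is NaN are excluded from the groups by default when the bin column is used as a groupby key, which would make them vanish from an idxmax-based result.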