Python: querying on multiple conditions (pandas)

python pandas

I have the following DataFrame:

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
df

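I want to return a list of lists. Each inner list should contain two correct answers (a1 and a2) from the same question. There is one inner list per question_id. Other answers within the same question_id can be ignored.

Edge cases:

All answers are correct -> then just duplicate one. See question_id=1

No correct answer -> skip this question, e.g. output None. See question_id=5

Only one answer is correct -> skip this question, e.g. output None. See question_id=6

For example:

[['Paris', 'The capital is Paris'], ['MILAN', 'milano'],...]

My current approach outputs the same value for a1 and a2. What am I doing wrong?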

# This takes around 1min on cpu
def filter(grp):
    is_correct = grp['is_correct'] == 1.0
    if is_correct.any():
        sample = grp.sample()
        a1 = grp['answer'][is_correct].iloc[0]
        a2 = grp['answer'][is_correct].iloc[0]
        n = 6
        _ = 0
        # I will compare a1 and a2 6 times to see if they are the same
        # and if they are the same grap another one for a2... probably not smart
        while _ < n:
          if a1.index == a2.index:
            a2 = grp['answer'][is_correct].iloc[0]
          _ +=1
        return [a1, a2]

data = df.groupby('question_id').apply(filter).to_list()
# Drop None values
data_clean = [x for x in data if x is not None and x[1] is not None]
data_clean

You can do the following:

# get groups with at least one correct answer
res = df[df['is_correct'].astype(float).gt(0)].groupby('question_id')['answer'].agg(lambda x: x.head(2).to_list()).to_list()

# filter out groups with only one element
out = [l for l in res if len(l) > 1]
print(out)

Output:

[['hello', 'hello'], ['cat', 'the answer is cat'], ['Milan', 'MILAN'], ['The capital is Paris', 'Paris']]
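
As for why the function in the question returns the same value twice: a1 and a2 are both read with .iloc[0], and the retry loop reads .iloc[0] again, so a2 never moves off the first correct answer. The sketch below uses a small hypothetical group (not the full data from the question, and with is_correct already stored as floats) to contrast that with head(2), which takes the first two correct answers instead:

```python
import pandas as pd

# Hypothetical mini-group standing in for the rows of one question_id.
grp = pd.DataFrame({
    "answer": ["dog", "cat", "dog", "the answer is cat"],
    "is_correct": [0.0, 1.0, 0.0, 1.0],
})

is_correct = grp["is_correct"] == 1.0
correct = grp["answer"][is_correct]

a1 = correct.iloc[0]  # first correct answer
a2 = correct.iloc[0]  # same position again, so the same value as a1

# head(2) returns the first two correct answers rather than the first one twice.
first_two = correct.head(2).to_list()

print(a1, a2)
print(first_two)
```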

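If you also need the results shuffled:

def filter(g):
    answers = g.loc[g.is_correct == 1.0, 'answer']
    # presumably we want a random permutation of the answers
    answers = list(answers.sample(frac=1))
    # need at least one answer
    if len(answers) == 0:
        return None
    # if there is only one answer, duplicate it
    elif len(answers) == 1:
        answers = answers * 2
    return answers[:2]  # answers is already a list, so it can be sliced

list(df.groupby('question_id').apply(filter))

Output (the exact pairs vary between runs because of the random shuffle):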

[['hello', 'hello'],
 ['cat', 'the answer is cat'],
 ['Milan', 'MILAN'],
 ['Paris', 'paris'],
 None,
 ['5.5', '5.5']]
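
If reproducible output matters, one option is to seed the shuffle. The following is only a sketch of the same idea under some assumptions: the helper name pick_two and its seed parameter are made up for illustration, the data is a small hypothetical subset, is_correct is already float, and a plain loop over the groups is used in place of apply.

```python
import pandas as pd

# Hypothetical mini data set; is_correct is stored as float here.
df = pd.DataFrame({
    "question_id": ["1", "1", "1", "2", "3", "3"],
    "answer": ["Milan", "Paris", "MILAN", "lol", "5.5", "5.2"],
    "is_correct": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

def pick_two(answers, seed=0):
    """Return two shuffled correct answers, or None if there are none."""
    if answers.empty:
        return None
    # random_state makes the permutation repeatable between runs
    shuffled = list(answers.sample(frac=1, random_state=seed))
    if len(shuffled) == 1:
        # only one correct answer: duplicate it
        shuffled = shuffled * 2
    return shuffled[:2]

# One entry per question_id, in sorted key order.
result = [
    pick_two(g.loc[g["is_correct"] == 1.0, "answer"])
    for _, g in df.groupby("question_id")
]
print(result)
```

Passing a different seed per call (or None) restores run-to-run variation.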

Comments:

I just wanted to let you know that you made my day. Thank you.

@Exa Glad I could help.

Thanks! Very nice structure, which made it easier for me to understand. Somehow it also outputs the question_id for me, including the header question_id: question_id 1 ['hello', 'hello'] 2 ['cat', 'the answer is cat']. I am using Google Colab, not sure if that is the reason.

Thanks for the feedback! I have added to_list() to the end of the groupby line: does this give the output you expect?

Thanks for the quick reply. Now I get the error AttributeError: 'DataFrame' object has no attribute 'to_list'. However, I noticed that I also get the error with the pseudocode and df posted above. So the problem is on my side; your code works.

Alternatively (depending on the pandas version), you can do list(df.groupby(…)), which I think is version-independent. I have updated my answer accordingly! Another note: in your example, is_correct is actually a string, so before proceeding I needed to do df.is_correct = df.is_correct.astype(float). This may also depend on the version/OS. Alternatively, you can specify the type when calling pd.DataFrame(…).

Thank you very much!
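
To illustrate the point about is_correct being a string, here is a minimal sketch on a three-row hypothetical subset of the question's data. It counts how many rows match == 1.0 before and after the astype(float) conversion mentioned in the comments; the variable names string_matches and float_matches are only for illustration.

```python
import pandas as pd

columns = ["question_id", "answer", "is_correct"]
data = [["1", "hello", "1.0"],
        ["2", "dog", "0.0"],
        ["2", "cat", "1.0"]]
df = pd.DataFrame(columns=columns, data=data)

# is_correct holds strings such as "1.0", which do not equal the float 1.0.
string_matches = int((df["is_correct"] == 1.0).sum())

# Convert only this column; the other columns stay as they are.
df["is_correct"] = df["is_correct"].astype(float)
float_matches = int((df["is_correct"] == 1.0).sum())

print(string_matches, float_matches)
```

Writing the values as floats in the data list itself (1.0 instead of '1.0') would avoid the conversion step altogether.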