Pythonic way to find a DataFrame's column values in a given string

python pandas dataframe

I have a pandas DataFrame like this:

import pandas as pd

data = {
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America'],
}
df = pd.DataFrame(data)
print(df)

            col1                      col2
0    New Zealand  Republic of South Africa
1            Gym                      Park
2  United States  United States of America

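I have a sentence that may contain words from any column of the DataFrame. I want to get the column values that appear in the given sentence, along with the column each one is in. I have seen some similar solutions, but they match the given sentence against the column values rather than the other way around. Currently, I do it like this: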

def find_match(df,sentence):
    "returns true/false depending on the matching value and column name where the value exists"
    arr=[]
    cols=[]
    flag=False
    for i,row in df.iterrows():
        if row['col1'].lower() in sentence.lower():
            arr.append(row['col1'])
            cols.append('col1')
            flag=True
        elif row['col2'].lower() in sentence.lower():
            arr.append(row['col2'])
            cols.append('col2')
            flag=True
    return flag,arr,cols

sentence="I live in the United States"
find_match(df,sentence)  # returns (True, ['United States'], ['col1'])

I would like a more Pythonic way to do this, because it takes a long time on fairly large DataFrames, and it doesn't look very Pythonic to me.

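I can't use .isin(), because it expects a list of strings and matches each column value against the whole given sentence. I also tried the following, but it raises an error: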

df.loc[df['col1'].str.lower() in sentence]  # throws error that df['col1'] should be a string
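For reference, the error arises because Python's `in` operator with a `str` on the right-hand side requires a `str` on the left, so a whole Series cannot be tested at once. The following is a minimal sketch that reproduces the failure and shows an element-wise check instead; the variable names `raised`, `mask` and `matches` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['New Zealand', 'Gym', 'United States']})
sentence = "I live in the United States"

# A Series on the left of `in <str>` raises TypeError.
try:
    df['col1'].str.lower() in sentence
    raised = False
except TypeError:
    raised = True

# Element-wise alternative: test each lowercased value against the
# lowercased sentence, producing a boolean Series usable with .loc.
sentence_lower = sentence.lower()
mask = df['col1'].str.lower().map(lambda v: v in sentence_lower)
matches = df.loc[mask, 'col1'].tolist()
print(raised, matches)
```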

Any help would be much appreciated. Thanks.

I would do it like this:

def find_match(df,sentence):
    ids = [(i,j) for j in df.columns for i,v in enumerate(df[j]) if v.lower() in sentence.lower()]
    return len(ids)>0, [df[id[1]][id[0]] for id in ids], [id[1] for id in ids]

For example:

find_match(df, sentence = 'I regularly go to the gym in the United States of America')

(True,
 ['Gym', 'United States', 'United States of America'],
 ['col1', 'col1', 'col2'])

To my mind this is a reasonably Pythonic approach, although there may be more elegant ways that make greater use of pandas functions.
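As one possible way to lean more on pandas, here is a minimal sketch that reshapes the frame into long form with `DataFrame.melt` and then filters it with a boolean mask. The function name `find_match_melt` and the labels `column` and `value` are hypothetical, and the sketch assumes every cell is a string.

```python
import pandas as pd

def find_match_melt(df, sentence):
    """Return (found, matched values, their column names), column by column."""
    sentence_lower = sentence.lower()
    # Long form: one row per cell, with the originating column name alongside.
    long_df = df.melt(var_name="column", value_name="value")
    # True where the lowercased cell value is a substring of the sentence.
    mask = long_df["value"].str.lower().map(lambda v: v in sentence_lower)
    hits = long_df[mask]
    return bool(mask.any()), hits["value"].tolist(), hits["column"].tolist()

df = pd.DataFrame({
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America'],
})
result = find_match_melt(
    df, 'I regularly go to the gym in the United States of America')
print(result)
```

Because `melt` lists the cells column by column, the matches come back in the same order as in the list-comprehension version. The substring test itself is still a Python-level loop, so this is mainly a readability change rather than a guaranteed speed-up.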

Apparently you want to check whether each value in col1 is a substring of the sentence. Is that correct? If so, here is one way:

import pandas as pd

df = pd.DataFrame(
    {'col1': ['New Zealand', 'Gym', 'United States'],
     'col2': ['Republic of South Africa', 'Park', 'United States of America']})

sentence = 'I live in the United States'

# `mask` is a boolean Series: True where the col1 value is a substring of `sentence`
mask = df['col1'].apply(lambda x: x in sentence)

if mask.any():
    matches = df.loc[mask, 'col1']
    print(mask.any(), end=', ')
    print(matches.values, end=', ')
    print('col1')
    print()

# the print statements produce the following line
# True, ['United States'], col1

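If this is the right logic for one column, then you can put the mask statement and the if clause inside a loop over the columns (for col in df.columns).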
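A minimal sketch of that loop, assuming every cell is a string and using a lowercased comparison so that case differences do not matter; the `results` dictionary and the lowercase example sentence are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': ['New Zealand', 'Gym', 'United States'],
     'col2': ['Republic of South Africa', 'Park', 'United States of America']})

sentence = 'I live in the united states'
sentence_lower = sentence.lower()

results = {}  # column name -> list of matched values, in their original casing
for col in df.columns:
    # Lowercase both sides for the test only; the DataFrame is not modified.
    mask = df[col].apply(lambda x: x.lower() in sentence_lower)
    if mask.any():
        results[col] = df.loc[mask, col].tolist()

print(results)
```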


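Update: we can modify the lambda expression to perform a case-insensitive comparison. (The original DataFrame is not changed.)

Comments:

Nice, but it doesn't give the exact string that matched, right?

So you want the solution to include only one value per row? So a sentence containing both gym and park should only return the location of gym?

I just edited the answer so that it prints the strings instead of the row IDs.

The edit doesn't return a match if the value is lowercase. Can you tell me how to fix it? Apparently x.lower() doesn't work when masking.

I'll add lower() to the lambda expression, which works for my example.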