Python: How do I compare a string against the strings in a DataFrame using pandas?


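Suppose I have a string stored in text. I want to compare this string against the lists of strings stored in a DataFrame and check whether the text contains words such as car, plane, and so on. For every keyword found, I want to add one to the score of the topic it belongs to.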

| topic      | keywords                                  |
|------------|-------------------------------------------|
| Vehicles   | [car, plane, motorcycle, bus]             |
| Electronic | [television, radio, computer, smartphone] |
| Fruits     | [apple, orange, grape]                    |

I wrote the code below, but I don't really like it, and it doesn't work as expected.

def foo(text, df_lex):

    keyword = []
    score = []
    for lex_list in df_lex['keyword']:
        print(lex_list)
        val = 0
        for lex in lex_list:

            if lex in text:
                val =+ 1
        keyword.append(key)
        score.append(val)
    score_list = pd.DataFrame({
    'keyword':keyword,
    'score':score
    })

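Is there a more efficient way to do this? I'd rather not have too many loops in my program, since they don't seem very efficient. I can elaborate if needed. Thanks, everyone.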
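For reference, the function in the question has a few identifiable problems: `val =+ 1` parses as `val = +1`, so the counter is reset to 1 rather than incremented; `key` is never defined; the column is read as `'keyword'` although the table names it `keywords`; and nothing is returned. Below is a minimal corrected sketch, assuming the columns are named `topic` and `keywords` as in the table and that `keywords` already holds Python lists. The name `score_topics` is made up for this illustration.

```python
import pandas as pd


def score_topics(text, df_lex):
    """Count, per topic, how many of its keywords occur in text."""
    topics = []
    scores = []
    for topic, lex_list in zip(df_lex['topic'], df_lex['keywords']):
        val = 0
        for lex in lex_list:
            if lex in text:  # substring test, as in the question
                val += 1     # `val =+ 1` would assign +1 every time
        topics.append(topic)
        scores.append(val)
    return pd.DataFrame({'topic': topics, 'score': scores})


df_lex = pd.DataFrame({
    'topic': ['Vehicles', 'Electronic', 'Fruits'],
    'keywords': [['car', 'plane', 'motorcycle', 'bus'],
                 ['television', 'radio', 'computer', 'smartphone'],
                 ['apple', 'orange', 'grape']],
})
text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')

score_list = score_topics(text, df_lex)
print(score_list)
```

This keeps the nested loops of the original; it only addresses correctness, not speed.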

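Edit: for example, my text looks like this. I've kept it simple just so everyone can follow.

I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.

So my expected output is this: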

| topic      | score |
|------------|-------|
| Vehicles   | 2     |
| Electronic | 1     |
| Fruits     | 0     |

EDIT2: With @jezrael's help, I finally arrived at my own solution:

df['keywords'] = df['keywords'].str.strip('[]').str.split(', ')

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

score_list = []
for lex in df['keywords']:
    val = 0
    for w in lex:
        if w in text:
            val +=1
    score_list.append(val)
df['score'] = score_list
print(df)

It prints exactly what I need.

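Use re.findall to extract the words, convert them to lowercase, turn them into a set, and finally take the size of the intersection with each keyword list in a list comprehension:

import re
import pandas as pd

df = pd.DataFrame({'topic': ['Vehicles', 'Electronic', 'Fruits'], 'keywords': [['car', 'plane', 'motorcycle', 'bus'], ['television', 'radio', 'computer', 'smartphone'], ['apple', 'orange', 'grape']]})

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

# lowercased set of the words that occur in the text
words = set(w.lower() for w in re.findall(r'\w+', text))
# score = number of keywords of each topic that are in that set
df['score'] = [len(words & set(x)) for x in df['keywords']]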
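Since the question asks about avoiding explicit loops, here is one possible pandas-only variant. It is not taken from the answers; it is a sketch that swaps the list comprehension for `DataFrame.explode`, `Series.isin` and a group-wise sum, and it assumes the `keywords` column already holds lists.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'topic': ['Vehicles', 'Electronic', 'Fruits'],
    'keywords': [['car', 'plane', 'motorcycle', 'bus'],
                 ['television', 'radio', 'computer', 'smartphone'],
                 ['apple', 'orange', 'grape']],
})
text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')

# Lowercased set of whole words found in the text
words = set(re.findall(r'\w+', text.lower()))

# One row per keyword; the original row index is repeated for each keyword
exploded = df.explode('keywords')

# True where the keyword is one of the words in the text
hits = exploded['keywords'].isin(words)

# Sum the hits per original row and align back on the index
df['score'] = hits.groupby(level=0).sum()
print(df[['topic', 'score']])
```

Whether this is actually faster than the comprehension depends on the data size; for a handful of topics the difference is unlikely to matter.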

An alternative is to count only the True values in a list comprehension:

df = pd.DataFrame({'topic': ['Vehicles', 'Electronic', 'Fruits'], 'keywords': [['car', 'plane', 'motorcycle', 'bus'], ['television', 'radio', 'computer', 'smartphone'], ['apple', 'orange', 'grape']]})

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

df['score'] = [sum(z in text.split() for z in x) for x in df['keywords']]
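One caveat worth checking with the sample sentence: `text.split()` splits on whitespace only, so punctuation stays attached to the neighbouring word. The sketch below, which uses only the sample text and the Electronic keywords from the question, shows the effect and one way around it with `re.findall`. The variable names are illustrative.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')
keywords = ['television', 'radio', 'computer', 'smartphone']

# Whitespace splitting keeps the trailing comma: the token is 'smartphone,'
tokens_split = text.split()
print('smartphone' in tokens_split)
print(sum(k in tokens_split for k in keywords))

# Extracting runs of word characters drops the punctuation
tokens_re = re.findall(r'\w+', text.lower())
print(sum(k in tokens_re for k in keywords))
```

With whitespace splitting, the Electronic topic scores 0 for this sentence; with the regex tokens it scores 1, which matches the expected output in the question.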


Here are two alternatives that use only plain Python. First, the data of interest:

kwcsv = """topic, keywords
Vehicles, car, plane, motorcycle, bus
Electronic, television, radio, computer, smartphone
Fruits, apple, orange, grape
"""

test = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'
testr = test
from io import StringIO

StringIO is only used to make the example runnable; it stands in for reading a file. Next, build a dictionary kwords to count against:

import csv

kwords = dict()
#with open('your_file.csv') as mcsv:
mcsv = StringIO(kwcsv)
reader = csv.reader(mcsv, skipinitialspace=True)
next(reader, None) # skip header
for row in reader:
    kwords[row[0]] = tuple(row[1:])

Now we have a dictionary to count with. The first approach counts occurrences directly in the text string:

for r in '.,':  # strip punctuation that could interfere with the counts
    testr = testr.replace(r, '')

result = {k: sum((testr.count(w) for w in v)) for k, v in kwords.items()}

Alternatively, another version that splits the string with a regular expression and uses a Counter:

import re
from collections import Counter

words = re.findall(r'\w+', StringIO(test).read().lower())
count = Counter(words)

result2 = {k: sum((count[w] for w in v)) for k, v in kwords.items()}

Neither of these is necessarily better; they are just alternatives using plain Python. Personally, I would go with the re/Counter version.
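The two versions are not equivalent, which relates to the "scare" versus "car" point raised in the comments: `str.count` counts substrings, while the Counter is keyed on whole words. A small sketch with a made-up sentence (not from the question) shows the difference:

```python
import re
from collections import Counter

# Hypothetical sentence chosen so that 'car' also occurs inside 'scare'
sample = 'The scare was over before the car arrived.'

# Substring counting: matches 'car' inside 'scare' as well
print(sample.count('car'))

# Whole-word counting: only the standalone token 'car' is counted
count = Counter(re.findall(r'\w+', sample.lower()))
print(count['car'])
```

If only exact word matches should score, as the asker states in the comments, the Counter version is the one that behaves that way.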


Your current code considers "scare" a match for "car". Do you actually want that behavior or not? If not, do you want "car" to match "cars", or does that not matter?

@JohnZwinck No, I don't want that. I only want car to match car, not cars. My actual dataset is not in English, and plural forms like cars don't exist in it.

Is there a specific reason you are using pandas rather than vanilla Python for this? The standard library has some nice tools that make this kind of thing easy.

@jezrael If you mean keywords such as "good car", then yes, both are possible.

What do you mean by the text in the expected output? My expected output is the score for how many of the keywords appear. I hope that is clear enough?

@ahed87 I use pandas because I load the data from csv or txt as a DataFrame. Is that a bad idea? I'm very used to loading data from csv-like documents with pandas. Which tools would you recommend?

Thanks. But I don't know why, when I run this code, I get 1 for Vehicles, 2 for Electronic and 1 for Fruits.

@AnnaRG - I added a sample DataFrame, can you check? Because it is really strange that Fruits is 1.

Yes, I have run your code and got 1 for Fruits, which doesn't make sense because none of the Fruits keywords are found in the text.

@AnnaRG So use df['keywords'] = df['keywords'].str.strip('[]').str.split(', '), or import ast and df['keywords'] = df['keywords'].apply(ast.literal_eval)

@Anna So it is like df['score'] = [sum(z in text for z in x) for x in df['keywords']]
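The 1 / 2 / 1 scores reported in the comments are consistent with the `keywords` column holding plain strings such as '[apple, orange, grape]' rather than lists: iterating over a string yields single characters, and the characters 'a' and 'i' are also words in the sample sentence once it is lowercased. The sketch below reproduces this in plain Python, under the assumption that the score was computed by intersecting each row with the lowercased set of words in the text. The dictionary `raw_rows` is a stand-in for the unparsed column.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')
words = {w.lower() for w in re.findall(r'\w+', text)}

# Stand-in for a keywords column that was loaded as plain strings
raw_rows = {
    'Vehicles': '[car, plane, motorcycle, bus]',
    'Electronic': '[television, radio, computer, smartphone]',
    'Fruits': '[apple, orange, grape]',
}

# set(string) is a set of characters, so only 'a' and 'i' can match
as_chars = {t: len(words & set(s)) for t, s in raw_rows.items()}
print(as_chars)

# Parsing each string into a list first gives the intended scores
as_lists = {t: len(words & set(s.strip('[]').split(', ')))
            for t, s in raw_rows.items()}
print(as_lists)
```

This is why the `str.strip('[]').str.split(', ')` step suggested in the comments resolves the discrepancy.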