
Python: checking whether a word appears in multiple sheets


I have a dataframe in the following format.

Sample dataframe, where column `tokenised_text` holds the word list for each row:

row1:['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed']
row2:['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is']
row3:['in', 'the', 'coronavirus-induced', 'crisis', 'people', 'are',  'cyvbwx']
row4:['in', 'the', 'be-induced', 'crisis', 'people', 'are',  'cyvbwx']
A second column (`sheet_Retrieved_from`) records which sheet each row of words was retrieved from:

row1:sheet1
row2:sheet2
row3:sheet3
row4:sheet2
I also have a words collection that gathers all the words, built with the following code:

words_collection = []
for w in sample.tokenised_text:
    for t in w:
        words_collection.append(t)
and the sheet names come from (note the closing parenthesis, which was missing in the original snippet):

sheetlist = list(set(sample.sheet.to_list()))
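The two list-building snippets above can be condensed into one-liners. A minimal sketch, assuming a DataFrame named `sample` with `sheet` and `tokenised_text` columns (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical data mirroring the structure in the question
sample = pd.DataFrame({
    'sheet': ['sheet1', 'sheet2'],
    'tokenised_text': [['covid', 'landed'], ['covid', 'rate']],
})

# Flatten all token lists into one collection
# (list-comprehension equivalent of the nested for-loops above)
words_collection = [t for w in sample.tokenised_text for t in w]

# Unique sheet names (set order is arbitrary)
sheetlist = list(set(sample.sheet.to_list()))

print(words_collection)   # ['covid', 'landed', 'covid', 'rate']
print(sorted(sheetlist))  # ['sheet1', 'sheet2']
```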

However, what is the correct way to iterate over the dataframe to check whether a word appears in more than one sheet?

Basically, I am looking for output that shows the following:

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'sheet': ['sheet1', 'sheet2', 'sheet3', 'sheet2'],
    'tokenized_text': [
        ['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed'],
        ['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is'],
        ['in', 'the', 'coronavirus-induced', 'crisis', 'people', 'are', 'cyvbwx'],
        ['in', 'the', 'be-induced', 'crisis', 'people', 'are', 'cyvbwx'],
    ],
})

words_collection = ['efcc','fficial','billiontwits','since','covid','landed','in']

# Create a dictionary that maps each sheet to the words it contains
sheets_words = {}

# Go over the rows of the dataframe, and concatenate for each sheet the words in it
for index, row in df.iterrows():
    sheet_id = row['sheet']
    if sheet_id not in sheets_words:
        sheets_words[sheet_id] = set()
    sheets_words[sheet_id] |= set(row['tokenized_text'])

# Create a dictionary that maps for each word from `words_collection` the number of sheets it appears at
word_to_number_of_sheets = { w : 0 for w in words_collection }

# Go over the sheets
for sheet_id, sheet_words in sheets_words.items():
    # For each word in words_collection
    for w in words_collection:
        # Add 1 to the sheet it appears in if it appears in current sheet
        if w in sheet_words:
            word_to_number_of_sheets[w] += 1

word_to_number_of_sheets_as_list = list(word_to_number_of_sheets.items())

# [('efcc', 1), ('fficial', 1), ('billiontwits', 1), ('since', 1), ('covid', 1), ('landed', 1), ('in', 2)]
print(word_to_number_of_sheets_as_list)
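As an alternative to the explicit loops above, a more pandas-idiomatic sketch (not part of the original answer) uses `DataFrame.explode` plus a `groupby`/`nunique` to count, for each word, the number of distinct sheets it appears in:

```python
import pandas as pd

df = pd.DataFrame({
    'sheet': ['sheet1', 'sheet2', 'sheet3', 'sheet2'],
    'tokenized_text': [
        ['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed'],
        ['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is'],
        ['in', 'the', 'coronavirus-induced', 'crisis', 'people', 'are', 'cyvbwx'],
        ['in', 'the', 'be-induced', 'crisis', 'people', 'are', 'cyvbwx'],
    ],
})

words_collection = ['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed', 'in']

# One row per (sheet, word) pair, then count distinct sheets per word
exploded = df.explode('tokenized_text')
counts = exploded.groupby('tokenized_text')['sheet'].nunique()

result = [(w, int(counts.get(w, 0))) for w in words_collection]
print(result)
# [('efcc', 1), ('fficial', 1), ('billiontwits', 1), ('since', 1), ('covid', 1), ('landed', 1), ('in', 2)]
```

Using `nunique` on the `sheet` column ensures a word repeated within the same sheet is still counted once per sheet, matching the dictionary-of-sets approach above.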

Comments: Don't post data as images; copy it into the question. Please update the question to show the attempts/research you have made so far and where you got stuck. – Suraj
Looking at your question again, I infer that you want a list of tuples, so I will update my answer accordingly.