Python Pandas dataframe - how to eliminate duplicate words in a column


I have a data frame:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

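I would like to get the following result (no repeated words in each row):

Expected result (for the example above):

With the following code, I tried to convert all the data in the rows into a single string: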

final_list =[]
for index, rows in df.iterrows():
    # Create list for the current row
    my_list =rows.text
    # append the list to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text=''

for i in range(len(final_list)):
    text+=final_list[i]+', '

print(text)

The idea from a related question did not help me get the expected result:

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)

Does anyone know how to get it?

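You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace to remove those common words: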

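from functools import reduce

# Words that occur in every row (intersection of the per-row word lists)
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# Remove those words; regex=True is needed because pandas 2.0+ treats the pattern as a literal by default
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()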

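This sounds like a problem. Why do you want to do this?

I need these unique values for further NLP processing.

Actually, you only need to remove this, text and row to get the desired output.

Your expected result removes every copy of a word that appears in more than one row, but "the" is repeated in rows 0 and 2. How do you decide whether a word should be kept?

The above is just an example. It is not only this, text and row. My original data frame has about 20,000 words.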
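The question raised in the comments (which words to keep) can be made concrete with a small sketch. It assumes one possible rule that is not stated by the asker: drop a word as soon as it occurs in at least min_rows rows. The names min_rows, row_counts, shared and cleaned are hypothetical and only serve the illustration.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "category": [0, 1, 2],
    "text": [
        "this is some text for the first row",
        "second row has this text",
        "third row this is the text",
    ],
})

# Split every row into a list of whitespace-separated words.
tokens = df["text"].str.split()

# Count in how many rows each word occurs (a row counts a word at most once).
row_counts = Counter(word for words in tokens for word in set(words))

# Assumed rule: a word is "shared" once it appears in at least min_rows rows.
min_rows = 2
shared = {word for word, n in row_counts.items() if n >= min_rows}

# Keep only the words that are not shared, preserving their order.
cleaned = tokens.apply(lambda ws: " ".join(w for w in ws if w not in shared))
print(cleaned.tolist())
```

With min_rows = 2 this also removes "is" and "the", which occur in two of the three rows. Setting min_rows to the number of rows reproduces the "common to all rows" rule used in the answer.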
Result:

   category                    text
0         0   is some for the first
1         1              second has
2         2            third is the
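The same result can probably be reached without a regular expression, which may matter for a larger vocabulary: a pattern built by joining raw words also matches inside longer words (for example "row" inside "throw") and breaks on words containing regex metacharacters. The following is a minimal sketch of a token-based variant, assuming every value in the text column is a string (no missing values). It is an alternative to the answer's code, not the answer's own method.

```python
import pandas as pd

df = pd.DataFrame({
    "category": [0, 1, 2],
    "text": [
        "this is some text for the first row",
        "second row has this text",
        "third row this is the text",
    ],
})

# Split every row into a list of whitespace-separated words.
tokens = df["text"].str.split()

# Words present in every row: intersection of the per-row word sets.
common = set.intersection(*map(set, tokens))

# Rebuild each row from the words that are not common to all rows.
df["text"] = tokens.apply(lambda ws: " ".join(w for w in ws if w not in common))

print(df)
```

Because whole tokens are compared, only exact word matches are dropped, and the original word order within each row is preserved.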