Removing duplicate rows from a DataFrame in Python

python pandas dataframe

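I have a DataFrame with 27,949 rows and 7 columns; the first few rows are shown below.

Task: the DataFrame has a "title" column containing many near-duplicate titles that I want to remove (near-duplicate: the titles are identical except for 1 or 2 words). Pseudocode: compare the first row against all other rows and delete any row that is a duplicate of it. Then compare the second row against all other rows and delete any duplicates, and so on for every row; that is, i runs from the first row to the last and j runs from i+1 to the last row. My code: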

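    for i in range(0, 27950):
        for j in range(1, 27950):
            a = data_sorted['title'].iloc[i].split()
            b = data_sorted['title'].iloc[j].split()
            if len(a) - len(b) ...

I suggest the following approach: build a difference matrix of the titles, in which element (i, j) is the number of words by which the i-th and j-th titles differ.
Like this: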
    import numpy as np
    from itertools import product

    # all titles as a plain Python list
    l = list(data_sorted['title'])

    def diff_words(text_1, text_2):
        # return the number of words by which the two texts differ
        words_1 = text_1.split()
        words_2 = text_2.split()
        # np.intersect1d returns the unique words common to both texts
        diff = max(len(words_1), len(words_2)) - len(np.intersect1d(words_1, words_2))
        return diff


    # differences: a flattened matrix of integers; the entry for titles i and j
    # is at index i * len(l) + j
    differences = [diff_words(text_i, text_j) for text_i, text_j in product(l, l)]
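
The answer stops at computing the differences. One possible way to turn that measure into an actual deduplication is sketched below. The helper `drop_near_duplicates`, its `max_diff=2` threshold (taken from the "1 or 2 words" in the question), and the toy data are assumptions for illustration, not part of the original answer. The sketch compares each title only with the titles already kept, so it avoids materialising all pairs (about 781 million for 27,949 rows), although it is still quadratic in the worst case.

```python
import numpy as np
import pandas as pd


def diff_words(text_1, text_2):
    """Number of words by which two titles differ (same measure as above)."""
    words_1 = text_1.split()
    words_2 = text_2.split()
    return max(len(words_1), len(words_2)) - len(np.intersect1d(words_1, words_2))


def drop_near_duplicates(df, column="title", max_diff=2):
    """Hypothetical helper: keep a row only if its title differs by more than
    max_diff words from every title kept so far."""
    titles = df[column].tolist()
    kept = []  # positions of the rows we keep
    for pos, title in enumerate(titles):
        if all(diff_words(title, titles[k]) > max_diff for k in kept):
            kept.append(pos)
    # Build a new DataFrame instead of deleting rows while iterating
    return df.iloc[kept].reset_index(drop=True)


# Toy data standing in for data_sorted (assumed, not from the question)
data_sorted = pd.DataFrame({
    "title": [
        "blue cotton shirt for men",
        "blue cotton shirt for women",
        "red leather wallet",
        "blue cotton shirt for kids",
    ],
    "price": [10, 11, 25, 9],
})

deduped = drop_near_duplicates(data_sorted)
print(deduped)
```

Here the second and fourth titles each differ from the first by one word, so only the first and third rows survive. The first occurrence always wins, so the row order of `data_sorted` determines which variant is kept.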

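Does a duplicate title mean a duplicate row? If the titles are duplicated but the rows are not, that could cause problems. In any case, you get the positional index error because you are dropping elements inside the loop; setting j=j does not shrink the index range the loop runs over.

I added j=j because, once the row at j (i+1) is deleted, the row after it becomes the new j-th row.

But j=j has no effect, and you do not need j+=1 or i+=1 either: in Python a for loop advances automatically, and reassigning the loop variable inside the body does not change the value the next iteration receives. I hope my explanation is clear.

Deleting rows from a DataFrame on the fly seems a bit odd to me, since I mostly use DataFrames as append-only data structures. For a deduplication task I would use group_by (`groupby` in pandas) and apply to create a new DataFrame.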
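
To illustrate the two points made in these comments, here is a minimal sketch. It first shows that reassigning a loop variable does not affect the next iteration, then records the positions to drop and filters once at the end rather than deleting rows mid-loop. The tiny DataFrame and the exact-match comparison are assumptions chosen for brevity; a near-duplicate test would take the place of the `==`.

```python
import pandas as pd

# 1) Reassigning the loop variable only lasts until the end of the current pass.
seen = []
for j in range(4):
    seen.append(j)
    j += 1  # has no effect on the value j takes in the next iteration
print(seen)

# 2) Record what to drop, then build a new DataFrame once.
df = pd.DataFrame({"title": ["a b c", "a b c", "x y z"]})  # toy data (assumed)
n = len(df)
to_drop = set()
for i in range(n):
    if i in to_drop:
        continue  # row i is already marked as a duplicate
    for j in range(i + 1, n):
        # exact match for brevity; swap in a near-duplicate test here
        if df["title"].iloc[i] == df["title"].iloc[j]:
            to_drop.add(j)

keep_mask = [pos not in to_drop for pos in range(n)]
result = df[keep_mask].reset_index(drop=True)
print(result)
```

Because the original `df` never changes length during the loops, every `iloc` position stays valid and no positional index error can occur.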