Python: Combine duplicate DataFrame rows by concatenating the values of a specific column


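I want to merge rows in a way that concatenates the values of a specific column, but I am getting some unexpected results on my own dataset. Here is an example: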

import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3', '1', '3', '4', '4', '6', '6'],
                   'words': ['a', 'b', 'c', 'b', 'a', 'a', 'b', 'c', 'a']})
df2 = df.groupby('id')['words'].apply(' '.join).reset_index()

df2.head()

The result looks like this, which is what I want:

    id  words
0   1   a b
1   2   b
2   3   c a
3   4   a b
4   6   c a

The counts of the unique values in the words column also look fine:

df2.words.value_counts()
c a    2
a b    2
b      1
Name: words, dtype: int64

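However, on my own large dataset (which I can't really reproduce here), the output of df2.words.value_counts() looks something like the following, and I don't know why. Any idea what is going wrong here?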

df2.words.value_counts()
c a    10
a c    5
a b    10
b a    5
b      1
Name: words, dtype: int64

But it should look like this:

df2.words.value_counts()
c a    15
a b    10
b      1
Name: words, dtype: int64

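The counts here are made up, but this is the pattern I get: the same words show up as separate values in the 'words' column.

Any ideas?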

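' '.join concatenates the values in the order the rows appear, so 'a c' and 'c a' end up as two different strings. In my opinion, the simplest fix is to sort the values inside the join function so that value_counts works correctly: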

df2 = df.groupby('id')['words'].apply(lambda x: ' '.join(sorted(x))).reset_index()
print(df2)
  id words
0  1   a b
1  2     b
2  3   a c
3  4   a b
4  6   a c

print(df2.words.value_counts())
a b    2
a c    2
b      1
Name: words, dtype: int64
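
To see the effect in isolation, here is a minimal sketch with made-up data (not the asker's real dataset) in which two ids contain the same words in a different row order. The names `unsorted` and `joined` are just illustrative.

```python
import pandas as pd

# Hypothetical data: ids '1' and '2' hold the same words, but in a different row order.
df = pd.DataFrame({'id': ['1', '1', '2', '2', '3'],
                   'words': ['a', 'c', 'c', 'a', 'b']})

# A plain join keeps the row order, so the two groups produce different strings.
unsorted = df.groupby('id')['words'].apply(' '.join)
print(unsorted.value_counts())

# Sorting inside the join makes the result independent of the row order.
joined = df.groupby('id')['words'].apply(lambda x: ' '.join(sorted(x)))
print(joined.value_counts())
```

If repeated words within one id should also be collapsed, `sorted(set(x))` could be used instead of `sorted(x)`; whether that is desirable depends on the data.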