pandas: extract the common substring from one column based on a condition in another column

I have a data frame like this:

code    description          col3        col4
123456  nice shoes size4     something   something
123456  nice shoes size5     something   something
567890  boots size 1         something   something
567890  boots size 2         something   something
567890  boots size 3         something   something
234567  baby overall 2yrs    something.  something
234567  baby overall 3-4yrs  something   something
456778  shirt m              Something.  Something
456778  shirt l              something   Something
456778  shirt xl             Something   Something
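
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes `code` is stored as an integer, and it fills `col3` and `col4` with the plain value "something" rather than copying the small variations in the table.

```python
import pandas as pd

# Rows copied from the table above. Assumptions: integer codes, and
# col3/col4 simplified to a single filler value.
df = pd.DataFrame(
    {
        "code": [123456, 123456, 567890, 567890, 567890,
                 234567, 234567, 456778, 456778, 456778],
        "description": [
            "nice shoes size4", "nice shoes size5",
            "boots size 1", "boots size 2", "boots size 3",
            "baby overall 2yrs", "baby overall 3-4yrs",
            "shirt m", "shirt l", "shirt xl",
        ],
    }
)
df["col3"] = "something"
df["col4"] = "something"

# Number of rows per code: the groups have 2 or 3 rows each.
print(df.groupby("code").size())
```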

I would like to shorten "description" to the common substring shared by the rows that have the same "code", and then drop the duplicates:

code    description          col3        col4
123456  nice shoes           something   something
567890  boots                something   something
234567  baby overall         something   something
456778  shirt                Something   Something

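I suspect this needs a groupby and probably applying a function, but I can't work out how. I found the function below, but it takes only 2 strings, whereas my data may have 5 rows with the same code, so I don't know whether it will help.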

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

def extract_common(string1, string2):
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

    print(match)  # -> Match(a=0, b=15, size=9)
    print(string1[match.a: match.a + match.size])  # -> apple pie
    print(string2[match.b: match.b + match.size])  # -> apple pie
    return string1[match.a: match.a + match.size]
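
One possible way to stretch a two-string function over any number of strings is to fold it across the list with `functools.reduce`. The sketch below is an illustration, not a tested solution for the full data set: `longest_common_substring` and `common_substring` are hypothetical helper names, and reducing pairwise is a heuristic, since the longest match of the first two strings is not guaranteed to contain the best match across all of them.

```python
from difflib import SequenceMatcher
from functools import reduce


def longest_common_substring(a, b):
    """Return the longest contiguous block shared by two strings."""
    # autojunk=False switches off difflib's "popular element" heuristic,
    # which only matters for long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def common_substring(strings):
    """Fold the pairwise function over any number of strings."""
    return reduce(longest_common_substring, strings).strip()


print(common_substring(["boots size 1", "boots size 2", "boots size 3"]))
print(common_substring(["shirt m", "shirt l", "shirt xl"]))
```

Because the match is character-based, the first call keeps the shared word "size" as well, so this alone does not reduce "boots size 1" to "boots".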

Thanks for your help.

You need pandas 0.25.1 to use `explode`, which the code below relies on.
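
For readers who have not used `Series.explode` (available from pandas 0.25), here is a small illustration of what it does: each element of a list-valued entry becomes its own row, and the index label is repeated. The sample values are made up for the demonstration.

```python
import pandas as pd

# A Series whose values are lists of words, indexed by product code.
s = pd.Series(
    [["nice", "shoes", "size4"], ["boots", "size"]],
    index=[123456, 567890],
)

# explode() gives one row per list element and repeats the index label.
exploded = s.explode()
print(exploded)
```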

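Sorry, I don't think the data set was representative enough. Some rows may not contain the keyword "size". For clothing the size can be "m", and for children's clothing it can be "2-3 years". I hope there is a solution. Does anyone else have an idea?

Going by your data frame: you could update it so that it is representative. I don't know what the criteria are for col3 and col4, so I excluded them.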

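import pandas as pd  # df is the data frame from the question

# Split the rows: codes that occur more than once (df1) and codes that occur once (df2)
mask = df.groupby('code')['code'].transform('size') > 1
df1 = df[mask]
df2 = df[~mask]
# One word per row, indexed by code
s = df1.groupby('code', sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
# Keep the words that occur more than once, drop repeated words, and re-join them per code
s_not_duplicates = s.to_frame()[s.map(s.value_counts() > 1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
description_not_duplicates = pd.concat([s_not_duplicates, df2.description])
print(description_not_duplicates)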

123456      nice shoes
234567    baby overall
456778           shirt
567890      boots size
Name: description, dtype: object
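
As an alternative sketch (not the method used above), the shared part could be computed per group as the leading words that every description in the group has in common, and then combined with the other columns in one `groupby(...).agg(...)` call. `common_word_prefix` is a hypothetical helper, the sample frame is a shortened copy of the question's data, and taking the first `col3` value per code is an assumption, since the criteria for those columns were not given.

```python
import pandas as pd


def common_word_prefix(descriptions):
    """Return the leading words shared by every description in the group."""
    split = [d.split() for d in descriptions]
    prefix = []
    for words in zip(*split):        # walk the word positions in parallel
        if len(set(words)) != 1:     # stop at the first position that differs
            break
        prefix.append(words[0])
    return " ".join(prefix)


# Shortened copy of the question's data (assumed values for col3).
df = pd.DataFrame({
    "code": [123456, 123456, 567890, 567890, 567890],
    "description": ["nice shoes size4", "nice shoes size5",
                    "boots size 1", "boots size 2", "boots size 3"],
    "col3": "something",
})

out = (
    df.groupby("code", sort=False)
      .agg(description=("description", common_word_prefix),
           col3=("col3", "first"))
      .reset_index()
)
print(out)
```

This works on whole words, so "size4" and "size5" are dropped, but a word such as "size" that every row of a group shares is still kept. Removing it would need an extra, data-specific rule.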