Python: grouping profile strings that contain the same words in a different order


I have a dataframe with a column of profile types that looks like this:

0                                    Android Java
1                  Software Development Developer
2                            Full-stack Developer
3                      JavaScript Frontend Design
4                          Android iOS JavaScript
5                             Ruby JavaScript PHP
Frontend Design JavaScript  Design Frontend JavaScript
I used NLP fuzzy matching to find similar profiles, which returned the following similarity dataframe:

    left_side                    right_side                   similarity
7   JavaScript Frontend Design   Design JavaScript Frontend   0.849943
8   JavaScript Frontend Design   Frontend Design JavaScript   0.814599
9   JavaScript Frontend Design   JavaScript Frontend          0.808010
10  JavaScript Frontend Design   Frontend JavaScript Design   0.802881
12  Android iOS JavaScript       Android iOS Java             0.925126
15  Machine Learning Engineer    Machine Learning Developer   0.839165
21  Android Developer Developer  Android Developer            0.872646
25  Design Marketing Testing     Design Marketing             0.817195
28  Quality Assurance            Quality Assurance Developer  0.948010
While this helped, taking me from 478 unique profiles down to 461, I want to focus on profiles like the following:

0                                    Android Java
1                  Software Development Developer
2                            Full-stack Developer
3                      JavaScript Frontend Design
4                          Android iOS JavaScript
5                             Ruby JavaScript PHP
Frontend Design JavaScript  Design Frontend JavaScript
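The only tool I have seen that tackles this kind of problem is difflib. My question is: what other techniques can I use to detect these profiles, which consist of the same words in an inconsistent order, and standardize them into one standard string? The desired output would be to take any string containing "Design", "Frontend" and "JavaScript" and replace it with "designfrontendjavascript".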

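Right now I am merging my original dataframe with the similarity dataframe to replace every profile string that appears on the right side with the one on the left side. But that means I am replacing a right side such as "Java Python Data Science" with a left side such as "JavaScript Python Data Science".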

Any help would be greatly appreciated.

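EDIT: I wrote the following to replace every value in the clean_talentpool['profile'] column that matches an entry in words_to_keep, but it doesn't seem to work. Can anyone point out what I'm not seeing? I'd really appreciate it.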

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    for word in words_to_keep:
        if (sorted(word.replace(" ", ""))) == sorted(
            row.replace(" ", "")
        ) and word != row:
            row.replace(row, word)
    return row

clean_talentpool["profile"] = clean_talentpool["profile"].apply(
    lambda x: standardize_word_order(x)
)

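In your case I would not focus on the strings but on the characters. Basically, two strings match if they consist of the same characters, i.e. one is a permutation of the other: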

a = "Frontend Design JavaScript"
b = "JavaScript Frontend Design"

print(sorted(a) == sorted(b))
# prints True
You could also remove the spaces and apply other preprocessing, such as lowercasing:
if sorted(a.lower().replace(" ", "")) == sorted(b.lower().replace(" ", "")):
    # they are the same, do something
    print("same letters")
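A word-level variant of the same idea may be safer than comparing characters, since two strings built from different words can still share the same letters. The following is only a minimal sketch: `canonical_key` is a hypothetical helper name, not something from the question, and it assumes that profiles are whitespace-separated words and that case should be ignored.

```python
def canonical_key(profile: str) -> str:
    """Lowercase the words, sort them, and join them without spaces."""
    return "".join(sorted(profile.lower().split()))


# The same words in any order produce the same key.
print(canonical_key("Frontend Design JavaScript"))  # designfrontendjavascript
print(canonical_key("JavaScript Frontend Design"))  # designfrontendjavascript

# Different words produce different keys, so these two stay separate.
print(canonical_key("Java Python Data Science")
      == canonical_key("JavaScript Python Data Science"))  # False

# Character-level sorting treats these as equal; the word-level key does not.
print(sorted("ab cd".replace(" ", "")) == sorted("ad cb".replace(" ", "")))  # True
print(canonical_key("ab cd") == canonical_key("ad cb"))  # False
```

Joining without a separator gives the "designfrontendjavascript" form asked for in the question. Joining with a space instead would rule out the rare case where different word lists concatenate to the same string.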
Based on your example, an implementation could be:

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    # Compare letters only, ignoring spaces and case (the list above is lowercase).
    row_letters = sorted(row.lower().replace(" ", ""))
    for word in words_to_keep:
        if sorted(word.replace(" ", "")) == row_letters:
            return word
    return row

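# Apply the function to each value; passing the whole column would hand it a Series, not a string.
clean_talentpool["profile"] = clean_talentpool["profile"].apply(standardize_word_order)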
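If maintaining words_to_keep by hand becomes tedious, the canonical spellings could instead be derived from the data itself. The sketch below is an illustration under assumptions, not the method used above: `word_key`, `build_canonical_map` and `standardize` are hypothetical names, the first spelling seen for each set of words is arbitrarily chosen as the standard one, and a plain list stands in for the dataframe column.

```python
def word_key(profile: str) -> tuple:
    """Order-insensitive, case-insensitive key: the sorted lowercase words."""
    return tuple(sorted(profile.lower().split()))


def build_canonical_map(profiles):
    """Map each word key to the first spelling seen for it."""
    canonical = {}
    for profile in profiles:
        canonical.setdefault(word_key(profile), profile)
    return canonical


def standardize(profile, canonical):
    """Return the canonical spelling for this profile, or the profile itself."""
    return canonical.get(word_key(profile), profile)


# Sample values taken from the question; a real column would be used instead.
profiles = [
    "JavaScript Frontend Design",
    "Frontend Design JavaScript",
    "Design Frontend JavaScript",
    "Android iOS JavaScript",
    "JavaScript Android iOS",
    "Java Python Data Science",
]

canonical = build_canonical_map(profiles)
standardized = [standardize(p, canonical) for p in profiles]
print(standardized)
```

With pandas, the same function could probably be applied to the column with something like `clean_talentpool["profile"].apply(standardize, canonical=canonical)`, since `Series.apply` forwards extra keyword arguments to the function.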

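Thank you so much. This is a great strategy. Do you know how I would take the strings that satisfy the if condition and replace them with the ones from the words_to_keep list shown above?

As far as I can see, I don't think a lambda function is needed here. Lambdas are anonymous functions used to wrap a small piece of logic declared on the fly, to keep the code compact.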