Python: split a string so that each substring is a key in a dictionary


I have a sample string:

"green apple, sly fox, cunning quick fox fur, cool water, yellow sand"

And a dictionary:

strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior", "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}

I want to display the substrings of the string, together with their values from the dictionary, as a DataFrame. This is what I did:

    import pandas as pd

    sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
    strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior", "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}

    df_list = []
    stripped_list = [i.strip() for i in sample_str.split(',')]
    
    for i in stripped_list:
        if i in strr_dict:
            df_list.append([i, strr_dict[i]])
        else:
            for j in i.split():
                if j in strr_dict:
                    df_list.append([j, strr_dict[j]])
                else:
                    df_list.append([j, ""])
    
    strr_df = pd.DataFrame(df_list, columns=['Text', 'Value'])
    print(strr_df)

The result I get is:

             Text      Value
    0        green     color
    1        apple     fruit
    2          sly     behavior
    3          fox     animal
    4      cunning     behavior
    5        quick          
    6          fox     animal
    7          fur          
    8   cool water     drink
    9       yellow     color
    10        sand     matter

The output I expect is:

             Text      Value
    0        green     color
    1        apple     fruit
    2          sly     behavior
    3          fox     animal
    4      cunning     behavior
    5    quick fox     animal
    6          fur          
    7   cool water     drink
    8       yellow     color
    9         sand     matter

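If a substring exactly matches a dictionary key, I want to display its value. I would like to know how to split the string accordingly. In this case, `cunning quick fox fur` should be split into `cunning`, `quick fox`, `fur`. But that is not always the case: sometimes it should be split into `cunning`, `quick fox fur` to get the value from the dictionary. I am quite confused about how to handle this.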

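This does give the output you specified. I don't know why you would want to do this, and I don't know whether it works for the other inputs you may have, but it should. Feel free to test it with whatever other eldritch datasets you have prepared.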

    import pandas as pd

    sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
    strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior", "fox": "animal", "cunning": "behavior",
                 "quick fox": "animal", "cool water": "drink", "yellow": "color", "sand": "matter"}

    df_list = []
    stripped_list = [i.strip() for i in sample_str.split(',')]

    # Keys (or whole phrases) that have already been added to df_list
    checklist = []

    for i in stripped_list:
        if i in strr_dict:
            # The whole comma-separated phrase is itself a key
            df_list.append([i, strr_dict[i]])
            checklist.append(i)
        else:
            for z in list(strr_dict.keys()):
                # Skip keys whose text already appears in the checklist
                if z in str(checklist):
                    continue
                if z in i:
                    try:
                        df_list.append([i, strr_dict[i]])
                        checklist.append(i)
                    except KeyError:
                        # The phrase is not a key, so record the key found inside it
                        df_list.append([z, strr_dict[z]])
                        checklist.append(z)
        # Words that were neither matched nor are keys get an empty value
        for x in i.split():
            if x not in str(checklist) and x not in list(strr_dict.keys()):
                df_list.append([x, ""])

    strr_df = pd.DataFrame(df_list, columns=['Text', 'Value'])
    print(strr_df)

Output:

             Text     Value
    0       green     color
    1       apple     fruit
    2         sly  behavior
    3         fox    animal
    4     cunning  behavior
    5   quick fox    animal
    6         fur          
    7  cool water     drink
    8      yellow     color
    9        sand    matter
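
One property of the code above may matter for other inputs: every matched key is recorded in `checklist`, and later phrases skip any key whose text already appears there, so a key that occurs in more than one phrase is reported only once. The sketch below wraps the same loop in a function to make this easy to check. The function name `rows_from_answer` and the second input string are illustrative assumptions, not part of the original post.

```python
def rows_from_answer(sample_str, strr_dict):
    """The loop from the answer above, wrapped in a function for testing."""
    df_list, checklist = [], []
    for i in [s.strip() for s in sample_str.split(',')]:
        if i in strr_dict:
            df_list.append([i, strr_dict[i]])
            checklist.append(i)
        else:
            for z in strr_dict:
                # Keys already seen anywhere earlier are skipped here.
                if z in str(checklist):
                    continue
                if z in i:
                    # i is not a key in this branch, so the original
                    # try/except always ends up appending the key z.
                    df_list.append([z, strr_dict[z]])
                    checklist.append(z)
        for x in i.split():
            if x not in str(checklist) and x not in strr_dict:
                df_list.append([x, ""])
    return df_list


strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior",
             "fox": "animal", "cunning": "behavior", "quick fox": "animal",
             "cool water": "drink", "yellow": "color", "sand": "matter"}

# "green" appears in both phrases but is listed only for the first one.
repeated = rows_from_answer("green apple, green sand", strr_dict)
print(repeated)
```

If repeated keys should appear once per occurrence, the `checklist` test would need to be dropped or made local to each phrase.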


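So sometimes each space-separated word is a key, but sometimes two words together form one key? The input is very messy.

@Flying Thunder, exactly. Sometimes each word is a key, and sometimes two or more words together form a key.

@动物学家, what is the logic? How is the computer supposed to know when two words belong together and when they do not? You would have to check each `,`-separated string against all dictionary keys to see whether a key is contained in it. And what happens when several keys match (a one-word key and a two-word key, e.g. `quick fox` and `fox`)? Your example seems to want only the longest match, so that sounds doable, but (I know this is a Stack Overflow cliché) it sounds like it would be easier to just make sure your input is properly formatted.

@FlyingThunder, yes, checking every key like that is possible, but I was looking for a more efficient solution.

Hi, thank you very much, it works in most cases. For the following case: `cunning quick fox fur yellow sand`, it works for this string, but the `yellow sand` at the end, after `cool water`, is not displayed. This is part of an NLP process I am trying to carry out, and I want to display the values as a DataFrame.

What do you mean? If an input does not work, what is your input? When I use this input, "green apple, sly fox, cunning quick fox, yellow sand, cool water", it still works.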
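
The longest-match idea raised in the comments can be sketched as follows. This is a minimal greedy sketch, not the method from the answer above: at each word position it tries the longest run of words that is a dictionary key, then shorter runs, and falls back to an empty value for an unmatched word. The names `split_by_longest_key` and `build_rows` are hypothetical, and the sketch assumes that keys and phrases are separated by single spaces and that the first longest match is always the desired one.

```python
def split_by_longest_key(phrase, key_dict):
    """Greedily split a phrase into the longest word runs that are keys."""
    words = phrase.split()
    # The longest key, measured in words, bounds how far ahead to look.
    max_len = max((len(k.split()) for k in key_dict), default=1)
    rows = []
    pos = 0
    while pos < len(words):
        # Try the longest candidate first, then shrink toward one word.
        for size in range(min(max_len, len(words) - pos), 0, -1):
            candidate = " ".join(words[pos:pos + size])
            if candidate in key_dict:
                rows.append([candidate, key_dict[candidate]])
                pos += size
                break
        else:
            # No key starts at this word: keep it with an empty value.
            rows.append([words[pos], ""])
            pos += 1
    return rows


def build_rows(sample_str, key_dict):
    """Apply the splitter to every comma-separated phrase."""
    rows = []
    for phrase in sample_str.split(','):
        rows.extend(split_by_longest_key(phrase.strip(), key_dict))
    return rows


sample_str = "green apple, sly fox, cunning quick fox fur, cool water, yellow sand"
strr_dict = {"green": "color", "apple": "fruit", "sly": "behavior",
             "fox": "animal", "cunning": "behavior", "quick fox": "animal",
             "cool water": "drink", "yellow": "color", "sand": "matter"}

rows = build_rows(sample_str, strr_dict)
for text, value in rows:
    print(text, value)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=['Text', 'Value'])` as in the question. Each lookup is a dictionary membership test, so the work per phrase depends on the number of words and the longest key length rather than on the total number of keys. If `quick fox fur` were added as a key, the same loop would split the third phrase into `cunning` and `quick fox fur`.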