Python: extract the text between brackets and create a row for each piece of text


python pandas dataframe


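In a pandas DataFrame, I need to extract the text between square brackets and output that text as a new column. I need to do this at the "StudyID" level and create a new row for each piece of extracted text.

Below is a simplified example DataFrame: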

import pandas as pd

data = {
    "studyid":['101', 
                '101', 
                '102', 
                '103'],
    "Question":["Q1",
                "Q2",
                "Q1",
                "Q3"],
    "text":['I love [Bananas] and also [oranges], and [figs]',
            'Yesterday I ate [Apples]',
            '[Grapes] are my favorite fruit',
            '[Mandarins] taste like [oranges] to me'],
}
df2 = pd.DataFrame(data)

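I have worked out a solution (see the code below; if you run it, it shows the desired output), but it is very long and takes many steps. I would like to know whether there is a shorter way to do this.

You will see that I used str.findall() with the regex. I initially tried str.extractall(), which outputs the extracted text to a DataFrame, but I could not work out how to get the "studyid" and "Question" columns into the DataFrame that extractall() produces alongside the extracted text. So I resorted to str.findall().

Here is my code (I know it is clunky). How can I reduce the number of steps? Thanks in advance for your help.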

# Step 1: Use regex to pull out the text between the square brackets
df3 = pd.DataFrame(df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])").tolist())

# Step 2: Merge the extracted text back with the original data
df3 = df2.merge(df3, left_index=True, right_index=True)

# Step 3: Transpose the wide file to a long file (e.g. panel)
df4 = pd.melt(df3, id_vars=['studyid', 'Question'], value_vars=[0, 1, 2])

# Step 4: Delete rows with None in the value column
indexNames = df4[df4['value'].isnull()].index
df4.drop(indexNames, inplace=True)

# Step 5: Sort the data by the StudyID and Question
df4.sort_values(by=['studyid', 'Question'], inplace=True)

# Step 6: Drop unwanted columns
df4.drop(['variable'], axis=1, inplace=True)

# Step 7: Reset the index and drop the old index
df4.reset_index(drop=True, inplace=True)

df4

You can use str.findall and assign the output back to the column, then explode the column, and finally use reset_index with drop=True to get a unique index:
df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")

df4 = df2.explode('text').reset_index(drop=True)
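The question also mentions str.extractall(). The following is a minimal sketch of how its result could be joined back to the studyid and Question columns, assuming the sample data from the question. The names extracted and out are only illustrative, and the output column is called value here by choice.

```python
import pandas as pd

df2 = pd.DataFrame({
    "studyid": ["101", "101", "102", "103"],
    "Question": ["Q1", "Q2", "Q1", "Q3"],
    "text": [
        "I love [Bananas] and also [oranges], and [figs]",
        "Yesterday I ate [Apples]",
        "[Grapes] are my favorite fruit",
        "[Mandarins] taste like [oranges] to me",
    ],
})

# extractall returns one row per match, indexed by (original row, match number).
# Column 0 holds the single unnamed capture group.
extracted = (
    df2["text"]
    .str.extractall(r"\[([^]]+)\]")[0]
    .rename("value")
    .reset_index(level=1, drop=True)  # drop the match-number level, keep the original row index
)

# Joining on the original row index repeats studyid/Question once per match.
out = df2[["studyid", "Question"]].join(extracted).reset_index(drop=True)
print(out)
```

Because the join is on the original row index, no merge keys need to be specified.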


You can "compress" your code into a single instruction:
df2[['studyid', 'Question']].join(df2['text'].str.findall(
    r'\[([^]]+)\]').explode().rename('value'))
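One edge case the sample data does not cover is a row with no bracketed text at all. The sketch below uses a small made-up frame (the "104" row is hypothetical) to show that such a row survives the join with a missing value, and that a dropna call can remove it if that is not wanted.

```python
import pandas as pd

# Hypothetical data: the second row contains no square brackets.
df = pd.DataFrame({
    "studyid": ["101", "104"],
    "Question": ["Q1", "Q2"],
    "text": ["I love [Bananas] and [figs]", "No fruit mentioned here"],
})

# findall gives a list per row; explode turns an empty list into NaN.
matches = df["text"].str.findall(r"\[([^]]+)\]").explode().rename("value")
joined = df[["studyid", "Question"]].join(matches)

# Drop rows without any extracted text and rebuild a unique index.
result = joined.dropna(subset=["value"]).reset_index(drop=True)
print(result)
```

Whether to keep or drop the rows without matches depends on what the downstream analysis expects.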

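Even the regex can be simplified: no lookbehind/lookahead is needed.
Just put the two brackets before and after the capturing group.
If needed, save this result under a variable (e.g. df4 = ...).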
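As a quick check of the regex simplification, this sketch compares the two patterns with the standard re module on one of the sample sentences. The variable names are only illustrative.

```python
import re

text = "I love [Bananas] and also [oranges], and [figs]"

# Pattern from the question: lookbehind and lookahead around the capturing group.
with_lookarounds = re.findall(r"(?<=\[)([^]]+)(?=\])", text)

# Simplified pattern: literal brackets placed directly around the capturing group.
plain_brackets = re.findall(r"\[([^]]+)\]", text)

print(with_lookarounds)
print(plain_brackets)
```

Both calls return only the contents of the capturing group, so the brackets themselves never appear in the result.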
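Note: in your solution, the last column of the final result (df4) is named value, so I repeated that name in my solution.
If you want a different name, replace "value" in my solution with a name of your choice.

Thank you very much, jezrael, this is fantastic! The str.extractall() solution is exactly what I was looking for when I first worked on this. I knew there would be a simple solution!

Fantastic! Thanks, Valdi_Bo, for another great solution!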
Output of the findall/explode solution:
print(df4)
  studyid Question       text
0     101       Q1    Bananas
1     101       Q1    oranges
2     101       Q1       figs
3     101       Q2     Apples
4     102       Q1     Grapes
5     103       Q3  Mandarins
6     103       Q3    oranges
