How do I get the words to the left and right of an underscore from a string in another column? (Python)


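I have concatenated all the .csv files in a directory into one large DataFrame, in which one column holds the file name of each file.

Here are my file names: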

['Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt',
 'Accelerometer-2011-05-30-08-35-11-brush_teeth-f1.txt',
 'Accelerometer-2011-06-02-10-45-50-brush_teeth-f1.txt',
 'Accelerometer-2011-06-02-10-42-22-brush_teeth-f1.txt',
 'Accelerometer-2011-05-31-15-16-47-brush_teeth-f1.txt',
 'Accelerometer-2011-05-30-21-55-04-brush_teeth-m2.txt',
 'Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt',
etc...]

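I want to create another column named ['Action'] that extracts the words to the left and right of the underscore in the file name. In this case, that would be "brush_teeth".

How would I do this in Python?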

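Depending on the file size, I would suggest doing all of the wrangling in plain Python before loading the data into pandas, since that will be faster. Just my two cents. Here is one way to solve the problem; I believe the options in the comments will solve it as well. The index is -2 because of where the action sits in the file name.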

import pandas as pd

text = ['Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt',
        'Accelerometer-2011-05-30-08-35-11-brush_teeth-f1.txt',
        'Accelerometer-2011-06-02-10-45-50-brush_teeth-f1.txt',
        'Accelerometer-2011-06-02-10-42-22-brush_teeth-f1.txt',
        'Accelerometer-2011-05-31-15-16-47-brush_teeth-f1.txt',
        'Accelerometer-2011-05-30-21-55-04-brush_teeth-m2.txt',
        'Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt',]

# Split each name on '-' and take the second-to-last field as the action.
(pd.DataFrame(text)
 .assign(Action=lambda x: x[0].str.split('-').str[-2]))

       0                                                Action
0   Accelerometer-2011-05-30-09-36-50-brush_teeth-...   brush_teeth
1   Accelerometer-2011-05-30-08-35-11-brush_teeth-...   brush_teeth
2   Accelerometer-2011-06-02-10-45-50-brush_teeth-...   brush_teeth
3   Accelerometer-2011-06-02-10-42-22-brush_teeth-...   brush_teeth
4   Accelerometer-2011-05-31-15-16-47-brush_teeth-...   brush_teeth
5   Accelerometer-2011-05-30-21-55-04-brush_teeth-...   brush_teeth
6   Accelerometer-2011-04-11-13-28-18-brush_teeth-...   brush_teeth
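
The answer above recommends doing the wrangling in plain Python before loading into pandas. A minimal sketch of that idea follows, assuming every name ends in "-<action>-<subject>.txt"; the helper name `action_from_name` and the record layout are made up for illustration and are not part of the original answer.

```python
# Hypothetical helper: assumes every name ends in "-<action>-<subject>.txt".
def action_from_name(file_name):
    # The action is the second-to-last hyphen-separated field.
    return file_name.split("-")[-2]


file_names = [
    "Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt",
    "Accelerometer-2011-05-30-21-55-04-brush_teeth-m2.txt",
]

# Build plain records first, before any DataFrame exists.
records = [{"file_name": n, "Action": action_from_name(n)} for n in file_names]
print(records[0]["Action"])  # brush_teeth
```

A list of dicts like `records` can then be passed to `pd.DataFrame(records)` if a DataFrame is still needed afterwards.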

I agree with the comments. Depending on how stable the naming pattern is, you may not need a regex at all. You could solve it like this:

mylist = ['Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt',
 'Accelerometer-2011-05-30-08-35-11-brush_teeth-f1.txt',
 'Accelerometer-2011-06-02-10-45-50-wash_face-f1.txt',
 'Accelerometer-2011-06-02-10-42-22-brush_hair-f1.txt',
 'Accelerometer-2011-05-31-15-16-47-wash_hair-f1.txt',
 'Accelerometer-2011-05-30-21-55-04-iron_clothes-m2.txt',
 'Accelerometer-2011-04-11-13-28-18-make_bed-f1.txt']

output = []
for i in mylist:
    result = i.split("-")
    for z in result:
        if "_" in z:
            output.append(z)

print(output)
>>> ['brush_teeth', 'brush_teeth', 'wash_face', 'brush_hair', 'wash_hair', 'iron_clothes', 'make_bed']
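
One thing to watch with the loop above: it appends every field that contains an underscore, so a name with zero or several such fields would leave `output` misaligned with the input list. A possible variant that always yields exactly one value per name is sketched below; the function name `first_underscore_field` and the second file name are made up for illustration.

```python
def first_underscore_field(file_name):
    # Return the first hyphen-separated field containing "_", or None if there is none.
    return next((part for part in file_name.split("-") if "_" in part), None)


names = [
    "Accelerometer-2011-06-02-10-45-50-wash_face-f1.txt",
    "Accelerometer-2011-06-02-10-42-22-f1.txt",  # made-up name with no action field
]

actions = [first_underscore_field(n) for n in names]
print(actions)  # ['wash_face', None]
```

Because the result has the same length as the input, it can be assigned directly as a new column.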

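We can use a regular expression with a positive lookahead to extract the run of word characters (no dashes) that comes right before the final "-xx" field and the file extension: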

df['action'] = df['file_name'].str.extract(r'(\w+(?=\s*-\w+\.[^\.]))', expand=False)
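
To see what the lookahead does without a DataFrame, the same pattern can be tried with the standard `re` module on a single name. This is only a sketch for checking the pattern; the variable names are arbitrary.

```python
import re

# Same pattern as above: a run of word characters that is followed by
# "-<word>.<char>", i.e. the field just before the final "-f1.txt" part.
pattern = re.compile(r"(\w+(?=\s*-\w+\.[^\.]))")

name = "Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt"
match = pattern.search(name)
print(match.group(1))  # brush_teeth
```

The lookahead only checks what follows the match and does not consume it, so "-f1.txt" is not part of the captured value.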


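Do all of your files follow the same naming pattern, "a-b-c-action_name-x-y-z.txt"?
@Guimoute Yes, they all follow the same naming pattern!
Then use split('-')[3], or whatever index you need.
You could use the regex \W(\w+_\w+)\W
This solution works as long as there is always the same number of hyphens after the matched phrase. It fails if the number of hyphens occasionally varies. I like the elegance of the lambda function.
@Matt, thanks for the feedback. OP said the files all follow the same pattern, hence this solution.
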
print(df)
                                           file_name       action
0  Accelerometer-2011-05-30-09-36-50-brush_teeth-...  brush_teeth
1  Accelerometer-2011-05-30-08-35-11-brush_teeth-...  brush_teeth
2  Accelerometer-2011-06-02-10-45-50-brush_teeth-...  brush_teeth
3  Accelerometer-2011-06-02-10-42-22-brush_teeth-...  brush_teeth
4  Accelerometer-2011-05-31-15-16-47-brush_teeth-...  brush_teeth
5  Accelerometer-2011-05-30-21-55-04-brush_teeth-...  brush_teeth
6  Accelerometer-2011-04-11-13-28-18-brush_teeth-...  brush_teeth
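
The comments point out that a positional split fails when the number of hyphens varies, and they appear to suggest a pattern along the lines of `\W(\w+_\w+)\W` instead. A small comparison of the two approaches is sketched below; the second file name is made up to show the difference, and the pattern assumes the action is the only field containing an underscore.

```python
import re

# A token containing an underscore, delimited on both sides by
# non-word characters (here, hyphens).
token = re.compile(r"\W(\w+_\w+)\W")

regular = "Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt"
irregular = "Accelerometer-2011-05-30-09-36-50-brush_teeth-f1-extra.txt"  # made-up name

for name in (regular, irregular):
    by_regex = token.search(name).group(1)
    by_position = name.split("-")[-2]
    print(by_regex, by_position)
# prints:
# brush_teeth brush_teeth
# brush_teeth f1
```

If the names really are uniform, as the asker says, the positional split is simpler; the regex only pays off when extra fields may appear.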