Python: How to extract labels from file names in pandas

python pandas

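I have some files in a training folder, named train0947.txt, train038.txt, test0498.txt and test032.txt. I would like to get the output shown below. I have done something similar before, but those file names had a clear delimiter, such as train_0947.txt, which made them easy to split. Without a delimiter I am not sure how to split them. Can someone help me produce this output?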

file           class
train0947.txt  train
train038.txt   train
test0498.txt   test
test032.txt    test

My code:

from glob import glob

import pandas as pd
from tqdm import tqdm

em = glob("/home/xx/PycharmProjects/Dat/main/training/*.txt")
train_em = []
train_class = []
for i in tqdm(range(len(em))):
    # creating the image name
    train_em.append(em[i].split('/')[7])
    # creating the class of image
    train_class.append(em[i].split('/')[7].split('')[0])  # <-- which split should I use here?

# storing the images and their class in a dataframe
train_data = pd.DataFrame()
train_data['em'] = train_em
train_data['class'] = train_class

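Let's try extracting all the non-digits that are immediately followed by a digit.

In short: v(?=w), also known as a lookahead assertion, matches v only if it is followed by w.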

The code:
df['class'] = df['file'].str.extract(r'(\D+(?=\d))', expand=False)
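
As a quick check of the lookahead approach, here is a minimal sketch that runs it on the four file names from the question. The DataFrame is built by hand for illustration; `expand=False` is used so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample file names taken from the question.
df = pd.DataFrame(
    {"file": ["train0947.txt", "train038.txt", "test0498.txt", "test032.txt"]}
)

# \D+ matches a run of non-digits; (?=\d) requires that a digit follows,
# without making that digit part of the match.
df["class"] = df["file"].str.extract(r"(\D+(?=\d))", expand=False)

print(df)
```

Note that a file name containing no digit before the extension would get NaN in the class column, since the pattern finds nothing to match.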

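You can use a regular expression to filter the files of interest and extract their type (test or train):
import os
import re
import pandas as pd

filename_regex = re.compile(r'^(train|test)(?:\d+\.txt$)')
root_folder = "/home/xx/PycharmProjects/Dat/main/training"
train_em, train_class = [], []
for filename in next(os.walk(root_folder))[2]:
    m = filename_regex.match(filename)
    if m is None:
        continue
    # creating the image name
    train_em.append(filename)
    # creating the class of image
    train_class.append(m.groups()[0])
# storing the images and their class in a dataframe
train_data = pd.DataFrame()
train_data['em'] = train_em
train_data['class'] = train_class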
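
To see what the filter keeps and what it skips without touching the file system, here is a small sketch that applies what appears to be the intended pattern to a hand-written list. The names in `names` are made up for illustration, apart from the two taken from the question.

```python
import re

# Pattern as this answer appears to intend it: "train" or "test",
# followed only by digits and the .txt extension.
filename_regex = re.compile(r"^(train|test)(?:\d+\.txt$)")

# Hypothetical directory listing; the last two entries should be skipped.
names = ["train0947.txt", "test032.txt", "notes.txt", "train_12.txt"]

labels = {}
for name in names:
    m = filename_regex.match(name)
    if m is None:
        continue  # not a train/test file in the expected format
    labels[name] = m.group(1)

print(labels)
```

Because the pattern is anchored at both ends, a name such as train_12.txt is rejected rather than being labelled train.
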
Assuming all of your files are either train or test:
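import glob
import os


def classify_file(directory):
    files = {}
    for file in glob.glob(os.path.join(directory, "*.txt")):
        # Check the base name only, so a folder such as "training" does not match.
        name = os.path.basename(file)
        if "train" in name:
            files[file] = "train"
        elif "test" in name:
            files[file] = "test"
        else:
            files[file] = "Other"

    return files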
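
The function above returns a dict, whereas the question asks for a two-column table. One possible way to bridge the two is sketched below. The `files` dict stands in for a result of `classify_file()` and its paths are invented for illustration; the column names follow the desired output in the question.

```python
import os

import pandas as pd

# Hypothetical result of classify_file(); these paths are made up.
files = {
    "/data/training/train0947.txt": "train",
    "/data/training/test0498.txt": "test",
}

# Keep only the base name in the "file" column, as in the desired output.
train_data = pd.DataFrame(
    {
        "file": [os.path.basename(path) for path in files],
        "class": list(files.values()),
    }
)

print(train_data)
```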

I get an error saying StopIteration at the for loop.
That means root_folder does not contain any files: have you checked that the path is correct?
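
For what it's worth, `os.walk` yields nothing at all when the path does not exist or is not a directory, and that is when `next()` raises StopIteration. A minimal sketch of one way to guard against it is below; `list_files` is a hypothetical helper name, not part of the answer above.

```python
import os
import tempfile


def list_files(root_folder):
    """Return the file names directly inside root_folder, or [] if it cannot be walked."""
    # Passing a default to next() avoids StopIteration when os.walk yields nothing.
    _, _, filenames = next(os.walk(root_folder), (root_folder, [], []))
    return filenames


# A path that should not exist: returns an empty list instead of raising.
missing = list_files(os.path.join(tempfile.gettempdir(), "no_such_folder_for_demo_12345"))
print(missing)
```

An explicit `os.path.isdir(root_folder)` check before the loop would work as well, and would make a mistyped path easier to spot.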