Skip a column if it does not exist when creating a df with pandas usecols

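I am using pandas to load thousands of CSV files. However, I am only interested in a few columns, and those columns may not be present in every CSV.

If one of the specified column names does not exist in a CSV, the usecols parameter does not seem to work. What is the best way to handle this? Thanks.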

import pandas as pd

nrFiles = 0  # counter was never initialised in the original snippet
for fullPath in listFilenamesPath:
    # raises ValueError when any listed column is missing from the file
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

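You can read the entire CSV without usecols. That lets you check which columns the DataFrame has. If the DataFrame lacks the columns you need, you can ignore it or handle it as required.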
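
With thousands of files, it may be cheaper to inspect only the header before loading any data. The following is a minimal sketch of that idea rather than the answerer's exact method: `read_wanted_columns` is a hypothetical helper, the in-memory sample stands in for a real file, and it assumes that `nrows=0` makes `read_csv` parse just the header row.

```python
import io
import pandas as pd

WANTED = ["name", "hostname", "application family"]  # columns from the question

def read_wanted_columns(source, wanted=WANTED, sep=";"):
    """Hypothetical helper: read only the wanted columns that exist in source."""
    # nrows=0 parses just the header line, so no data rows are loaded here.
    header = pd.read_csv(source, sep=sep, nrows=0).columns
    present = [col for col in wanted if col in header]
    if hasattr(source, "seek"):
        source.seek(0)  # rewind in-memory buffers before the second read
    if not present:
        return None  # none of the wanted columns exist: the caller can skip the file
    return pd.read_csv(source, sep=sep, usecols=present)

# Sample data (an assumption for illustration): 'application family' is absent.
sample = io.StringIO("name;os;hostname\nweb;linux;srv01\ndb;linux;srv02\n")
df = read_wanted_columns(sample)
print(list(df.columns))
```

In the question's loop, `fullPath` would be passed as `source`, and a `None` result would mean the file has none of the wanted columns.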

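One workaround is to take the column names that appear in both the usecols list (the columns you are looking for) and df.columns. You can then use this list of common column names to subset df.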

The code below includes the necessary comments:

import pandas as pd

# the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    # read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    # column names present in both usecols and df.columns (a set does not keep order)
    final_list = list(set(usecols) & set(df.columns))
    # subset the dataframe using final_list
    df = df[final_list]
    # write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

Demo: here is a CSV loaded into df:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
I am looking for the following columns:

usecols = ['A', 'D', 'B']
I read the whole CSV, take the columns common to df and the list I am looking for (A and B in this case), and subset as follows:

df = pd.read_csv('test1.csv')
# intersect with usecols, the list defined above
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output (the column order can differ between runs, because a set has no defined order):

   B  A
0  4  1
1  5  2
2  6  3
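
If the order of the requested columns matters, a list comprehension can stand in for the set intersection. This is a small variation on the demo above, not part of the original answer; the inline CSV text is an assumption that mirrors the A/B/C table shown earlier.

```python
import io
import pandas as pd

usecols = ["A", "D", "B"]

# Same data as the demo table, supplied in memory instead of test1.csv.
df = pd.read_csv(io.StringIO("A,B,C\n1,4,7\n2,5,8\n3,6,9\n"))

# Walk usecols in order and keep only the names that exist in df;
# unlike a set intersection, this keeps the order given in usecols.
final_list = [col for col in usecols if col in df.columns]
df = df[final_list]
print(df)
```

Here the missing column D is dropped, and A comes before B because that is their order in `usecols`.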

read_csv throws a ValueError if a column specified in the usecols parameter cannot be found. I think you could use a try/except block and skip the files that raise the error.

for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # skip this file; with a bare pass, df would be unset or left over from the previous file
        continue
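Alternatively, catch the error, try to parse the offending column names out of it, and retry with a subset. There may be a cleaner way to do this.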

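import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = list(usecols)  # work on a fresh copy for each file
    df = None
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the missing names are parsed from the error text, whose format may vary between pandas versions
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # not an error message this loop knows how to parse
            # strip quotes and outer spaces only, so names like 'application family' stay intact
            missing_cols = [c.strip().strip("'\"") for c in r.group(1).split(",")]
            remaining = [x for x in usecols_ if x not in missing_cols]
            if len(remaining) == len(usecols_):
                raise  # nothing was removed; re-raise instead of looping forever
            usecols_ = remaining
    if df is None:
        continue  # none of the wanted columns exist in this file

    """
        rest of your code
    """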

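One possibly cleaner route avoids exception parsing altogether: `usecols` can also be given a callable, which pandas evaluates against each column name and uses to keep the names for which it returns True. The sketch below assumes a pandas version that supports a callable `usecols`; `read_existing` and the sample data are hypothetical names introduced for illustration.

```python
import io
import pandas as pd

wanted = {"name", "hostname", "application family"}

def read_existing(source, sep=";"):
    """Hypothetical helper: load whichever wanted columns the file has."""
    # The callable is evaluated per column name, so a wanted column that is
    # absent from the file is simply never selected and no ValueError is raised.
    return pd.read_csv(source, sep=sep, usecols=lambda col: col in wanted)

# Sample data (an assumption): 'application family' is missing, 'os' is unwanted.
sample = io.StringIO("hostname;os;name\nsrv01;linux;web\n")
df = read_existing(sample)
print(list(df.columns))
```

The resulting columns follow the order in the file, not the order in `wanted`, so reorder afterwards if a fixed layout is needed.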