Skip a column if it does not exist when creating a df with pandas usecols
I am using pandas to load thousands of CSVs. However, I am only interested in some columns, and these may not be present in every CSV. The usecols parameter does not seem to work when one of the specified column names is missing from a CSV. What is the best workaround? Thanks.
import pandas as pd

nrFiles = 0
for fullPath in listFilenamesPath:
    # Raises ValueError when any of the listed columns is missing from the file
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
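You can read the whole CSV without usecols. That lets you check which columns the DataFrame actually has. If the DataFrame lacks the columns you need, you can skip it or handle it however you see fit.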
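As a rough sketch of the "handle it however you see fit" option: after reading the whole file, you could record which wanted columns are missing and add them back as empty (NaN) columns so every file ends up with the same layout. The helper name `load_with_all_columns` and the sample data are made up for illustration; only `pd.read_csv` and `DataFrame.reindex` come from pandas.

```python
import io

import pandas as pd

# Columns from the question; everything else here is illustrative.
WANTED = ["name", "hostname", "application family"]


def load_with_all_columns(source, wanted=WANTED, sep=";"):
    """Read the whole CSV, then return only `wanted` plus the missing names."""
    df = pd.read_csv(source, sep=sep)
    missing = [col for col in wanted if col not in df.columns]
    # reindex keeps the requested order and creates NaN columns
    # for any name that is absent from the file.
    return df.reindex(columns=wanted), missing


# Hypothetical file content that lacks 'application family'.
sample = io.StringIO("name;hostname;extra\nweb;srv01;x\ndb;srv02;y\n")
df, missing = load_with_all_columns(sample)
print(missing)
print(list(df.columns))
```

Whether to fill, skip, or log a file with missing columns depends on what the downstream code expects, so treat the NaN filling here as one possible choice.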
One workaround is to take the column names that appear both in the usecols list (the columns you are looking for) and in df.columns. You can then subset df with this list of common column names.
The code is commented where needed:
import pandas as pd

# the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']
nrFiles = 0
for fullPath in listFilenamesPath:
    # read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    # get the column names that appear in both the usecols list and df.columns
    final_list = list(set(usecols) & set(df.columns))
    # subset it using the final_list
    df = df[final_list]
    # write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
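Demo: here is a CSV that loads into the following df:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
I am looking for the following columns:
usecols = ['A', 'D', 'B']
I read the whole file, take the columns common to df and the list I am looking for (here A and B), and subset df like this:
df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output (the column order may vary, because sets are unordered):
   B  A
0  4  1
1  5  2
2  6  3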
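If reading every file in full is too slow, or the column order matters, a variation on the same idea is to read only the header first and then pass the columns that are actually present to usecols. The sketch below assumes a semicolon-separated file; the helper name `read_present_columns` and the temporary test file are hypothetical.

```python
import os
import tempfile

import pandas as pd


def read_present_columns(path, wanted, sep=";"):
    """Load only the columns from `wanted` that exist, in the order of `wanted`."""
    # nrows=0 parses just the header row, so this first pass is cheap.
    header = pd.read_csv(path, sep=sep, nrows=0).columns
    present = [col for col in wanted if col in header]
    if not present:
        return pd.DataFrame()  # none of the wanted columns exist
    df = pd.read_csv(path, sep=sep, usecols=present)
    # read_csv returns columns in file order; reorder to match `wanted`.
    return df[present]


# Build a small throwaway CSV mirroring the demo above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test1.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("A;B;C\n1;4;7\n2;5;8\n3;6;9\n")
    demo = read_present_columns(path, ["B", "D", "A"])

print(demo)
```

The list comprehension replaces the set intersection so that the result keeps a predictable column order.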
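read_csv raises a ValueError if a column specified in the usecols parameter is not found. I think you can use a try/except block and skip the files that raise the error:
for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        continue  # skip this file and move on to the next one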
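Or catch the error, try to parse the offending column names out of the message, and retry with the remaining subset. There may be a cleaner way to do this.
import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = list(usecols)
    df = None
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # The message lists the missing names in brackets, e.g. ['a', 'b']
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # some other ValueError; do not swallow it
            # Strip only quotes and surrounding spaces, so names that contain
            # spaces (such as 'application family') still match
            missing_cols = [c.strip(" '\"") for c in r.group(1).split(",")]
            remaining = [x for x in usecols_ if x not in missing_cols]
            if remaining == usecols_:
                raise  # nothing was removed; avoid looping forever
            usecols_ = remaining
    if df is None:
        continue  # none of the requested columns exist in this file
    # rest of your code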
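One candidate for the cleaner way: usecols also accepts a callable, which pandas evaluates against each column name and which keeps the columns where it returns True. Because the callable only filters names that are actually in the file, a missing column never triggers the ValueError. The sketch below uses in-memory sample data and a helper name (`read_wanted`) chosen for illustration.

```python
import io

import pandas as pd

# Columns from the question; a set makes the membership test cheap.
WANTED = {"name", "hostname", "application family"}


def read_wanted(source, wanted=WANTED, sep=";"):
    """Read only the columns whose names are in `wanted`, ignoring absent ones."""
    # The callable is called once per column name found in the file.
    return pd.read_csv(source, sep=sep, usecols=lambda col: col in wanted)


# Hypothetical files: one has every wanted column, the other only 'name'.
full = io.StringIO("name;hostname;application family;other\na;h1;web;1\n")
partial = io.StringIO("name;other\nb;2\n")

full_df = read_wanted(full)
partial_df = read_wanted(partial)
print(list(full_df.columns))
print(list(partial_df.columns))
```

This avoids both the second read of the file and any parsing of the error message, at the cost of silently ignoring missing columns, so check `df.columns` afterwards if you need to know which ones were absent.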