Python: normalizing a large number of column names and optimizing the code
I am writing a script to process CSV files that I can reuse. At the moment I use the following code to normalize the column names in a CSV file, so that every file ends up with the same set of names:
import pandas as pd

df = pd.read_csv('Crokis.csv', index_col=0, encoding="ISO-8859-1", low_memory=False)
genCol = ['Genus', 'genus', 'ngenus', 'genera']
df.rename(columns={typo: 'Genus' for typo in genCol}, inplace=True)
spCol=['species', 'sp', 'Species']
df.rename(columns={typo: 'species' for typo in spCol}, inplace=True)
chromCol=['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']
df.rename(columns={typo: 'chromosome' for typo in chromCol}, inplace=True)
del chromCol, spCol, genCol
It works fine, but there are two problems.
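You can use a regex, or something similar, to handle the different variants of 'genus'. Here is an example that replaces any occurrence of 'genus.*'. Because the match is case-insensitive, it will match and replace, for example, 'genUS' and 'genUS_666'.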
import pandas as pd
import re
df = pd.read_csv('Crokis.csv', index_col=0, encoding="ISO-8859-1", low_memory=False)
# rename the 'genus' column
f = lambda x: re.sub('genus.*', 'genus', x, flags=re.IGNORECASE)
df.rename(columns=f, inplace=True)
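Note that `re.sub` rewrites only the matched part of a name, so 'ngenus' would keep its leading 'n' with the pattern above. The same idea can be extended to the other columns. The following is a minimal sketch, not part of the original answer: the `PATTERNS` table and the `normalize_column` helper are hypothetical names, the patterns are guesses built from the aliases listed in the question, and a small in-memory frame stands in for the CSV file. It uses `fullmatch` so that unrelated names are left alone.

```python
import re
import pandas as pd

# Hypothetical pattern table: canonical name -> regex covering its known variants.
PATTERNS = {
    'Genus': re.compile(r'n?genus|genera', re.IGNORECASE),
    'species': re.compile(r'sp|species', re.IGNORECASE),
    'chromosome': re.compile(r'chromosome( count)?|cytology|2n', re.IGNORECASE),
}

def normalize_column(name):
    """Return the canonical name for a column, or the name unchanged."""
    stripped = name.strip()
    for canonical, pattern in PATTERNS.items():
        # fullmatch requires the whole name to match, not just a prefix
        if pattern.fullmatch(stripped):
            return canonical
    return name

# Small in-memory frame standing in for the CSV file.
df = pd.DataFrame(columns=['GENUS', 'Sp', 'Chromosome count', 'locality'])
df = df.rename(columns=normalize_column)
print(list(df.columns))
```

Passing a function to `rename(columns=...)` applies it to every column label, so names that match no pattern pass through untouched.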
I would handle this in the following way:
# use a single dict to hold the mapping
name_map = {'Genus': ['Genus', 'genus', 'ngenus', 'genera'],
            'species': ['species', 'sp', 'Species'],
            'chromosome': ['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']}
col_translate = {}
for c in df.columns:
    for canonical_name, alias_names in name_map.items():
        for alias_name in alias_names:
            if c.lower() == alias_name.lower():
                col_translate[c] = canonical_name
            # if you want to check prefix or suffix...
            elif c.startswith(alias_name) or c.endswith(alias_name):
                col_translate[c] = canonical_name
            # ... any additional, more complicated test
            ...
# apply the collected translations in one call
df.rename(columns=col_translate, inplace=True)
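This is more flexible in cases where re might be considered too difficult.
What bothers me about your solution is that nested for loops tend to slow the code down.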
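With only a handful of columns the loops are unlikely to matter much in practice, but if they are a concern, the alias lists can be inverted once into a flat lookup so that each column needs a single dictionary access. A minimal sketch, assuming exact case-insensitive matches are enough; `alias_to_canonical` and `canonical_name` are hypothetical names introduced here, and a small in-memory frame stands in for the CSV file:

```python
import pandas as pd

name_map = {
    'Genus': ['Genus', 'genus', 'ngenus', 'genera'],
    'species': ['species', 'sp', 'Species'],
    'chromosome': ['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome'],
}

# Invert the mapping once: lower-cased alias -> canonical name.
alias_to_canonical = {
    alias.lower(): canonical
    for canonical, aliases in name_map.items()
    for alias in aliases
}

def canonical_name(col):
    """Look up a column name case-insensitively; unknown names pass through."""
    return alias_to_canonical.get(col.lower(), col)

# Small in-memory frame standing in for the CSV file.
df = pd.DataFrame(columns=['GENERA', 'sp', 'Cytology', 'notes'])
df = df.rename(columns=canonical_name)
print(list(df.columns))
```

The trade-off is that prefix or suffix tests no longer fit this form; those would still need a loop or a regex.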