Python: detect header delimiters in an imported CSV file using pandas read_csv
Here we go again. Hello, I am trying to detect errors in a CSV file. The file should look like this:
goodfile.csv
"COL_A","COL_B","COL_C","COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
But my file actually looks like this:
brokenfile.csv
"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
When I import both files with pandas
data = pd.read_csv('goodfile.csv')
data = pd.read_csv('brokenfile.csv')
I get the same result:
data
COL_A COL_B COL_C COL_D
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
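Anyway, what I want is to detect the error in the second file, "brokenfile.csv", whose header is currently missing the double quotes (") around one of the column names.

pandas tries to be smart about identifying data types when it reads data. That is exactly what happens in the case you describe: "COL_C" and COL C are both parsed as strings.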
In short, there is no error to detect! At least, pandas does not raise an error in this case.
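To illustrate the point, here is a minimal sketch showing that both variants load without any exception and differ only in the name of the third column. The strings good_text and broken_text are shortened stand-ins for the two files and are assumptions made for this example.

```python
import io

import pandas as pd

# Shortened stand-ins for goodfile.csv and brokenfile.csv (assumed contents).
good_text = (
    '"COL_A","COL_B","COL_C","COL_D"\n'
    '"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"\n'
)
broken_text = (
    '"COL_A","COL_B",COL C,"COL_D"\n'
    '"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"\n'
)

# With the default quoting, pandas strips the quotes and raises no error.
good = pd.read_csv(io.StringIO(good_text))
broken = pd.read_csv(io.StringIO(broken_text))

good_cols = list(good.columns)
broken_cols = list(broken.columns)
print(good_cols)
print(broken_cols)
```

Both calls succeed, so any check for missing quotes has to look at the raw header text rather than at the parsed result.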
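What you can do, if you want to detect missing quotes in the header, is read the first line in the more "traditional" Python way and draw your own conclusions from it: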
>>> with open('filename') as f:
...     lines = f.readlines()
....
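One way to finish that idea, as a rough sketch only: read just the first line, split it on the delimiter, and report every field that is not wrapped in double quotes. The helper names unquoted_header_fields and check_header are hypothetical, and the simple split assumes that no column name contains the delimiter itself.

```python
def unquoted_header_fields(header_line, delimiter=",", quotechar='"'):
    """Return the header fields that are not wrapped in quotechar.

    Hypothetical helper; assumes no column name contains the delimiter.
    """
    fields = header_line.rstrip("\r\n").split(delimiter)
    return [
        field
        for field in fields
        if not (
            len(field) >= 2
            and field[0] == quotechar
            and field[-1] == quotechar
        )
    ]


def check_header(path):
    """Read only the first line of the file at path and check its fields."""
    with open(path, newline="") as f:
        return unquoted_header_fields(f.readline())


# Demonstration on the two header lines from the question.
good_header = '"COL_A","COL_B","COL_C","COL_D"\n'
broken_header = '"COL_A","COL_B",COL C,"COL_D"\n'
print(unquoted_header_fields(good_header))
print(unquoted_header_fields(broken_header))
```

Calling check_header('brokenfile.csv') would apply the same check to a file on disk; a non-empty result is the signal to warn the user.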
I think you can detect a missing " in the DataFrame's column names by using str.contains and inverting the resulting boolean array with ~:
import pandas as pd
import io
temp=u'''"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
# quoting=3 is csv.QUOTE_NONE, so the quote characters are kept
df = pd.read_csv(io.StringIO(temp), quoting = 3)
print(df)
"COL_A" "COL_B" COL C "COL_D"
0 "ROW1COLA" "ROW1COLB" "ROW1COLC" "ROW1COLD"
1 "ROW2COLA" "ROW2COLB" "ROW2COLC" "ROW2COLD"
2 "ROW3COLA" "ROW3COLB" "ROW3COLC" "ROW3COLD"
3 "ROW4COLA" "ROW4COLB" "ROW4COLC" "ROW4COLD"
4 "ROW5COLA" "ROW5COLB" "ROW5COLC" "ROW5COLD"
5 "ROW6COLA" "ROW6COLB" "ROW6COLC" "ROW6COLD"
6 "ROW7COLA" "ROW7COLB" "ROW7COLC" "ROW7COLD"
print(df.columns)
Index([u'"COL_A"', u'"COL_B"', u'COL C', u'"COL_D"'], dtype='object')
print(df.columns.str.contains('"'))
[ True True False True]
print(~df.columns.str.contains('"'))
[False False True False]
print(df.columns[~df.columns.str.contains('"')])
Index([u'COL C'], dtype='object')
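The steps above could be wrapped into a small helper that warns the user, roughly as sketched here. The function name find_unquoted_columns and the warning text are hypothetical; the sketch assumes that nrows=0 is acceptable for reading only the header row and uses csv.QUOTE_NONE, which is the named constant for quoting=3.

```python
import csv
import io
import warnings

import pandas as pd


def find_unquoted_columns(source):
    """Return header names that contain no double quote (hypothetical helper)."""
    # nrows=0 parses only the header; QUOTE_NONE keeps the quote characters.
    header = pd.read_csv(source, quoting=csv.QUOTE_NONE, nrows=0)
    has_quote = header.columns.str.contains('"', regex=False)
    return list(header.columns[~has_quote])


broken_text = (
    '"COL_A","COL_B",COL C,"COL_D"\n'
    '"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"\n'
)

missing = find_unquoted_columns(io.StringIO(broken_text))
if missing:
    # Warn instead of failing, since the file is still valid CSV.
    warnings.warn("Header fields without quotes: %s" % ", ".join(missing))
```

After this check, the file can be read again with the default quoting to get the usual DataFrame with the quotes stripped.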
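Yes, I know pandas is really smart. But in the program I am writing, I want to warn the user about the missing " in the header.

Well, as far as pandas is concerned, there is no error here. I don't see how you could produce any kind of error or warning in a case like this.

See the updated answer.

You can pass quoting=3 to read_csv so that pandas will not strip these characters, but both files are valid CSV files. I don't think pandas has any reason to raise an error.

Is your goal only to detect whether the header is missing double quotes? What about single quotes? Are there any other "errors" that should be detected?

Also, you should not "get the same result": one DataFrame should have COL_C and the other COL C.

Wow, that's great. I have a lot to learn before I am at your level. Thank you.

Yes, you have saved me twice in one day. Thank you very much. Where can I learn more about this topic? I just started as a freelancer, and I used to think I knew Python programming, but I have found that I still have a lot to learn. Sorry if my English is not very good. Greetings from Venezuela!

Greetings from Slovakia. :) I think the docs are excellent, with lots of examples. And in my experience, the best way to learn is to try answering questions.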