Python: Detecting header delimiters when importing a CSV file with pandas read_csv

Here we go again.

Hello, I'm trying to detect errors in a CSV file.

The file should look like this:

    goodfile.csv

    "COL_A","COL_B","COL_C","COL_D"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
But my file actually is:

    brokenfile.csv

    "COL_A","COL_B",COL C,"COL_D"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
When I import the two files with pandas:

    data = pd.read_csv('goodfile.csv')
    data = pd.read_csv('brokenfile.csv')
I get the same result:

    data

          COL_A     COL_B     COL_C     COL_D
    0  ROW1COLA  ROW1COLB  ROW1COLC  ROW1COLD
    1  ROW2COLA  ROW2COLB  ROW2COLC  ROW2COLD
    2  ROW3COLA  ROW3COLB  ROW3COLC  ROW3COLD
    3  ROW4COLA  ROW4COLB  ROW4COLC  ROW4COLD
    4  ROW5COLA  ROW5COLB  ROW5COLC  ROW5COLD
    5  ROW6COLA  ROW6COLB  ROW6COLC  ROW6COLD
    6  ROW7COLA  ROW7COLB  ROW7COLC  ROW7COLD

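Anyway, what I want is to detect the error in the second file, "brokenfile.csv", which is currently missing the "" around one of the header columns.

Pandas tries to be smart about recognizing data types while reading the data. That is exactly what happens in the case you describe: the quoted and the unquoted header are both parsed as strings.

In short, there is no error to detect! At least, pandas does not raise an error in this case.

If you want to detect missing quotes in the header, what you can do is read the first line in a more "traditional" Python way and draw your own conclusions from it: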

    with open('filename') as f:
        lines = f.readlines()
    # ... inspect lines[0], the header line, for missing quotes
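As a minimal sketch of that idea, you could read only the first line and check each field for surrounding double quotes. The function name `find_unquoted_header_fields` is hypothetical, and the sketch assumes comma-separated fields with no separator inside a quoted field:

```python
def find_unquoted_header_fields(path, sep=","):
    """Return the header fields that are not wrapped in double quotes.

    Hypothetical helper: assumes no field contains `sep` inside quotes.
    """
    with open(path, newline="") as f:
        # Only the first line (the header) is needed.
        header = f.readline().rstrip("\r\n")
    fields = header.split(sep)
    return [
        field
        for field in fields
        if not (len(field) >= 2 and field.startswith('"') and field.endswith('"'))
    ]
```

Called on the brokenfile.csv from the question, this should return `['COL C']`, and an empty list for goodfile.csv. A header with embedded separators would need the `csv` module rather than a plain `split`.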

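I think you can detect the missing double quote (") in the DataFrame's column names with str.contains and boolean indexing, inverting the boolean array with ~: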

    import io

    import pandas as pd

    temp = u'''"COL_A","COL_B",COL C,"COL_D"
    "ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
    "ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
    "ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
    "ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
    "ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
    "ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
    "ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
    # after testing, replace io.StringIO(temp) with the filename
    # quoting=3 keeps the quote characters instead of stripping them
    df = pd.read_csv(io.StringIO(temp), quoting=3)
    print(df)
    #       "COL_A"     "COL_B"       COL C     "COL_D"
    # 0  "ROW1COLA"  "ROW1COLB"  "ROW1COLC"  "ROW1COLD"
    # 1  "ROW2COLA"  "ROW2COLB"  "ROW2COLC"  "ROW2COLD"
    # 2  "ROW3COLA"  "ROW3COLB"  "ROW3COLC"  "ROW3COLD"
    # 3  "ROW4COLA"  "ROW4COLB"  "ROW4COLC"  "ROW4COLD"
    # 4  "ROW5COLA"  "ROW5COLB"  "ROW5COLC"  "ROW5COLD"
    # 5  "ROW6COLA"  "ROW6COLB"  "ROW6COLC"  "ROW6COLD"
    # 6  "ROW7COLA"  "ROW7COLB"  "ROW7COLC"  "ROW7COLD"

    print(df.columns)
    # Index(['"COL_A"', '"COL_B"', 'COL C', '"COL_D"'], dtype='object')

    # True where the column name contains a double quote
    print(df.columns.str.contains('"'))
    # [ True  True False  True]

    # ~ inverts the mask: True where the quote is missing
    print(~df.columns.str.contains('"'))
    # [False False  True False]

    # boolean indexing selects only the unquoted column names
    print(df.columns[~df.columns.str.contains('"')])
    # Index(['COL C'], dtype='object')
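Since the goal is to alert the user, the check above could be wrapped in a small function. This is only a sketch: the name `unquoted_columns` and the sample string are made up for illustration. It uses `csv.QUOTE_NONE`, which is the named constant for the value 3, and `nrows=0` so that only the header is parsed:

```python
import csv
import io

import pandas as pd


def unquoted_columns(source):
    """Hypothetical helper: list header names that contain no double quote."""
    # nrows=0 reads only the header; QUOTE_NONE (== 3) keeps the quote characters.
    header = pd.read_csv(source, quoting=csv.QUOTE_NONE, nrows=0)
    cols = header.columns
    return list(cols[~cols.str.contains('"')])


# Illustrative data only; in practice pass the path of the CSV file.
sample = '"COL_A","COL_B",COL C,"COL_D"\n"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"\n'
missing = unquoted_columns(io.StringIO(sample))
if missing:
    print("Header fields without quotes:", missing)
```

Like the snippet above, this only looks for a double quote anywhere in the name, so a header with a single stray quote would still pass.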

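Yes, I know pandas is really smart. But when writing the program, I want to alert the user about the missing " in the header.

Well, as far as pandas is concerned, there is no error here. I don't see how you could produce some kind of error or warning in a case like this.

See the updated answer. You can pass quoting=3 to read_csv so that pandas does not strip those characters, but both files are valid CSV files. I don't see any reason for pandas to raise an error.

Is your goal just to detect whether the header is missing double quotes? What about single quotes? Are there any other "errors" that should be detected? Also, you shouldn't "get the same result": one DataFrame should have COL_C and the other COL C.

Wow, great, I'll have to learn a lot to be like you. Thank you.

Yes, you've saved me twice in one day. Thank you very much. Where can I learn more about this topic? I've just started as a freelancer, and I used to think I knew Python programming, but I've found I still have a lot to learn. Sorry if my English isn't very good. Greetings from Venezuela!

Greetings from Slovakia. :) I think the docs are great. There are lots of examples. And in my experience, the best way to learn is to try answering questions.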