Python: how to verify that all values in a .csv file are quoted

python pandas csv

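I have thousands of .csv files, and I need to check whether all of them have their values quoted.

I tried loading them all into a list of DataFrames and tried my luck with some very bad code. I need help.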

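import glob
import pandas as pd

def csv_list(folder):
    path = r'C:\\' + folder + ''  # use your path
    all_files = glob.glob(path + "/*.csv")
    li = []
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    return li

def check_doublequotes(csvfile):
    if (csvfile.QUOTE_ALL == True):
        print("csv is double quoted")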

I get the following error:

AttributeError: 'DataFrame' object has no attribute 'QUOTE_ALL'

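If you want to check whether a file is consistently quoted, there are two ways to do it. The first is to load all the data into memory and then check it for consistency. The other is to use converters, which may be an option if you want to save memory.

Loading all data into memory

The first possibility looks like this: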

import pandas as pd
import csv

# 1. read the file without removing the quotes (all columns will be strings)
df = pd.read_csv('yourfile.csv', sep=';', dtype='str', skipinitialspace=True, quoting=csv.QUOTE_NONE)

# 2. now check that all fields are double-quoted:
#    the .str.replace below removes trailing spaces from the fields
#    (behind the quotes); regex=True is needed because recent pandas
#    versions treat the pattern as a literal string by default.
#    the spaces at the beginning are removed by pandas (because of skipinitialspace=True)
df.apply(lambda ser: ser.str.startswith('"')
                     & ser.str.replace(r'\s+$', '', regex=True).str.endswith('"')
        ).all().all()

Test code:

import io

raw_csv='''""; "Col1"; "Col2" ; "Col3"; "C12"; "index"
"0"; "Bob"; "Joe"; "0.218111"; "BobJoe"; "1"
"1"; "Joe"; "Steve"; "0.849890"; "JoeSteve"; "2"
"2"; "Bill"; "Bob"; "0.316259"; "BillBob"; "0"
"3"; "Mary"; "Bob"; "0.179488"; "MaryBob"; "3"
"4"; "Joe"; "Steve"; "0.129853"; "JoeSteve"; "2"
"5"; "Anne"; "NaN"; "0.752859" ; "NaN"; "-1"
"6"; "NaN"; "Bill"; "0.414644"; "NaN"; "-1"
"7"; "NaN"; "NaN"; "0.026471"; "NaN"; "-1"'''

df= pd.read_csv(
        io.StringIO(raw_csv), 
        sep=';', index_col=[0], 
        dtype='str', 
        skipinitialspace=True, 
        quoting= csv.QUOTE_NONE)

print(df.apply(lambda ser: ser.str.startswith('"')
                           & ser.str.replace(r'\s+$', '', regex=True).str.endswith('"')
              ).all().all())
# --> True

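If you like, you can also make the output more detailed. For example, if you remove the quotes around Bob in the row with id "2", the overall result is False (of course).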
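One way to get that more detailed output is to keep the boolean mask around instead of collapsing it straight away. The following is only a sketch: it assumes the same semicolon-separated layout as the test data (shortened to two data columns), and the names mask, bad_columns and bad_cells are made up for illustration.

```python
import csv
import io

import pandas as pd

# Same layout as the test data, but Bob in row "2" has lost its quotes.
raw_csv = '''""; "Col1"; "Col2"
"0"; "Bob"; "Joe"
"1"; "Joe"; "Steve"
"2"; "Bill"; Bob
"3"; "Mary"; "Bob"'''

df = pd.read_csv(
    io.StringIO(raw_csv),
    sep=';', index_col=[0],
    dtype='str',
    skipinitialspace=True,
    quoting=csv.QUOTE_NONE)

# True where a field starts and ends with a double quote
mask = df.apply(lambda ser: ser.str.startswith('"')
                            & ser.str.rstrip().str.endswith('"'))

# Overall verdict for the file
all_quoted = bool(mask.all().all())

# Columns that contain at least one unquoted field
bad_columns = mask.columns[~mask.all()].tolist()

# (row label, column label) pairs of the offending fields
bad_cells = [(row, col)
             for col in mask.columns
             for row in mask.index[~mask[col]]]

print(all_quoted)
print(bad_columns)
print(bad_cells)
```

Because the file is read with csv.QUOTE_NONE, the row and column labels still carry their own quotes, so the offending cell is reported under the labels '"2"' and '"Col2"'.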

Using converters

The version with converters works as follows:

import pandas as pd
import csv

# define a check function (a converter from string to bool):
def check_quotes(val):
    stripped = val.strip()
    return stripped.startswith('"') and stripped.endswith('"')

# create a converter dict (just use a dict comprehension;
# if you don't know the column names, just make sure you
# choose a range at least as large as the number of columns
# in your files; a larger range doesn't hurt)
conv_dict = {i: check_quotes for i in range(100)}
df = pd.read_csv('yourfile.csv', sep=';', index_col=[0], converters=conv_dict, quoting=csv.QUOTE_NONE)

# if the file is consistently quoted, the following expression is True
# (the converter returns True for quoted fields, so every cell must be True)
df.all().all()
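Since the question is about thousands of files, the converter idea could be wrapped in a function and applied to a whole folder. This is a rough sketch under a few assumptions: the files are semicolon-separated, the header row is not itself checked (it becomes the column names), and the helper names is_fully_quoted and check_folder are invented here. It reads the header first to size the converter dict instead of guessing a range.

```python
import csv
import glob
import os

import pandas as pd


def check_quotes(val):
    """Converter: True if the raw field is wrapped in double quotes."""
    stripped = val.strip()
    return len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"')


def is_fully_quoted(path, sep=';'):
    """Return True if every data field of one file is double-quoted."""
    # Read only the header to learn how many columns the file has
    header = pd.read_csv(path, sep=sep, nrows=0, quoting=csv.QUOTE_NONE)
    conv_dict = {i: check_quotes for i in range(len(header.columns))}
    # Every field is replaced by True/False while the file is parsed
    df = pd.read_csv(path, sep=sep, converters=conv_dict, quoting=csv.QUOTE_NONE)
    return bool(df.all().all())


def check_folder(folder, sep=';'):
    """Map each .csv file name in the folder to its check result."""
    results = {}
    for filename in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        results[os.path.basename(filename)] = is_fully_quoted(filename, sep=sep)
    return results
```

Calling check_folder on your own directory returns a dict such as {'a.csv': True, 'b.csv': False}; you can loop over it and log a message for every file whose value is False. Only one file is held in memory at a time, and only as booleans.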

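There is no reason to quote all the values in a csv file. Only fields that contain the delimiter or quote characters need to be quoted. Quoting every value just wastes space and processing time.

Also, quoting in a csv has nothing to do with whether a value is a string or a number.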
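To illustrate this point, here is a small sketch that writes the same rows twice with the standard csv module, once with csv.QUOTE_MINIMAL and once with csv.QUOTE_ALL, and reads both back with pandas. The sample rows and the helper to_csv_text are made up for the example.

```python
import csv
import io

import pandas as pd

rows = [['id', 'name', 'note'],
        [1, 'Bob', 'plain text'],
        [2, 'Joe', 'contains, a comma'],
        [3, 'Anne', 'contains "quotes"']]


def to_csv_text(rows, quoting):
    """Serialise rows with the csv module using the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerows(rows)
    return buf.getvalue()


minimal = to_csv_text(rows, csv.QUOTE_MINIMAL)  # quotes only where required
quoted = to_csv_text(rows, csv.QUOTE_ALL)       # quotes around every field

print(minimal)
print(quoted)

# Both texts parse to the same DataFrame: the quotes are syntax, not data
df_minimal = pd.read_csv(io.StringIO(minimal))
df_quoted = pd.read_csv(io.StringIO(quoted))
print(df_minimal.equals(df_quoted))
```

With QUOTE_MINIMAL only the two note fields containing a comma or a quote character are wrapped in quotes, the fully quoted text is longer, and pandas still infers the id column as numeric in both cases.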
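What if a huge csv file has unquoted values in only one row (at the end of the file)?

Basically, this is a test. It should log a message like "csv validation failed".

How do you know whether a value is quoted? Is it a bool column, and does every column use single quotes or double quotes... Please provide an example.

Here, if we wanted to answer your question "stupidly", we would say that your condition QUOTE_ALL is incorrect and should be defined another way. Basically, what you want to do is: read the CSV, check whether the CSV passes the test, then put the result into another template for the test results (a dataframe, json, a database; the format is up to you).

When pandas reads data from a csv, it removes the quotes, because they are not needed to process the data. They are not part of the data; they only mark where a column starts and ends, since a value may contain commas or other special characters. The csv module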