Fastest way to do data type conversion with csv.DictReader in Python


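I'm working with a CSV file in Python that will have about 100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (a float).

Since csv.DictReader and csv.reader return values only as strings, I currently iterate over all rows and convert the one numeric value to a float: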

for i in csvDict:
    i[col] = float(i[col])
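Can anyone suggest a better way to do this? I've been experimenting with various combinations of map, izip and itertools, and have searched extensively for examples of a more efficient approach, but unfortunately without much success.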

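In case it helps: I'm doing this on App Engine. I believe what I'm doing may be what causes this error: "Exceeded soft process size limit with 267.789 MB after servicing 11 requests total". I only get it when the CSV is quite large.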

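Edit: My goal
I'm parsing this CSV so that I can use it as a data source for the Google Visualization API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified while constructing this table. If anyone knows of a good gviz csv->datatable converter in Python, that would also solve my problem.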

Edit 2: My code

I believe my problem is related to the way I try to fix the types in fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

import csv
import datetime
import StringIO  # Python 2 module

import gviz_api

class GvizFromCsv(object):
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
        float(st)
        return True
    except ValueError:
        return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
      return True  # parsing succeeded, so the value matches the date format
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table
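First, if you only need to visualize this data, you don't need to convert anything: gviz can work with JSON (which is text-based, you know) or with CSV (which you already have, no parsing needed!). You can put the file on any reasonable web server and make it accessible through the GET requests that gviz issues, largely ignoring the fancy parameters.

But let's assume you do need processing. It looks like you are not only reading the CSV file but also trying to hold all of it in RAM. That may be impractical: as you add more processing, you will hit the RAM limit sooner and sooner. Process the data one row at a time (or a reasonable number of rows at a time if you apply window filters and the like), and put the processed rows into the datastore rather than into any list. Likewise, when serving the data in response to GET requests, read and process a row, write it to the response, and don't put it into any list or similar structure.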
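The row-at-a-time idea can be sketched as a generator. This is only a minimal illustration in Python 3, not the asker's code: the function name typed_rows, the column name sales and the sample data are all made up for the example.

```python
import csv
import io

def typed_rows(fileobj, float_cols):
    """Yield one dict per CSV row, converting the named columns to float.

    Rows are produced lazily, so only one row is held at a time unless
    the caller chooses to collect them.
    """
    for row in csv.DictReader(fileobj):
        for col in float_cols:
            row[col] = float(row[col])
        yield row

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO(
    "region,product,sales\n"
    "north,widget,12.5\n"
    "south,gadget,7.25\n"
)

total = 0.0
for row in typed_rows(sample, ["sales"]):
    total += row["sales"]  # use the row, then let it go out of scope
```

Wrapping the reader in list(...), as the question's code does, is what forces every converted row to stay in memory at once.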


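I see no problem with the conversion technique itself, as long as you use i sensibly later in the code and don't keep every i in memory as you go.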

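There are two different things: the "data source" and the "data table".

"Data source" is the name for the formatted data that a Google Visualization API server delivers as a visualization web service: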

This page describes how you can implement a data source to feed data
to visualizations built on the Google Visualization API. 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source.html 
The term "data source" includes the notion of a "wire protocol".

There are two ways to implement a "data source":

• Use one of the data source libraries listed in the Data Sources and Tools Gallery. 
All the data source libraries listed on that page implement the wire protocol.

• Write your own data source from scratch, 

http://code.google.com/intl/fr/apis/visualization/documentation/dev/implementing_data_source_overview.html
From the following:

• ... Data Sources and Tools Gallery : (....) You therefore need write only the
code needed to make your data available to the library in the form of a data table. 

• Write your own data source from scratch, as described in the
Writing your own Data Source
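As I understand it, when starting from scratch we need to implement the wire protocol and create the "data table", whereas with a data source library we only need to create the "data table".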


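There are pages about creating a "data source".

It seems to me that the example at that address is about creating a "data source", and the answer given there looks doubtful to me. But I'm not quite sure.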


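But those pages and topics are not what matters to you. If I understand correctly, you want to know how to prepare the data, the so-called "data table", that will be served through a "data source", not how to build the "data source" itself.

So preparing the "data table" is the key point.

Here is how: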

There are two ways to create/populate your visualization's data table:

•Query a data provider. A data provider is another site that returns
a populated DataTable in response to a request from your code. 
Some data providers also accept SQL-like query strings to sort or 
filter the data. See Data Queries for more information and an example
of a query.

•Create and populate your own DataTable by hand. You can populate your
DataTable in code on your page. The simplest way to do this is to create
a DataTable object without any data and populate it by calling addRows()
on it. You can also pass a JavaScript literal representation of the data
table into the DataTable constructor, but this is more complex and is
covered on the reference page.

http://code.google.com/intl/fr/apis/visualization/documentation/using_overview.html#preparedata
More information can be found here:

2. Describe your table schema
The table schema is specified by the table_description parameter
passed into the constructor. You cannot change it later. 
The schema describes all the columns in the table: the data type of
each column, the ID, and an optional label.

Each column is described by a tuple: (ID [,data_type [,label [,custom_properties]]]). 



The table schema is a collection of column descriptor tuples. 
Every list member, dictionary key or dictionary value must be either 
another collection or a descriptor tuple. You can use any combination 
of dictionaries or lists, but every key, value, or member must
eventually evaluate to a descriptor tuple. Here are some examples.

•List of columns: [('a', 'number'), ('b', 'string')]
•Dictionary of lists: {('a', 'number'): [('b', 'number'), ('c', 'string')]}
•Dictionary of dictionaries: {('a', 'number'): {'b': 'number', 'c': 'string'}}
•And so on, with any level of nesting.


3. Populate your data
To add data to the table, build a structure of data elements in the
exact same structure as the table schema. So, for example, if your
schema is a list, the data must be a list: 

•schema: [("color", "string"), ("shape", "string")] 
•data: [["blue", "square"], ["red", "circle"]] 
If the schema is a dictionary, the data must be a dictionary:

•schema: {("rowname", "string"): [("color", "string"), ("shape", "string")] }
•data: {"row1": ["blue", "square"], "row2": ["red", "circle"]}

http://code.google.com/intl/fr/apis/visualization/documentation/dev/gviz_api_lib.html#populatedata
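Finally, I would say that for your problem you must define a "table schema" and process the CSV file so as to obtain a structure of data elements that is exactly the same as the structure of the table schema.

The data types of the columns are defined in the "table schema". If the "data table" must be populated with data of the right types (that is, not strings), I can help you write the code that extracts the data from the CSV; it's easy.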
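One way to let the schema drive the conversion is to map each declared type to a converter. The following is a rough Python 3 sketch under stated assumptions: CONVERTERS and rows_for_schema are hypothetical names, only the 'string' and 'number' types are handled, and the standard csv module is used in place of manual splitting. It does not call gviz_api itself.

```python
import csv
import io

# Schema in the (ID, data_type, label) tuple form quoted above.
schema = [("col1", "string", "SURNAME"),
          ("col2", "number", "ONE"),
          ("col3", "number", "TWO")]

# Hypothetical mapping from schema type names to Python converters.
CONVERTERS = {"string": str, "number": float}

def rows_for_schema(fileobj, schema, skip_header=True):
    """Yield one list per CSV row, shaped and typed like the list schema."""
    reader = csv.reader(fileobj)
    if skip_header:
        next(reader, None)
    convs = [CONVERTERS[col[1]] for col in schema]
    for fields in reader:
        # zip stops at the shorter sequence, so extra trailing fields
        # (such as the rank column here) are ignored.
        yield [conv(value) for conv, value in zip(convs, fields)]

sample = io.StringIO(
    "surname,percent,cumulative percent,rank\n"
    "SMITH,1.006,1.006,1,\n"
    "JOHNSON,0.810,1.816,2,\n"
)
rows = list(rows_for_schema(sample, schema))
```

Because the schema is a list of columns, each row is a list in the same column order, which is the "exact same structure" requirement from the quoted documentation.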


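For now, I hope all of this is correct and will help.

I first used a regular expression to extract the data from the CSV file, but since the data is laid out very strictly on every line, we can simply use the split() function: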

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file
Or, without defining a function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )
    # to populate the data table by iterating in the CSV file
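For a while I thought I had to populate the data table one row at a time, because I was using a regular expression and needed to get the matched groups before converting the numeric strings to floats. With split(), everything can be done in a single instruction with LoadData().

So your code can be shortened. By the way, I don't see why it should define a class. A function seems sufficient to me: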

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table


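Now you have to check whether the way you read the CSV data from another API can be plugged into this code while keeping the principle of populating the data table by iteration.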
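If the CSV arrives as a string rather than as a file on disk (as in the question's code, which wraps the content in StringIO), the same iteration principle can be kept by wrapping the text in a file-like object. The sketch below is Python 3 and rests on assumptions: csv_text and iter_rows are hypothetical names, and it swaps split(',') for csv.reader, which also copes with quoted fields that contain commas.

```python
import csv
import io

def iter_rows(csv_text):
    """Yield [surname, float, float] lists from CSV text, skipping the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for fields in reader:
        if not fields:  # csv.reader yields [] for blank lines
            continue
        yield [fields[0], float(fields[1]), float(fields[2])]

# Hypothetical input; the first data row has a comma inside a quoted field.
csv_text = (
    "surname,percent,cumulative percent,rank\n"
    '"SMITH, JR",1.006,1.006,1\n'
    "JOHNSON,0.810,1.816,2\n"
)
rows = list(iter_rows(csv_text))

# For comparison, a plain split(',') cuts the quoted field in two.
naive_first_field = '"SMITH, JR",1.006,1.006,1'.split(',')[0]
```

Since LoadData() takes an iterable, the generator could presumably be passed to it directly in place of the generator expression used in the answer's code, without building the intermediate list.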

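How do you process that data? Row by row, or do you put the rows into some data structure?

I don't plan to do any row processing other than fixing the types. The result will be used as the data for building a gviz data table. I'm adding this to the question now!

Very useful reply, thanks a lot. Some notes: 1) The only reason I'm doing this is that I thought I needed to process the CSV before sending it to gviz. I didn't think gviz could handle a CSV file as a data source. I'm currently trying: csv -> DataTable.toJsonResponse(). From what you say, I may not need to do this at all; I'm off to experiment with that now. My constraint is that the CSV has to come in through Python, but I'm happy to output it as a gviz data source. 2) If I use the CSV as a data source and run queries on it, will it figure out the types (for sums and so on)?

I don't know whether it is possible to use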