Arithmetic in pandas HDF5 queries
Why do I get an error when I try to perform simple arithmetic on constants in an HDF5 where clause? Here is an example:
>>> import pandas
>>> import numpy as np
>>> d = pandas.DataFrame({"A": np.arange(10), "B": np.random.randint(1, 100, 10)})
>>> store = pandas.HDFStore('teststore.h5', mode='w')
>>> store.append('thingy', d, format='table', data_columns=True, append=False)
>>> store.select('thingy', where="B>50")
A B
0 0 61
1 1 63
6 6 80
7 7 79
8 8 52
9 9 82
>>> store.select('thingy', where="B>40+10")
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
store.select('thingy', where="B>40+10")
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 682, in select
return it.get_result()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 1365, in get_result
results = self.func(self.start, self.stop, where)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 675, in func
columns=columns, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4006, in read
if not self.read_axes(where=where, **kwargs):
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 3212, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4527, in __init__
self.condition, self.filter = self.terms.evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 580, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 122, in prune
res = pr(left.value, right.prune(klass))
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 317, in evaluate
raise ValueError("query term is not valid [%s]" % self)
ValueError: query term is not valid [[Condition : [None]]]
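What is going on here?

This is not supported at all. I suppose it could fail with a slightly better message. It is trying to "and" two nodes together (the comparison and the +10), but it does not know how to handle that, because the second node is not a comparison operation.

I suppose it could be implemented, but IMHO specifying expressions in a where clause is unnecessarily complex.

Unnecessarily complex? If the underlying PyTables query supports it, why can't pandas just pass it through?

Because such expressions require quite a bit of parsing; it is an AST transformation. Maybe I'll ask on the list. "Just passing it through" involves far more than that.

But what does pandas gain by doing this rather than simply accepting a PyTables query string directly? Creating a custom query syntax that is less powerful than the underlying PyTables one seems a bit perverse.

You have to build the query yourself with the correct syntax, infer all the types correctly, and provide type conversions and operator precedence, so this is far from trivial.

I don't see why any of that is necessary. The PyTables query syntax already provides operator precedence and all the rest of the expression handling, and it lets you query against column names and passed-in condvars. I can do
store.get_storer('thingy').table.where("B>X**2-49*X+10", condvars={"X": 40})
and it works fine.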
>>> for row in store.get_storer('thingy').table.where("B>40+10"):
...     print(row[:])
(0L, 0, 61)
(1L, 1, 63)
(6L, 6, 80)
(7L, 7, 79)
(8L, 8, 52)
(9L, 9, 82)
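
If you would rather stay with store.select than drop down to the PyTables table, the simplest workaround is to compute the constant in Python first and put the result into the where string. The discussion above describes the problem as an AST transformation, so here is a minimal sketch of that idea: a helper that folds constant arithmetic in a where string before it is handed to pandas. The name fold_constants and the set of supported operators are my own choices for illustration, not part of pandas or PyTables, and this is not how pandas parses queries internally. It assumes Python 3.9 or later (for ast.unparse) and only handles plain numeric constants.

```python
import ast
import operator

# Arithmetic operators this sketch knows how to fold (an illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


class _FoldConstants(ast.NodeTransformer):
    """Replace binary operations on two numeric constants with their value."""

    def visit_BinOp(self, node):
        # Fold the children first so nested expressions collapse bottom-up.
        self.generic_visit(node)
        op = _OPS.get(type(node.op))
        if (
            op is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, (int, float))
            and isinstance(node.right.value, (int, float))
        ):
            value = op(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        # Anything involving a column name or variable is left untouched.
        return node


def fold_constants(where):
    """Return `where` with constant arithmetic evaluated (hypothetical helper)."""
    tree = ast.parse(where, mode="eval")
    tree = _FoldConstants().visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


folded = fold_constants("B>40+10")
print(folded)  # B > 50
```

With the store from the question, the folded string could then be passed along as store.select('thingy', where=fold_constants("B>40+10")). Expressions that mix constants with column names, such as "B > X + 1", come back unchanged, so this only helps in the constant-only case shown in the question.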