Python pandas: filter and convert dates to datetime64[ns]
Tags: python, pandas, dataframe, hdf5, vaex
I'm trying to work out a problem and so far I haven't found a solution, so I hope you can help. I have a dataframe in which I want to convert a str column to datetime, but it contains some invalid rows that I want to filter out. Here are two examples:
Out[6]:
# name date
0 aa 2012-11-30T14:00:00+01:00
1 bb 2012-12-01T08:16:00+01:00
2 cc 2012-12-01T10:14:00+01:00
3 ee 2012-12-01T11:05:00+01:00
4 gg 2012-12-01T11:05:00+01:00
In [7]: df2
Out[7]:
# name date
0 aa 2012-11-30T14:00:00+01:00
1 bb 2012-12-01T08:16:00+01:00
2 cc 2012-12-01T10:14:00+01:00
3 ee 2012-12-01T11:05:00+01:00
4 ff fsadfi2 2ih3ro
5 gg 2012-12-01T11:05:00+01:00
For df the conversion works fine:
In [16]: df
Out[16]:
# name date pdate
0 aa 2012-11-30T14:00:00+01:00 2012-11-30 13:00:00.000000000
1 bb 2012-12-01T08:16:00+01:00 2012-12-01 07:16:00.000000000
2 cc 2012-12-01T10:14:00+01:00 2012-12-01 09:14:00.000000000
3 ee 2012-12-01T11:05:00+01:00 2012-12-01 10:05:00.000000000
4 gg 2012-12-01T11:05:00+01:00 2012-12-01 10:05:00.000000000
In [17]: df.dtypes
Out[17]:
name <class 'str'>
date <class 'str'>
pdate datetime64[ns]
dtype: object
The filtered dataframe, df2_filtered, has only 5 rows.
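As a rough sketch of how the invalid rows could be identified before any conversion, the snippet below tries to parse each string with the standard library and records which ones succeed. The helper name is_iso_datetime is made up for illustration, and the list stands in for the date column of df2; this is not how vaex filters internally.

```python
from datetime import datetime


def is_iso_datetime(text):
    """Return True if text parses as an ISO 8601 datetime (hypothetical helper)."""
    try:
        datetime.fromisoformat(text)
    except ValueError:
        return False
    return True


# Stand-in for the date column of df2 shown above.
dates = [
    "2012-11-30T14:00:00+01:00",
    "2012-12-01T08:16:00+01:00",
    "2012-12-01T10:14:00+01:00",
    "2012-12-01T11:05:00+01:00",
    "fsadfi2 2ih3ro",
    "2012-12-01T11:05:00+01:00",
]

# One boolean per row: True where the string is a parseable datetime.
valid = [is_iso_datetime(d) for d in dates]
print(valid)
```

A parse-based check like this is stricter than looking for a substring such as ':00', which would also accept malformed strings that happen to contain it.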
Now I try the conversion and get a nice error message:
In [21]: df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')
...:
/usr/local/bin/ipython:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
#!/opt/local/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-563087d6f949> in <module>
----> 1 df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')
/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in __setitem__(self, name, value)
4370 if isinstance(name, six.string_types):
4371 if isinstance(value, (np.ndarray, Column)):
-> 4372 self.add_column(name, value)
4373 else:
4374 self.add_virtual_column(name, value)
/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, data, dtype)
5743 # self._length_original = len(data)
5744 # self._index_end = self._length_unfiltered
-> 5745 super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
5746 self._length_unfiltered = int(round(self._length_original * self._active_fraction))
5747 # self.set_active_fraction(self._active_fraction)
/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
2872 # give a better warning to avoid confusion
2873 if len(self) == len(ar):
-> 2874 raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered()))
2875 raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
2876 # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data))
ValueError: Array is of length 5, while the length of the DataFrame is 5 due to the filtering, the (unfiltered) length is 6.
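The length mismatch in that message can be illustrated without vaex. The sketch below assumes the filter behaves like a boolean mask over the 6 underlying rows: converting only the 5 selected values gives a 5-element array, which cannot be attached to a 6-row table unless it is first scattered back into a full-length array. Names such as mask and to_utc_naive are illustrative, not vaex API.

```python
from datetime import datetime, timezone

import numpy as np

dates = np.array(
    [
        "2012-11-30T14:00:00+01:00",
        "2012-12-01T08:16:00+01:00",
        "2012-12-01T10:14:00+01:00",
        "2012-12-01T11:05:00+01:00",
        "fsadfi2 2ih3ro",
        "2012-12-01T11:05:00+01:00",
    ],
    dtype=object,
)

# Boolean mask playing the role of the filter: True where the row is kept.
mask = np.array([":00" in d for d in dates])


def to_utc_naive(text):
    """Parse an ISO 8601 string and return a naive datetime expressed in UTC."""
    parsed = datetime.fromisoformat(text)
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


# Only the rows that pass the filter are converted: 5 values, not 6.
converted = np.array([to_utc_naive(d) for d in dates[mask]], dtype="datetime64[ns]")

# Full-length column: NaT everywhere, converted values where the mask is True.
pdate = np.full(len(dates), np.datetime64("NaT"), dtype="datetime64[ns]")
pdate[mask] = converted

print(len(converted), len(pdate))
print(pdate)
```

The full-length pdate array has one entry per underlying row, with NaT marking the row that was filtered out.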
Assigning df2_filtered.date.astype('datetime64[ns]') instead seems to work, but when I try to use df2_filtered I get the following:
In [57]: df2_filtered
Out[57]: ERROR:MainThread:vaex:error evaluating: pdate at rows 0-5
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
result = self[expression]
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 141, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: 'Unknown variables or column: "astype(date, \'datetime64[ns]\')"'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 3467, in table_part
values[name] = df.evaluate(name)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5038, in evaluate
dtype = dtypes[expression] = self.dtype(expression, internal=False)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 2005, in dtype
data = self.evaluate(expression, 0, 1, filtered=False, internal=True, parallel=False)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5143, in evaluate
value = scope.evaluate(expression)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
result = self[expression]
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 136, in __getitem__
self.values[variable] = self.evaluate(expression) # , out=self.buffers[variable])
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 100, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/functions.py", line 2106, in _astype
return x.astype(dtype)
AttributeError: 'ColumnStringArrow' object has no attribute 'astype'
ERROR:MainThread:vaex:error evaluating: pdate at rows 0-5
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
result = self[expression]
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 141, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: 'Unknown variables or column: "astype(date, \'datetime64[ns]\')"'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 3467, in table_part
values[name] = df.evaluate(name)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5038, in evaluate
dtype = dtypes[expression] = self.dtype(expression, internal=False)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 2005, in dtype
data = self.evaluate(expression, 0, 1, filtered=False, internal=True, parallel=False)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5143, in evaluate
value = scope.evaluate(expression)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
result = self[expression]
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 136, in __getitem__
self.values[variable] = self.evaluate(expression) # , out=self.buffers[variable])
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 100, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/functions.py", line 2106, in _astype
return x.astype(dtype)
AttributeError: 'ColumnStringArrow' object has no attribute 'astype'
# name date pdate
0 aa 2012-11-30T14:00:00+01:00 error
1 bb 2012-12-01T08:16:00+01:00 error
2 cc 2012-12-01T10:14:00+01:00 error
3 ee 2012-12-01T11:05:00+01:00 error
4 gg 2012-12-01T11:05:00+01:00 error
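IIUC, pd.to_datetime lets you convert a column to datetime with a few keyword arguments. In this case you need errors='coerce', which turns values that cannot be parsed into NaT: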
print(df)
name date
0 aa 2012-11-30T14:00:00+01:00
1 bb 2012-12-01T08:16:00+01:00
2 cc 2012-12-01T10:14:00+01:00
3 ee 2012-12-01T11:05:00+01:00
4 ff fsadfi22ih3ro
5 gg 2012-12-01T11:05:00+01:00
df['date'] = pd.to_datetime(df['date'],errors='coerce')
print(df)
name date
0 aa 2012-11-30 14:00:00+01:00
1 bb 2012-12-01 08:16:00+01:00
2 cc 2012-12-01 10:14:00+01:00
3 ee 2012-12-01 11:05:00+01:00
4 ff NaT
5 gg 2012-12-01 11:05:00+01:00
Now just use .dropna() to drop those rows, subsetting on the date column:
df = df.dropna(subset=['date'])  # dropna returns a new frame, so assign it back
print(df)
name date
0 aa 2012-11-30 14:00:00+01:00
1 bb 2012-12-01 08:16:00+01:00
2 cc 2012-12-01 10:14:00+01:00
3 ee 2012-12-01 11:05:00+01:00
5 gg 2012-12-01 11:05:00+01:00
print(df.dtypes)
name object
date datetime64[ns, pytz.FixedOffset(60)]
dtype: object
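The dtype above is timezone-aware, whereas the question asks for plain datetime64[ns]. A minimal sketch of one way to get there in pandas, assuming UTC is the desired reference: parse with utc=True, strip the timezone, then drop the rows that failed to parse. The column name pdate follows the question; the rest is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["aa", "bb", "cc", "ee", "ff", "gg"],
        "date": [
            "2012-11-30T14:00:00+01:00",
            "2012-12-01T08:16:00+01:00",
            "2012-12-01T10:14:00+01:00",
            "2012-12-01T11:05:00+01:00",
            "fsadfi22ih3ro",
            "2012-12-01T11:05:00+01:00",
        ],
    }
)

# utc=True normalises every offset to UTC; unparseable strings become NaT.
parsed = pd.to_datetime(df["date"], errors="coerce", utc=True)

# Remove the timezone (keeping the UTC wall time) and force nanosecond resolution.
df["pdate"] = parsed.dt.tz_localize(None).astype("datetime64[ns]")

# Keep only the rows whose date could be parsed.
clean = df.dropna(subset=["pdate"])
print(clean)
print(clean.dtypes)
```

The original date strings are left untouched here, so the parsed column can be compared against its source.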
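Unfortunately I don't have a complete answer, but I may have an idea about this part of your question: "I don't understand why the number of rows in df2 matters." It matters because, as far as I understand, vaex constructs new columns by storing the operation that defines them: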
df2_filtered['pdate']=df2_filtered.date.astype('datetime64[ns]')
df2_filtered=df2[df2['date'].str.contains(':00')]
df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')
# Adds a NumPy array to the dataframe
df2_filtered['pdate'] = df2_filtered.date.values.astype('datetime64[ns]')
# Adds a virtual column (backed by an expression) to the dataframe
# at zero memory cost
df2_filtered['pdate']=df2_filtered.date.astype('datetime64[ns]')
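To make the difference between the two assignments concrete, here is a toy model of a "virtual column" in plain Python. It is only a sketch of the general idea of storing an expression and evaluating it on demand; the TinyFrame class and its methods are hypothetical and are not vaex code.

```python
class TinyFrame:
    """Toy table with in-memory columns and lazy 'virtual' columns (not vaex code)."""

    def __init__(self, **columns):
        self.columns = dict(columns)  # name -> list of values held in memory
        self.virtual = {}             # name -> function evaluated on demand

    def add_virtual_column(self, name, func):
        # Only the recipe is stored; nothing is computed at this point.
        self.virtual[name] = func

    def evaluate(self, name):
        # Virtual columns are computed when asked for; real ones are returned as-is.
        if name in self.virtual:
            return self.virtual[name](self)
        return self.columns[name]


frame = TinyFrame(date=["2012-11-30T14:00:00+01:00", "fsadfi2 2ih3ro"])

# Defining the column succeeds even though one row cannot be converted.
frame.add_virtual_column("year", lambda f: [int(d[:4]) for d in f.columns["date"]])

error = None
try:
    frame.evaluate("year")
except ValueError as exc:
    # The problem only surfaces when the column is actually evaluated.
    error = exc
    print("failed only when evaluated:", exc)
```

In this toy model, defining the column costs no memory and raises nothing; the failure appears later, when the column is evaluated, which mirrors how the assignment in the question appeared to work until the dataframe was displayed.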
print(df2.nbytes)
648
print(df2_filtered.drop_filter())
# name date pdate
0 aa 2012-11-30T14:00:00+01:00 2012-11-30 13:00:00.000000000
1 bb 2012-12-01T08:16:00+01:00 2012-12-01 07:16:00.000000000
2 cc 2012-12-01T10:14:00+01:00 2012-12-01 09:14:00.000000000
3 ee 2012-12-01T11:05:00+01:00 2012-12-01 10:05:00.000000000
4 ff fsadfi2 2ih3ro NaT
5 gg 2012-12-01T11:05:00+01:00 2012-12-01 10:05:00.000000000