Parsing a time format efficiently in pandas/Python

python pandas

I have a DataFrame like the following:

+-----------+------+--------------+
| invoiceNo | time | invoiceValue |
+-----------+------+--------------+
|     A     |   6  |       2      |
+-----------+------+--------------+
|     B     |  12  |       3      |
+-----------+------+--------------+
|     C     |  356 |       5      |
+-----------+------+--------------+
|     D     | 2145 |       6      |
+-----------+------+--------------+

import pandas as pd

df = pd.DataFrame({'invoiceNo': ['A', 'B', 'C', 'D'],
                   'time': [6, 12, 356, 2145],
                   'invoiceValue': [2, 3, 5, 6]})

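My task is to extract the corresponding hour from the time value.

The problem is that the time column should ideally show 4 digits, but because it is stored as a number, the leading zeros have been dropped. So 6 here means 0006, i.e. 00 hours and 06 minutes.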

To achieve this, I wrote the code below, which gives the correct result:
df['adj-time'] = df['time'].apply(lambda x: '{0:0>4}'.format(x))
df['adj-time'] = df['adj-time'].apply(lambda x: pd.to_datetime(x,format= '%H%M'))
df['hour'] = df['adj-time'].apply(lambda x: x.hour)
df.drop('adj-time',axis=1, inplace=True)

Below is my desired output:
+-----------+------+--------------+------+
| invoiceNo | time | invoiceValue | hour |
+-----------+------+--------------+------+
|     A     |   6  |       2      |   0  |
+-----------+------+--------------+------+
|     B     |  12  |       3      |   0  |
+-----------+------+--------------+------+
|     C     |  356 |       5      |   3  |
+-----------+------+--------------+------+
|     D     | 2145 |       6      |  21  |
+-----------+------+--------------+------+

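However, my problem is that the code above is very slow and time-consuming on large datasets.
How can I make it more efficient in terms of performance/speed?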
If time is an integer:
hour = int(time/100)

If it is a string:
hour = int(int(time)/100)
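
For a whole column rather than a single value, the same arithmetic can be applied to the Series directly, without a Python-level loop. This is a minimal sketch, assuming the 'time' column holds non-negative integers in HHMM form as in the question; the 'minutes' column and the name hour_from_str are illustrative additions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'invoiceNo': ['A', 'B', 'C', 'D'],
                   'time': [6, 12, 356, 2145],
                   'invoiceValue': [2, 3, 5, 6]})

# Floor division by 100 drops the two minute digits: 2145 -> 21.
df['hour'] = df['time'] // 100
# The remainder keeps only the minute digits: 2145 -> 45.
df['minutes'] = df['time'] % 100

# If the column holds strings such as '356', convert to int first
# (assumption: every value is a valid integer string).
hour_from_str = df['time'].astype(str).astype(int) // 100
```

Using // keeps the result as an integer dtype, whereas / would produce floats that then need truncating.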

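Use string operations to extract the hour: zfill to 4 characters (6 if there are seconds as well), then slice the first 2 characters to get the hour (minutes are [2:4], seconds are [4:6]). Use pd.to_numeric to get a numeric dtype.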
df['hour'] = pd.to_numeric(df['time'].astype(str).str.zfill(4).str[0:2])
df['minutes'] = pd.to_numeric(df['time'].astype(str).str.zfill(4).str[2:4])

  invoiceNo  time  invoiceValue  hour  minutes
0         A     6             2     0        6
1         B    12             3     0       12
2         C   356             5     3       56
3         D  2145             6    21       45
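
The six-character case mentioned above (values that also carry seconds) might look like the following. This is only a sketch: the HHMMSS sample values are hypothetical and do not come from the question's data.

```python
import pandas as pd

# Hypothetical HHMMSS values stored as integers (leading zeros lost).
s = pd.Series([6, 1205, 35630, 214559])

# Pad to six characters so every value reads as HHMMSS.
padded = s.astype(str).str.zfill(6)

hour = pd.to_numeric(padded.str[0:2])
minute = pd.to_numeric(padded.str[2:4])
second = pd.to_numeric(padded.str[4:6])
```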


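If you are interested in converting 'time' to a timedelta64[ns] dtype, you can use the flexible parsing of pd.to_datetime. Since the year/month/day are missing, the date defaults to 1900-01-01, which we then subtract: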
df['new_time'] = (pd.to_datetime(df['time'].astype(str).str.zfill(4), format='%H%M')
                  - pd.to_datetime('1900-01-01'))

  invoiceNo  time  invoiceValue  hour  minutes        new_time
0         A     6             2     0        6 0 days 00:06:00
1         B    12             3     0       12 0 days 00:12:00
2         C   356             5     3       56 0 days 03:56:00
3         D  2145             6    21       45 0 days 21:45:00
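
A timedelta column can also be built without string parsing or the 1900-01-01 placeholder date. The sketch below is an alternative technique, not the answer's method. It assumes the 'time' values are non-negative HHMM integers and combines pd.to_timedelta with integer arithmetic.

```python
import pandas as pd

df = pd.DataFrame({'time': [6, 12, 356, 2145]})

# Split HHMM into hours and minutes arithmetically, then convert each
# part to a timedelta and add them together.
df['new_time'] = (pd.to_timedelta(df['time'] // 100, unit='h')
                  + pd.to_timedelta(df['time'] % 100, unit='min'))
```

Unlike the format='%H%M' parse, this does not reject out-of-range input such as 2575, so it is only appropriate when the data is known to be clean.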

Likewise, use zfill with 'time' cast to a string, convert it to datetime, and extract the hour component.

df['hour'] = pd.to_datetime(df.time.astype('str').str.zfill(4), format='%H%M').dt.hour
# display(df)
  invoiceNo  time  invoiceValue  hour
0         A     6             2     0
1         B    12             3     0
2         C   356             5     3
3         D  2145             6    21

Reading from a csv

Set the type of the 'time' column when reading the data in; astype('str') is then not needed.

df = pd.read_csv('test.csv', dtype={'time': str})
df['hour'] = pd.to_datetime(df.time.str.zfill(4), format='%H%M').dt.hour
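
To try this without a file on disk, the CSV can be simulated in memory. This is a sketch only: the csv_text contents below are a hypothetical stand-in for test.csv, built from the question's sample rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'test.csv', mirroring the question's data.
csv_text = "invoiceNo,time,invoiceValue\nA,6,2\nB,12,3\nC,356,5\nD,2145,6\n"

# dtype={'time': str} keeps the column as text, so no astype(str) is needed.
df = pd.read_csv(io.StringIO(csv_text), dtype={'time': str})
df['hour'] = pd.to_datetime(df['time'].str.zfill(4), format='%H%M').dt.hour
```

Reading the column as a string has a side benefit: if the file itself stores values like 0006, the leading zeros survive the read.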

timeit test
# 2 million rows of data
df = pd.DataFrame({'time': [6, 12, 356, 2145]})
dft = pd.concat([df] * 500000).reset_index(drop=True)
%%timeit
pd.to_datetime(dft.time.astype('str').str.zfill(4), format='%H%M').dt.hour
[out]:
1.51 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.to_numeric(dft.time.astype(str).str.zfill(4).str[0:2])
[out]:
2.6 s ± 41.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
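
Outside a notebook, where the %%timeit magic is unavailable, a comparable measurement can be sketched with the standard-library timeit module. The function names and the smaller row count here are illustrative choices. Timings depend on the machine and pandas version, so none are quoted. The integer-division variant from the first answer is included for comparison.

```python
import timeit
import pandas as pd

df = pd.DataFrame({'time': [6, 12, 356, 2145]})
# 200,000 rows, smaller than the 2 million above, to keep the run short.
dft = pd.concat([df] * 50_000).reset_index(drop=True)

def via_datetime(s):
    # Pad to HHMM, parse as datetime, take the hour component.
    return pd.to_datetime(s.astype(str).str.zfill(4), format='%H%M').dt.hour

def via_string_slice(s):
    # Pad to HHMM, slice the first two characters, convert to a number.
    return pd.to_numeric(s.astype(str).str.zfill(4).str[0:2])

def via_integer_division(s):
    # Pure integer arithmetic; assumes an integer column.
    return s // 100

for func in (via_datetime, via_string_slice, via_integer_division):
    seconds = timeit.timeit(lambda: func(dft['time']), number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per run")
```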

Thank you. I applied the above solution to the actual dataset. The speed improved a lot. Awesome.
I like your thought process and simple solution. The reason I prefer the time format is that I want my code to be future-proof. Meaning, if I want to extract the minutes in a future analysis, I should be able to do it easily. I have upvoted your answer. It is good out-of-the-box thinking.
Awesome. Your answer also improved the computation speed a lot. Hence the upvote. I have accepted @ALollz's answer. Thanks a lot.
Got it. I applied both solutions to a subset of the dataset, picking 2.1Mn rows and 42 columns. The columns consist of different data types. Your solution was 2 seconds faster.