Python pandas to_datetime() does not detect columns


I have three columns (h1, h2, h3) that represent the day, month, and year respectively, for example:

import pandas as pd

df = pd.DataFrame({
    'h1': [1,2,3],
    'h2': [1,2,3],
    'h3': [2000,2001,2002]
})

When I run:

pd.to_datetime(df[['h1', 'h2', 'h3']])

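it raises an error:

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

But when I rename the columns first and then call pd.to_datetime, for example: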

df=df.rename(columns ={'h1':'day', 'h2':'month', 'h3': 'year'})
df["date_col"] =pd.to_datetime(df[['day','month','year']])

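it works. Do we have to do it this way? Or is there a way to supply a format so that the columns are detected as day, month, and year respectively? Thanks.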

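Summary: as the documentation says, your approach of renaming the columns is already a smart one:

Examples

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns']) or plurals of the same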
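If the aim is only to leave the original column names untouched, one option is to pass pd.to_datetime a dict keyed by the names it expects. The following is a minimal sketch of that idea, not something from the original answer; the variable name `dates` is illustrative, and `df` itself is not modified.

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],           # day
    'h2': [1, 2, 3],           # month
    'h3': [2000, 2001, 2002],  # year
})

# Map the expected keys onto the existing columns instead of renaming df.
# pd.to_datetime treats a dict-like argument the same way as a DataFrame.
dates = pd.to_datetime({'year': df['h3'], 'month': df['h2'], 'day': df['h1']})

print(dates)
```

The result is a Series of datetimes aligned with the index of `df`, so it can be assigned directly to a new column.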

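But there are alternatives. In my experience, a list comprehension with zip is quite fast (for small sets). At around 3000 rows of data, renaming the columns becomes the fastest approach. As the plot shows, the overhead of renaming is costly for smaller sets but is amortized for larger ones.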
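On the question of supplying a format: the zip approach can be combined with an explicit `format` argument. This is a rough sketch under the assumption that zero-padded intermediate strings are acceptable; the names `date_strings` and `dates` are illustrative and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],           # day
    'h2': [1, 2, 3],           # month
    'h3': [2000, 2001, 2002],  # year
})

# Build zero-padded 'YYYY-MM-DD' strings row by row with zip.
date_strings = [
    f'{y:04d}-{m:02d}-{d:02d}'
    for y, m, d in zip(df['h3'], df['h2'], df['h1'])
]

# Parse with an explicit format so nothing has to be inferred.
dates = pd.to_datetime(date_strings, format='%Y-%m-%d')

print(dates)
```

Passing a list returns a DatetimeIndex rather than a Series; assigning it to a column of `df` works by position.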

Alternatives, with timings on Win10 and on a MacBook Air:

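Update with the benchmark code I wrote, listed at the end of this post (I would be glad to hear suggestions for improvement or about any library that helps).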


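I am creating a date from these three columns (h1, h2, h3) with the pandas library so that I can then use the datetime methods, but I don't want to rename the columns before creating it.

Great, thanks. The comprehension is nice.

I have formatted your question in a way that I find more readable. I also removed some parts of my answer because they were just duplicates. I hope you find it more readable too and apply it in the future :)

Even faster:

pd.to_datetime(['-'.join(i) for i in df[['h3','h2','h1']].values.astype(str)])

inspired by @jpp's nice link, but I don't get those results; it is about 3 times slower. See my timings.

@jpp what do you think of the function I added?

@jezrael I am actually trying to improve this code.

@jezrael Interesting, it is similar.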

pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])
pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])
df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1)
pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day', 'h2':'month','h3':'year'}))

# df = pd.concat([df]*1000)

Timings on Win10:

2.74 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.08 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
158 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.64 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timings on a MacBook Air:

100 loops, best of 3: 6.1 ms per loop
100 loops, best of 3: 12.7 ms per loop
1 loop, best of 3: 335 ms per loop
100 loops, best of 3: 4.7 ms per loop

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({
    'h1': np.arange(1,11),
    'h2': np.arange(1,11),
    'h3': np.arange(2000,2010)
})

myfuncs = {
"pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])":
    lambda: pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])]),
"pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)])":
    lambda: pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)]),
"pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))":
    lambda: pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))
}

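d = defaultdict(dict)  # d[label][number of rows] -> seconds per call
step = 10              # grow the frame tenfold each round
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():
        iters = 1
        t = 0
        # Raise the iteration count until one timing run takes at least 0.2 s
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            per_call = t / iters  # divide by the count actually timed
            iters *= 10
        d[k][lendf] = per_call
        # Stop growing the frame once a single timing run exceeds 2 s
        if t > 2: cont = False
    df = pd.concat([df]*step)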

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()