Python: Getting the input precision when calling pandas to_datetime (or dateutil)?

python pandas

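I have an array of date strings, for example:

["1999-2-4", "1989-2", "2020", "1914/09/01"]

I am converting these strings to timestamps with pandas' to_datetime.

However, what I get back is a standard pandas datetime with ns precision. I also need to know the original precision of each string (i.e. [day, month, year, day] for the array above).

What I initially tried was to set up an array of formats with a matching array of precisions:

1: ["%Y-%m-%d", "%Y/%m/%d", "%Y-%m", "%Y"]

2: ["day", "day", "month", "year"]

I planned to simply try each format in order until one of them worked, and then look up the matching precision.

Unfortunately (for my purposes), an input like "1999" passed to to_datetime with format="%Y-%m-%d" parses successfully, even with exact=True. So much for a plan that relied on try/except in a loop.

I need some way to get the original precision. Is this possible with pandas? Or is it possible with dateutil?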

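A central question that arises is: how do you plan to use the precision information later on?

In your case (also considering the differences in formats, with optional leading zeros for day and month), I would go with an approach that first extracts the individual date components (year, month, day) and then combines them: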


import pandas as pd

def parse_date(s):
    # s is one row of the DataFrame; split its date string into components
    date_entries = s["date"].split("-")
    s["year"] = int(date_entries[0]) if len(date_entries) > 0 else None
    s["month"] = int(date_entries[1]) if len(date_entries) > 1 else None
    s["day"] = int(date_entries[2]) if len(date_entries) > 2 else None
    return s

dates = ["1999-2-4", "1989-2", "2020", "1914-09-01"]
pd.DataFrame(dates, columns=["date"]).apply(parse_date, axis=1)

Output:

      date      year    month   day
0   1999-2-4    1999    2.0     4.0
1   1989-2      1989    2.0     NaN
2   2020        2020    NaN     NaN
3   1914-09-01  1914    9.0     1.0

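Note that month and day come back as floats here, because a column with missing values is stored as float (NaN is a float); year would be too if any value were missing. You can add the actual precision calculation to the parse_date function, and combine the components into new columns as needed.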

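Alternatively, you can use .str.extract with a regular expression:

df = pd.DataFrame(dates, columns=["date"])
df["date"].str.extract(r"(?P<year>[0-9]{4})-?(?P<month>[01]?[0-9])?-?(?P<day>[0-3]?[0-9])?")

Output:

    year    month   day
0   1999    2        4
1   1989    2       NaN
2   2020    NaN     NaN
3   1914    09       01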
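
The precision calculation mentioned above could be sketched as follows. This is a minimal sketch using only the standard library; the pattern DATE_RE, the helper detect_precision and the set of accepted separators ("-", "/", ".") are assumptions made for illustration, not part of the original answer. It only handles year-first strings.

```python
import re

# Hypothetical pattern: a 4-digit year, then an optional month and an optional
# day, each introduced by one of the assumed separators "-", "/" or ".".
DATE_RE = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:[-/.](?P<month>\d{1,2})"
    r"(?:[-/.](?P<day>\d{1,2}))?)?$"
)

def detect_precision(text):
    """Return 'year', 'month' or 'day', or None if the text does not match."""
    match = DATE_RE.match(text.strip())
    if match is None:
        return None
    # The most specific component that is present decides the precision.
    if match.group("day") is not None:
        return "day"
    if match.group("month") is not None:
        return "month"
    return "year"

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
precisions = [detect_precision(d) for d in dates]
print(precisions)  # ['day', 'month', 'year', 'day']
```

The resulting list can be stored next to the parsed timestamps, for example as an extra column of the DataFrame.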

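In my view, this is not the best approach. Try/except should not be used for control flow when you can avoid it. Why not pick the exact format based on the input? Something like:

import pandas as pd

def get_format(text):
    # Check "/" first: a slash-separated date contains no "-" at all
    if text.count('/') == 2:
        return "%Y/%m/%d"
    if text.count('-') == 0:
        return "%Y"
    if text.count('-') == 1:
        return "%Y-%m"
    if text.count('-') == 2:
        return "%Y-%m-%d"
    raise ValueError(f"unrecognized date format: {text!r}")


dates = ["1999-2-4", "1989-2", "2020", "1914-09-01"]

results = [pd.to_datetime(x, format=get_format(x)) for x in dates]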
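
Since the format is chosen from the shape of the string, the same decision can also yield the precision label the question asks for. The following is a rough sketch under stated assumptions: it uses the standard library's datetime.strptime rather than pandas, and the names SHAPES, SEPARATORS and parse_with_precision are hypothetical. It assumes year-first strings with a single kind of separator.

```python
from datetime import datetime

# Number of separators -> (strptime template, precision label).
# "{s}" is replaced by whichever separator the string uses.
SHAPES = {
    0: ("%Y", "year"),
    1: ("%Y{s}%m", "month"),
    2: ("%Y{s}%m{s}%d", "day"),
}
SEPARATORS = ("-", "/")  # assumed set of separators; extend as needed

def parse_with_precision(text):
    """Return (datetime, precision) for a year-first date string."""
    # Pick the first known separator that occurs in the text, if any.
    sep = next((s for s in SEPARATORS if s in text), "")
    count = text.count(sep) if sep else 0
    if count not in SHAPES:
        raise ValueError(f"unsupported date shape: {text!r}")
    template, precision = SHAPES[count]
    return datetime.strptime(text, template.format(s=sep)), precision

dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
parsed = [parse_with_precision(d) for d in dates]
for value, precision in parsed:
    print(value.date(), precision)
```

Missing components default to 1 in the parsed value (for example "2020" becomes 2020-01-01), which is why the precision label has to be kept alongside it.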

Or, if you may have more formats, check out this code. You can add further separators to the get_dict() function.

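import pandas as pd
import re


def get_dict(dates):
    dic_list = []
    for d in dates:
        dic = {}
        # Split on either "-" or "/"
        parts = re.split('-|/', d)
        dic['date'] = d
        dic['Year'] = parts[0] if len(parts) > 0 else None
        dic['Month'] = parts[1] if len(parts) > 1 else None
        dic['Day'] = parts[2] if len(parts) > 2 else None
        dic_list.append(dic)
    return dic_list


dates = ["1999-2-4", "1989-2", "2020", "1914/09/01"]
df = pd.DataFrame(get_dict(dates))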

Output:

    date        Year    Month   Day
0   1999-2-4    1999    2        4
1   1989-2      1989    2       None
2   2020        2020    None    None
3   1914/09/01  1914    09      01

Using iloc:

df.iloc[:, 1:]

Output:

    Year    Month   Day
0   1999    2        4
1   1989    2       None
2   2020    None    None
3   1914    09       01

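Thanks. Unfortunately, the actual date formats are not uniformly separated by dashes; the separator could be /, or something else. Any parseable format. (I have edited the question to make this clearer.) For now, I don't need to do anything with the precision data, only store it. (Humans will read the output and need the precision information stored there.)

@dWitty: I would try to work out the existing datetime formats up front and adjust the regular expression (or the function applied in the first example) accordingly. Especially when you are unsure about the input formats, try to code defensively (e.g. if you are not sure whether a date is month-first or day-first, is it wise to just assume one format is right?). There is no "universal" best solution here. Maybe you can infer information about the format from other values?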
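
One way to code defensively around the month-first versus day-first question is to check how many candidate formats accept a given string, and treat more than one match as ambiguous. This is only a sketch: the helper possible_readings and the two example formats are assumptions for illustration and do not come from the thread.

```python
from datetime import datetime

def possible_readings(text, formats=("%d/%m/%Y", "%m/%d/%Y")):
    """Return every format in `formats` that parses `text` successfully."""
    readings = []
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit the string
        readings.append(fmt)
    return readings

ambiguous = possible_readings("04/02/1999")    # both readings are valid dates
unambiguous = possible_readings("25/02/1999")  # 25 cannot be a month
print(ambiguous)
print(unambiguous)
```

If a column contains at least one unambiguous value, that value can hint at the format used by the rest of the column, which is one way of inferring the format from other values.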