Return the location of the date cell that raises ValueError: ('Unknown string format:', '2020-4-')

python-3.x pandas datetime

Given a dataframe with several messy date columns, as follows:

  id       date1       date2       date3
0   1   2020-5-10   2020-5-18   2021-5-17
1   2   2020.4.20   2020/5/20   2023/5/19
2   4   2020年5月7日   2020年5月20   2023年5月19
3   5  2020年4月23日    2020/5/1   2022/4/30
4   6  2020年4月12日  2020年4月20日   2022/4/19
5   7   2020年5月8日   2020年5月8日   2022年5月8日
6   8   2020年5月3号   2020年5月8号   2022年5月3号
7  12  2020—05—10  2020—05—11  2021—05—10
8  13  2020—05—08  2020—05—15  2022—05—14
9  14  2020年3月15日  2020年3月15日  2023年3月14日

I wrote a date_manipulate function to convert them to standard dates in the format %Y-%m-%d:

def date_manipulate(x):
    x = x.str.strip()
    x = x.str.replace(' ', '')
    x = x.str.replace('年', '-')
    x = x.str.replace('月', '-')
    x = x.str.replace('日', '')
    x = x.str.replace('号', '')
    x = x.str.replace('.', '-')
    x = x.str.replace('—', '-')
    x = pd.to_datetime(x).dt.date.astype(str)
    return x

date_cols = ['date1', 'date2', 'date3']
df[date_cols].apply(date_manipulate)

But it raises a ValueError, and I don't know which date cell in the original Excel file causes it:

ValueError: ('Unknown string format:', '2020-4-')

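How can I modify the code so that it returns the location of the offending date cell, to help me check it? Thanks.

Output of @jezrael's code: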

                date1            date2              date3
1              2020.4.20  2020-05-20 00:00:00  2023-05-19 00:00:00
4             2020年4月23日  2020-05-01 00:00:00  2022-04-30 00:00:00
5             2020年4月12日           2020年4月20日  2022-04-19 00:00:00
16             2020年5月6日  2020-05-07 00:00:00            2022年5月8日
46   2020-03-20 00:00:00  2020-04-01 00:00:00  2022-03-31 00:00:00
48   2020-03-15 00:00:00  2020-03-20 00:00:00  2021-03-19 00:00:00
53   2020-04-01 00:00:00  2020-05-01 00:00:00  2025-04-30 00:00:00
54   2020-04-03 00:00:00  2020-04-03 00:00:00  2022-04-02 00:00:00
57   2020-04-14 00:00:00  2020-04-20 00:00:00  2021-04-19 00:00:00
58             2020年4月3日  2020-04-18 00:00:00           2022年4月17日
60   2020-04-30 00:00:00           2020年5月10号  2022-05-09 00:00:00
62             2020年5月7日  2020-05-06 00:00:00  2021-05-05 00:00:00
93             2020年5月2号  1900-01-05 02:52:48            2022年5-14
95             2020年5月5日           2020年5月10日  2022-05-09 00:00:00
96            2020年4月10日           2020年4月10日  2022-04-09 00:00:00
99   2020-05-11 00:00:00  2020-05-11 00:00:00  2022-05-10 00:00:00
121           2020年4月15号  2020-03-01 00:00:00           2023年2月28日
178           2020年4月30日              2020年4月  2022-02-28 00:00:00
180           2020年5月18日  2020-05-20 00:00:00  2022-05-19 00:00:00
186           2020年4月28日           2020年4月30日  2022-04-29 00:00:00
196           2020年5月18号  2020-05-20 00:00:00  2022-05-19 00:00:00
197           2020年3月18号           2020年3月18日  2022-02-28 00:00:00
231             2020-5-8  2020-05-08 00:00:00             2023-5-8

From print(df.loc[mask.any(axis=1), mask.any().reindex(df.columns, fill_value=False)].to_dict('l')):

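To find the rows with datetime errors, you can slightly modify your solution: add errors='coerce' to to_datetime so that unparseable dates become missing values, then test for the missing values (isna) with any: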

print (df)
   id       date1       date2       date3
0   1   2020-5-10   2020-5-18   2021-5-17
1   2   2020.4.20   2020/5/20   2023/5/19
2   4   2020年5月7日   2020年5月20   2023年5月19
3   5  2020年4月23日    2020/5/1   2022/4/30
4   6  2020年4月12日  2020年4月20日   2022/4/19
5   7   2020年5月8日   2020年5月8日   2022年5月8日
6   8   2020年5月3号   2020年5月8号     2022年5月 <- error
7  12  2020—05—10  2020—05—11  2021—05—10
8  13  2020—05—08  2020—05—15  2022—05—14
9  14  2020年3月15日  2020年3月15日  2023年3月14日


def date_manipulate(x):
    x = x.str.strip()
    x = x.str.replace(' ', '')
    x = x.str.replace('年', '-')
    x = x.str.replace('月', '-')
    x = x.str.replace('日', '')
    x = x.str.replace('号', '')
    x = x.str.replace('.', '-')
    x = x.str.replace('—', '-')
    x = pd.to_datetime(x, errors='coerce')
    return x
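
Two details of the function above can be checked on a small sample before running it on the real data. The sketch below is only an illustration under assumed data: the Series s and every variable name are hypothetical, and it assumes some cells were read from Excel as datetime objects rather than strings. It shows that .str methods return NaN for non-string values, and that an unescaped '.' passed to Series.replace with regex=True matches every character.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample: one text date and one cell that is already a datetime object
s = pd.Series(['2020.4.20', dt.datetime(2020, 3, 20)], dtype=object)

# .str methods only handle strings; the datetime cell comes back as NaN
stripped = s.str.strip()

# With regex=True an unescaped '.' means "any character"
unescaped = s.replace('.', '-', regex=True)

# Escaping the dot replaces literal dots only and leaves the datetime cell alone
escaped = s.replace(r'\.', '-', regex=True)

print(stripped.tolist())
print(unescaped.tolist())
print(escaped.tolist())
```

If the first printed list contains NaN where a date used to be, that cell was not a string to begin with, and it will be reported as unparseable even though it is a valid date.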

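Do you know the encoding of the strings in the date columns? That might make the cleaning easier...
Sorry, I don't know. These dates were all filled in by hand, so they are very messy.
@ahbon - then use
2020年4.日18日 and 2020年4.月
@ahbon - working on it.
@ahbon - I found the reason: the column contains datetime values, so .str.strip() returns NaNs for them :(
You're right, when I comment out .str.strip() it solves the problem. Thank you very much.
@ahbon - because . is a special regex character, it matches every value, so it needs to be escaped as \. to work like: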
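import pandas as pd

def date_manipulate(x):
    # drop spaces and the day suffixes
    x = x.replace([' ', '日', '号'], '', regex=True)
    # raw string so that \. is a literal dot, not "any character"
    x = x.replace(['年', '月', '—', r'\.'], '-', regex=True)
    # unparseable dates become NaT instead of raising ValueError
    x = pd.to_datetime(x, errors='coerce')
    return x

date_cols = ['date1', 'date2', 'date3']

# True wherever a cell could not be parsed
mask = df[date_cols].apply(date_manipulate).isna()
print (df.loc[mask.any(axis=1), mask.any().reindex(df.columns, fill_value=False)])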

     date3
6  2022年5月
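
If an explicit list of cell locations is more convenient than a filtered frame, the same mask can be turned into (row label, column, original value) triples. This is a minimal sketch rather than part of the original answer: the three-row frame, the normalize helper, the bad_cells name, the extra '/' replacement, and the fixed format='%Y-%m-%d' are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with one incomplete date in date3
df = pd.DataFrame({
    'id': [1, 2, 8],
    'date1': ['2020-5-10', '2020.4.20', '2020年5月3号'],
    'date3': ['2021-5-17', '2023/5/19', '2022年5月'],
})

def normalize(col):
    # drop spaces and day suffixes, then unify every separator to '-'
    col = col.replace([' ', '日', '号'], '', regex=True)
    col = col.replace(['年', '月', '—', '/', r'\.'], '-', regex=True)
    # a fixed format avoids per-column format inference; failures become NaT
    return pd.to_datetime(col, format='%Y-%m-%d', errors='coerce')

date_cols = ['date1', 'date3']
mask = df[date_cols].apply(normalize).isna()

# one (row label, column name, original value) triple per unparseable cell
bad_cells = [(row, col, df.at[row, col])
             for col in date_cols
             for row in mask.index[mask[col]]]
print(bad_cells)
```

The row labels come from the dataframe index, so if the frame was read from Excel with a default index, the spreadsheet row is offset from the label by the header row and by Excel's 1-based numbering.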