Python: Calculating elapsed days from DataFrame strings

python python-2.7 pandas

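I have a DataFrame that stores the dates of people's trips. I want to add a column showing the length of each stay. To do that, I need to parse the string, convert it to datetime, and subtract. Pandas seems to apply the datetime conversion to the whole Series rather than to an individual string, because I get TypeError: must be string, not Series. I would prefer a non-looping option since the real dataset is quite large, but I need some help.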

import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
df['Length of Stay'] = (datetime.strptime(df['Day of Visit'][:11], '%d %b %Y') - datetime.strptime(df['Day of Visit'][-11:], '%d %b %Y')).days + 1
print df

Expected output:

    Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5
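For context, here is a minimal sketch of why the attempt above fails. datetime.strptime parses one string at a time, while slicing a Series with [:11] selects rows rather than the first 11 characters of each value. The names visits, row_slice and first_dates are illustrative only, and the exact wording of the error differs between Python 2 and Python 3.

```python
from datetime import datetime

import pandas as pd

visits = pd.Series(['12 Mar 2015 - 31 Mar 2015', '27 Mar 2015 - 31 Mar 2015'])

# Slicing a Series selects rows, not characters: the result is still a Series.
row_slice = visits[:11]

# strptime accepts a single string, so passing a Series raises TypeError.
try:
    datetime.strptime(row_slice, '%d %b %Y')
    raised = False
except TypeError:
    raised = True

# The .str accessor slices the characters of every element instead.
first_dates = visits.str[:11]
print(raised)
print(first_dates.tolist())
```

The .str accessor is what makes string operations apply element by element without an explicit loop.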

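Use Series.str.extract to split the 'Day of Visit' column into two separate columns. Then use pd.to_datetime to parse the columns as dates. The 'Length of Stay' can then be computed by subtracting the date columns and adding 1: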

import pandas as pd

df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
# Capture the text before and after the hyphen, then parse both columns as dates.
tmp = df['Day of Visit'].str.extract(r'([^-]+)-(.*)', expand=True).apply(pd.to_datetime)
# Subtracting gives timedeltas; add 1 so both the first and last day are counted.
df['Length of Stay'] = (tmp[1] - tmp[0]).dt.days + 1
print(df)

yields

     Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5

The pattern ([^-]+)-(.*) means:

(              # start group #1
  [            # begin character class
    ^-         # any character except a literal minus sign `-`
  ]            # end character class 
   +           # match 1-or-more characters from the character class
)              # end group #1
-              # match a literal minus sign 
(              # start group #2
  .*           # match 0-or-more of any character
)              # end group #2
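To see what the two groups capture, the same pattern can be tried on a single string with Python's standard re module. This is only an illustration; the variable names are not from the original answer.

```python
import re

pattern = re.compile(r'([^-]+)-(.*)')
match = pattern.match('12 Mar 2015 - 31 Mar 2015')

# Group 1 is everything before the first '-', group 2 is everything after it.
start_text, end_text = match.groups()

# Both groups keep the space that surrounded the hyphen.
print(repr(start_text))
print(repr(end_text))
```

Because the captured text keeps the surrounding spaces, calling .str.strip() on the extracted columns before parsing avoids relying on the date parser to tolerate them.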

.str.extract returns a DataFrame whose columns contain the text matched by group 1 and group 2.
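If the regex feels opaque, a possible alternative is to split on the separator instead. The sketch below is not part of the original answer: it assumes every value uses exactly ' - ' between the two dates and that both dates follow the '%d %b %Y' format shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['Bob', '12 Mar 2015 - 31 Mar 2015'],
          ['Jessica', '27 Mar 2015 - 31 Mar 2015']],
    columns=['Names', 'Day of Visit'])

# Split on the separator; expand=True gives one column per part (0 and 1).
parts = df['Day of Visit'].str.split(' - ', expand=True)

# An explicit format avoids having pandas guess the date layout.
start = pd.to_datetime(parts[0], format='%d %b %Y')
end = pd.to_datetime(parts[1], format='%d %b %Y')

# Inclusive day count: difference in days plus one.
df['Length of Stay'] = (end - start).dt.days + 1
print(df)
```

Both approaches stay vectorized, so neither needs an explicit Python loop over the rows.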

Thanks. I need to study regular expressions, because this one looks like gibberish to me. If I don't want to keep the helper columns Start and End, is it best to delete them right after use?

I've modified the code to keep the helper columns in tmp.
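For comparison, here is a rough sketch of the other option raised in the comments: attaching Start and End to the DataFrame and dropping them after use. The column names come from the comment; the named groups, the .str.strip() call and the explicit date format are assumptions added for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['Bob', '12 Mar 2015 - 31 Mar 2015'],
          ['Jessica', '27 Mar 2015 - 31 Mar 2015']],
    columns=['Names', 'Day of Visit'])

# Named groups become the column names of the extracted DataFrame.
helpers = df['Day of Visit'].str.extract(
    r'(?P<Start>[^-]+)-(?P<End>.*)', expand=True)

# Strip the spaces around the hyphen, then parse each helper column as dates.
helpers = helpers.apply(
    lambda col: pd.to_datetime(col.str.strip(), format='%d %b %Y'))

# Attach the helper columns, compute the stay, then drop the helpers again.
with_helpers = df.join(helpers)
with_helpers['Length of Stay'] = (
    with_helpers['End'] - with_helpers['Start']).dt.days + 1
result = with_helpers.drop(columns=['Start', 'End'])
print(result)
```

Keeping the helpers in a separate temporary DataFrame, as the answer does with tmp, skips the join and drop steps entirely, which is probably the simpler choice when the helper columns are not needed afterwards.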