Python: changing column values by applying another function
I have a DataFrame in pandas in which one column contains time intervals represented as strings, such as "P1Y4M1D". A sample of the whole CSV looks like this:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
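I wrote a parse function that takes a string such as "P1Y4M1D" and returns an integer.
I would like to know how to use that function to replace all the values in the column with the parsed values.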
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
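Maybe I don't even need to replace the whole column with the parsed values; the end goal is to write a new function that returns the average time ['timespan'] of documents created in a specific year. Since I need the parsed values, I thought it would be easier to change the whole column and work with the new DataFrame.
Also, I'm curious how to apply the parse function to each row without modifying the DataFrame. I can only assume it might be something like this, but I don't fully understand how to do it: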
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
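How can I get the column with the new values? Many thanks. Peace :)
df['timespan'].apply(parse) (as @Dan mentioned) should work. You just need to modify the parse function so that it receives a single string as its argument and returns the parsed value at the end. Something like this: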
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
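Applied to the timespan column from the question, the same pattern might look roughly like the sketch below. The name parse_timespan, the column name timespan_days, and the sample values are made up for illustration; the regular expression and the 365-day year / 30-day month approximation are taken from the question's code, and unparseable strings fall back to 0 as they do there.

```python
import re
import pandas as pd

# Pattern from the question, anchored at both ends, with an optional leading '-'
TIMESPAN_RE = re.compile(r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def parse_timespan(value):
    """Convert one timespan string such as 'P1Y4M1D' into an approximate number of days."""
    match = TIMESPAN_RE.match(str(value))
    if match is None:
        return 0  # same fallback as the question's code for strings that do not match
    sign, years, months, days = match.groups()
    # Missing components come back as None, so treat them as 0
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total

# Illustrative data only, not taken from the real CSV
df = pd.DataFrame({'timespan': ['P2Y', 'P1Y4M1D', '-P3M', 'P10D']})

# The function is called once per cell, so it only ever sees a single string
df['timespan_days'] = df['timespan'].apply(parse_timespan)
print(df)
```

Writing to a new column keeps the original strings around; assigning back to df['timespan'] instead would overwrite them.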
df['timespan'].apply(parse)? You should change the parse function to handle a single value, i.e. take the timespan string as-is, such as 'P1Y4M1D', as input.
@Dan Thank you! :)
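For the end goal stated in the question, the mean timespan of documents created in a given year, one possible sketch that leaves the DataFrame untouched is shown here. The helper to_days and the sample rows are hypothetical, and it assumes every creation value starts with a four-digit year, as in the "1985-04" sample.

```python
import re
import pandas as pd

def to_days(value):
    # Hypothetical helper: 'P1Y4M1D' -> days, with 365-day years and 30-day months as in the question
    m = re.match(r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    if m is None:
        return 0
    sign, y, mo, d = m.groups()
    total = int(y or 0) * 365 + int(mo or 0) * 30 + int(d or 0)
    return -total if sign else total

# Made-up rows in the shape of the question's CSV
df = pd.DataFrame({
    'creation': ['1985-04', '1985-11', '1990-01'],
    'timespan': ['P2Y', 'P1Y', 'P6M'],
})

# apply() returns a new Series; df itself is not modified
days = df['timespan'].apply(to_days)

# Assumption: the first four characters of 'creation' are the year
years = df['creation'].str[:4].astype(int)

# Mean timespan in days for the documents created in each year
mean_by_year = days.groupby(years).mean()
print(mean_by_year)
print(mean_by_year.loc[1985])
```

Because days and years are separate Series that share the DataFrame's index, they can be grouped against each other without adding any column to df.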