Python: change column values when applying another function

python pandas

I have a DataFrame in pandas in which one column contains time intervals represented as strings, such as "P1Y4M1D".

A sample of the full CSV looks like this:

oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...

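I wrote a parse function that takes a string such as "P1Y4M1D" and returns an integer. I would like to know how to use this function to change all of the column values to the parsed values.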

import re

import pandas as pd


def do_process_citation_data(f_path):
    global my_ocan

    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)


    return my_ocan


def parse():
     mydict = dict()
     mydict2 = dict()
     i = 1
     r = 1
     for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i +=1
     print(mydict)
     for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
           #mydict2[key] = -daystotal
            return -daystotal
     #print(mydict2)
     #return mydict2

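Maybe I do not even need to replace the whole column with the parsed values. The end goal is to write a new function that returns the average ['timespan'] of the documents created in a particular year. Since I need the parsed values, I thought it would be easier to change the whole column and work with the new DataFrame.

Also, I am curious how to apply the parse function to each row without modifying the DataFrame. I can only assume it would be something like this, but I do not fully understand how to do it: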

      for x in my_ocan['timespan']:
          x = parse(str(my_ocan['timespan']))

How can I get the column with the new values? Thanks a lot. Peace :)

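df['timespan'].apply(parse)

(as mentioned by @Dan) should work. You just need to modify the parse function so that it receives the string as an argument and returns the parsed string at the end. Something like this: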

import pandas as pd

def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters


# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)

print(df['Postal Code Letter'])
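
Applied to the timespan column from the question, the same idea could look roughly like the sketch below. It reuses the regular expression and the 365/30-day approximation from the question's code. The names parse_timespan, DURATION_RE and timespan_days are made up for illustration, and the sample rows are invented rather than taken from the real CSV.

```python
import re

import pandas as pd

# Same pattern as in the question: optional years, months and days after "P".
DURATION_RE = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def parse_timespan(value):
    """Convert a string such as 'P1Y4M1D' or '-P2Y' to an approximate number of days."""
    value = str(value)
    is_negative = value.startswith("-")
    match = DURATION_RE.match(value[1:] if is_negative else value)
    if match is None:
        return 0  # same fallback as the question's code for unparseable input
    year, month, day = (int(num) if num else 0 for num in match.groups())
    # Approximation taken from the question: 365 days per year, 30 per month.
    days_total = year * 365 + month * 30 + day
    return -days_total if is_negative else days_total


# Small made-up frame standing in for the real CSV (assumption).
df = pd.DataFrame({"timespan": ["P2Y", "P1Y4M1D", "-P3M", "P10D"]})

# apply returns a new Series; df is unchanged until the result is assigned.
parsed = df["timespan"].apply(parse_timespan)

# Assigning stores the parsed values; use df["timespan"] = parsed to overwrite instead.
df["timespan_days"] = parsed
print(df)
```

Because the function takes one value and returns one value, it can be tested on a single string before being applied to the whole column.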

df['timespan'].apply(parse)? You should change your parse function to work on a single value, i.e. to take a timespan string such as 'P1Y4M1D' as input.

@Dan Thank you! :)
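
For the end goal stated in the question (the average timespan of documents created in a given year), one possible sketch follows. It assumes the timespans have already been converted to numbers in a column called timespan_days (a hypothetical name) and that every creation value is a string starting with a four-digit year, as in the sample row "1985-04". The function name mean_timespan_for_year and the data are invented for illustration.

```python
import pandas as pd

# Made-up rows; 'timespan_days' is a hypothetical column of already-parsed values.
df = pd.DataFrame({
    "creation": ["1985-04", "1985-11", "1990-01", "1990-06-15"],
    "timespan_days": [730, 486, 100, 300],
})


def mean_timespan_for_year(frame, year):
    """Mean of timespan_days over the rows whose creation date falls in `year`."""
    # Assumes each creation value begins with a four-digit year.
    years = frame["creation"].astype(str).str[:4].astype(int)
    return frame.loc[years == year, "timespan_days"].mean()


print(mean_timespan_for_year(df, 1985))

# The same idea for every year at once, using groupby on the extracted year.
years = df["creation"].astype(str).str[:4].astype(int)
print(df.groupby(years)["timespan_days"].mean())
```

Slicing the year out of the string sidesteps the mixed "YYYY-MM" and "YYYY-MM-DD" formats; converting with pd.to_datetime and using .dt.year would be an alternative if the dates are uniform.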