
Python: proper data manipulation script layout where all merges, drops, aggregations, and renames are easy to track and review


I currently have a very long script with a single goal: take multiple CSV data tables, merge them into one while performing various calculations along the way, and then output the final CSV table.

I originally had this layout (see Layout A), but found it made it hard to see which columns were added or merged, because the cleaning and manipulation functions are listed below everything else, so you have to scroll up and down the file to see how the tables are being changed. It was an attempt to follow the whole "keep things modular and small" approach I had read about:

# LAYOUT A
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)   

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x,y,z):
    #<calculations for performing on table column>

def some_other_operation(a,b):
    #<some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400
    def do_operations_unique_to_table1(df):
        #<operations>
        return df
    df = do_operations_unique_to_table1(df)

    return df

def clean_table_2(fn_2):
    #<similar to clean_table_1>

def clean_table_3(fn_3):
    #<similar to clean_table_1>

if __name__=='__main__':
    main()
My next inclination was to move all the functions inline with the main script, so it is obvious what is being done (see Layout B). This makes it easier to follow the linear flow of operations, but it also makes things messier, so you can no longer quickly read through the main function to get an "overview" of everything being done.

# LAYOUT B
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400
        def do_operations_unique_to_table1(df):
            #<operations>
            return df
        df = do_operations_unique_to_table1(df)
        return df

    df1 = clean_table_1('table1.csv')


    def clean_table_2(fn_2):
        #<similar to clean_table_1>

    df2 = clean_table_2('table2.csv')


    def clean_table_3(fn_3):
        #<similar to clean_table_1>

    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>

    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>

    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)   

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()
So then I thought: why have these functions at all? Wouldn't it be easier to follow if everything were at the same level, like this script (Layout C):

# LAYOUT C
import pandas as pd
#...

SOME_MAPPER = {'a':1, 'b':2, ...}
COLUMNS_RENAMER = {'new_col1': 'aaa', ...}

def main():

    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    df1 = #<operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    df2 = #<operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    df3 = #<operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x,y,z):
        #<calculations for performing on table column>

    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)

    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a,b):
        #<some calculation>

    df['new_col3'] = df['something']+df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)

    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__=='__main__':
    main()

The crux of the problem is finding a balance between clearly documenting which columns are being updated, changed, dropped, renamed, merged, etc., while still staying modular enough to fit the "clean code" paradigm.

Also, in practice this script and others like it are much longer, with many more tables being merged into the mix, so it quickly becomes a very long list of operations. Should I break the operations into smaller files and output intermediate files, or is that just asking for errors to be introduced? It comes down to being able to see all the assumptions made along the way, and how they affect the data in its final state, without having to jump between files or scroll up and down to trace the data from A to B, if that makes sense.


If anyone has insight into how best to structure these kinds of data cleaning/manipulation scripts, I would love to hear it.

This is a very subjective topic, but here is my typical approach, with some remarks and tips:

  • Optimize for debugging/development time and ease of use as much as possible
  • Split the process into multiple scripts (e.g. download, preprocess, etc. each table separately, so that each table is prepared for the merge on its own)
  • Try to keep the same order of operations within each script (e.g. type corrections, filling NAs, scaling, new columns, dropping columns)
  • Each wrangling script starts with a load and ends with a save (see the sketch after this list)
  • Save to pickle (to avoid issues such as dates being stored as strings) and to a small CSV (so results can be previewed easily outside Python)
  • Since the "integration point" is the data, you can easily combine different technologies (note: in that case you would typically not use pickle as the output, but CSV or another format)
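
To make the "load at the start, save at the end" pattern concrete, here is a minimal sketch of one per-table wrangling script. The file and column names (wrangle_table1.py, table1.csv, date, some_col, unused_col) are hypothetical and not taken from the question:

# wrangle_table1.py -- hypothetical per-table script: load at the top,
# a fixed order of operations in the middle, save at the bottom
import pandas as pd

def main():
    # load the raw input
    df = pd.read_csv('table1.csv')

    # keep the same order of operations in every wrangle script:
    # type corrections, filling NAs, new columns, dropped columns
    df['date'] = pd.to_datetime(df['date'])
    df = df.fillna({'some_col': 0})
    df['some_col1'] = 400
    df = df.drop(columns=['unused_col'])

    # save: pickle keeps the dtypes intact, a small CSV makes the
    # result easy to preview outside Python
    df.to_pickle('table1_clean.pkl')
    df.head(100).to_csv('table1_clean_preview.csv', index=False)

if __name__ == '__main__':
    main()

A separate merge script would then start by loading the cleaned pickles with pd.read_pickle, do only the merges and cross-table calculations, and save the final CSV, so each file documents one stage of the pipeline.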