Python: trying to merge into a DataFrame, but it keeps creating new columns

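I am trying to open files, extract 2 columns (1 row each) from multiple spreadsheets, and merge them into a base spreadsheet. The base DataFrame (derived from a spreadsheet, of which I only need 3 columns) looks like this: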

Model |  Roadmap | Family
a       08/12/17  ROW
b       08/14/17  MACRO 
c       08/15/17  CONN 
d       08/27/17  MACRO 
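The DataFrames from the multiple spreadsheets (the model name is the spreadsheet name, and each spreadsheet holds several dates per gate, which I extract into separate DataFrames) have the following format: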

df1 (part 1, the DataFrame derived from the spreadsheet with model a for gate 0):
Model |  Gate 0
a       02/01/18

df1 (part 2, the DataFrame derived from the spreadsheet with model a for gate 1):
Model |  Gate 1
a       03/01/18

df2 (part 1):
Model |  Gate 0
b       04/23/18

df2 (part 2):
Model |  Gate 1
b       05/23/18
The output it produces is:

Model |  Roadmap | Family | Gate 0_x  | Gate 1_x   | gate 0_y | Gate 1_y
a       08/12/17  ROW      02/01/18   03/01/18  
b       08/14/17  MACRO                              04/23/18  05/23/18     
c       08/15/17  CONN
d       08/27/17  MACRO 
The output I want:

  Model |  Roadmap | Family | Gate 0   | Gate 1   
   a       08/12/17  ROW     02/01/18   03/01/18
   b       08/14/17  MACRO    04/23/18  05/23/18 
   ..
Here is the code I am using:

import glob
import pandas as pd
import re
import ntpath




extension = 'xlsx'
d='Final.xlsx'
c = 'Roadmap.xlsx'
dflist = []
z=[]
result = [i for i in glob.glob('*.{}'.format(extension))]

for b in result:
    if b==c:
        base_file = pd.read_excel(b, sheet_name='Antennas', header=7)
        ind1 = base_file.set_index('Model')
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        #print(ind1)
        ind1.to_excel('Final.xlsx')
        file3 = pd.read_excel('Final.xlsx')
        file3= file3.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)



for a in result:

        if a == c:
            base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            ind1.to_excel('Final.xlsx')
        elif a != d:
            gates = ['Gate 0 Complete','Gate 1 Complete'] 
            file1 = pd.read_excel('Final.xlsx')
            file1= file1.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)     
            #print(file1)
            file = pd.read_excel(a, sheet_name='Timeline')
            #print(file)
            models = pd.DataFrame([['','']], columns=['Model', gates])
            for g in gates:      
                z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
                v=ntpath.basename(a)
                v = v[5:-5]
                models = pd.DataFrame([[v,z]], columns =['Model',g])
                file1 = pd.merge(file1, models, how='left', on='Model')
            file3 = pd.merge(file3, file1, how='left', on=['Model','Roadmap','Family'])
            file3.to_excel('new.xlsx')

file3 is the DataFrame I open as the base file before the for loop. Please let me know if anything is unclear.

Currently you are merging twice, but what you really need is to merge the base with each individual df and then append all the dfs together.
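To make the cause of the `_x` / `_y` columns concrete, here is a minimal sketch. The frames `base`, `df1` and `df2` are small stand-ins built inline for illustration, not the real Excel data:

```python
import pandas as pd

# Illustrative stand-ins for the base sheet and two per-model frames.
base = pd.DataFrame({"Model": ["a", "b"], "Roadmap": ["08/12/17", "08/14/17"]})
df1 = pd.DataFrame({"Model": ["a"], "Gate 0": ["02/01/18"]})
df2 = pd.DataFrame({"Model": ["b"], "Gate 0": ["04/23/18"]})

# Merging the per-model frames one after another: on the second merge,
# "Gate 0" exists on both sides, so pandas keeps both copies and
# disambiguates them with the default _x / _y suffixes.
step1 = pd.merge(base, df1, how="left", on="Model")
step2 = pd.merge(step1, df2, how="left", on="Model")
print(list(step2.columns))  # ['Model', 'Roadmap', 'Gate 0_x', 'Gate 0_y']

# Stacking the per-model frames first and merging once keeps one column.
once = pd.merge(base, pd.concat([df1, df2], ignore_index=True),
                how="left", on="Model")
print(list(once.columns))   # ['Model', 'Roadmap', 'Gate 0']
```

Each extra file merged the first way would add another suffixed copy of every gate column, which matches the output shown in the question.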

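Below I recreate the examples you posted above, in the same structure as the Excel files, and demonstrate the merge and append steps. You will notice that `drop_duplicates` is used, because the left-join merge renders identical row values. Keep or remove this method on your actual data.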

Data

from io import StringIO
import pandas as pd

# Base frame: one row per model
txt = '''
Model  Roadmap  Family
a      some_date  some
b      some_date  some
c      some_date  some
d      some_date  some
'''
base_df = pd.read_table(StringIO(txt), sep=r"\s+")

# Per-model frame for model a (quoted headers keep the space in "Gate 0")
txt = '''
Model  "Gate 0" "Gate 1"
    a   some_date  some
'''
df1 = pd.read_table(StringIO(txt), sep=r"\s+")

# Per-model frame for model b
txt = '''
Model  "Gate 0" "Gate 1"
    b   some_date  some
'''
df2 = pd.read_table(StringIO(txt), sep=r"\s+")
Merge and append (using a list comprehension)
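A minimal sketch of this step, assuming frames shaped like `base_df`, `df1` and `df2` above. They are rebuilt inline with the dates from the question so the block runs on its own, and the code is an illustration rather than necessarily the exact code of the original answer:

```python
import pandas as pd

base_df = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["08/12/17", "08/14/17", "08/15/17", "08/27/17"],
    "Family": ["ROW", "MACRO", "CONN", "MACRO"],
})
df1 = pd.DataFrame({"Model": ["a"], "Gate 0": ["02/01/18"], "Gate 1": ["03/01/18"]})
df2 = pd.DataFrame({"Model": ["b"], "Gate 0": ["04/23/18"], "Gate 1": ["05/23/18"]})
dfList = [df1, df2]

# Variant 1: left-join the base onto each per-model frame, stack the
# results, and drop the rows that come out identical in every merge.
stacked = pd.concat(
    [pd.merge(base_df, df, how="left", on="Model") for df in dfList],
    ignore_index=True,
).drop_duplicates()

# Variant 2: stack the per-model frames first, then merge a single time.
final_df = pd.merge(base_df, pd.concat(dfList, ignore_index=True),
                    how="left", on="Model")
print(final_df)
```

With this sample data, variant 1 still keeps a gate-less row for models a and b next to the filled one, so variant 2 is the one that produces exactly one row per model as in the desired output.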

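To integrate this into your current process, consider appending the individual models to a list so they can be concatenated and merged at the end. Build the base df as in the example posted above.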

...
dfList = []

for g in gates:      
     z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
     v = ntpath.basename(a)
     v = v[5:-5]
     mod = pd.DataFrame([[v,z]], columns =['Model',g])
     models = pd.merge(models, mod, how='left', on='Model')
dfList.append(models)

finaldf = pd.merge(base_df, pd.concat(dfList), how='left', on='Model')
finaldf.to_excel('Final_Dataset.xlsx')

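Regarding your original data, I assume the following:

  • Step 0 - Part 1. You load df_base.
  • Step 0 - Part 2. You load df1, df2, etc., one df per sheet.

My approach is then to perform the following steps, in order:

  • Vertically concatenate the dfs of all sheets into a single DataFrame named df_sheets.
  • Merge df_base with df_sheets to obtain the desired output.

Based on this, my approach is: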

    import pandas as pd
    
    # STEP 0.
    cv = ['a','b','c','d']
    nr = 4
    
    # STEP 0 - Part 1. Load Base DF
    cv = cv[:nr]
    df_base = pd.DataFrame(list(zip(cv, ['some_date']*nr, ['some']*nr)),
                           columns=['Model','Roadmap','Family'])
    
    # STEP 0 - Part 2. Load Sheets DataFrames
    df_sheets = []
    for alph in cv:
        df_sheet = pd.DataFrame(list(zip([alph]*nr, ['some_date_'+alph]*nr, ['some_'+alph]*nr)),
                                columns=['Model','Gate0','Gate1'])
        df_sheets.append(df_sheet)
    print('Base DF:\n{}' .format(df_base))
    
    
    # STEP 1. Vertically concatenate all sheets DataFrames together
    df_sheets = pd.concat(df_sheets, axis=0).reset_index(drop=True)
    print('\nDataFrames for all sheets (vertically concatenated into single DataFrame):\n{}'
        .format(df_sheets))
    
    
    # STEP 2. base INNER JOIN sheets USING ('Model')
    ndf = df_base.merge(df_sheets, on='Model', how='inner')
    print('\nOutput DataFrame:\n{}' .format(ndf))
    
    Output:

    Base DF:
      Model    Roadmap Family
    0     a  some_date   some
    1     b  some_date   some
    2     c  some_date   some
    3     d  some_date   some
    
    DataFrames for all sheets (vertically concatenated into single DataFrame):
       Model        Gate0   Gate1
    0      a  some_date_a  some_a
    1      a  some_date_a  some_a
    2      a  some_date_a  some_a
    3      a  some_date_a  some_a
    4      b  some_date_b  some_b
    5      b  some_date_b  some_b
    6      b  some_date_b  some_b
    7      b  some_date_b  some_b
    8      c  some_date_c  some_c
    9      c  some_date_c  some_c
    10     c  some_date_c  some_c
    11     c  some_date_c  some_c
    12     d  some_date_d  some_d
    13     d  some_date_d  some_d
    14     d  some_date_d  some_d
    15     d  some_date_d  some_d
    
    Output DataFrame:
       Model    Roadmap Family        Gate0   Gate1
    0      a  some_date   some  some_date_a  some_a
    1      a  some_date   some  some_date_a  some_a
    2      a  some_date   some  some_date_a  some_a
    3      a  some_date   some  some_date_a  some_a
    4      b  some_date   some  some_date_b  some_b
    5      b  some_date   some  some_date_b  some_b
    6      b  some_date   some  some_date_b  some_b
    7      b  some_date   some  some_date_b  some_b
    8      c  some_date   some  some_date_c  some_c
    9      c  some_date   some  some_date_c  some_c
    10     c  some_date   some  some_date_c  some_c
    11     c  some_date   some  some_date_c  some_c
    12     d  some_date   some  some_date_d  some_d
    13     d  some_date   some  some_date_d  some_d
    14     d  some_date   some  some_date_d  some_d
    15     d  some_date   some  some_date_d  some_d
    

Is this what you wanted?
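The sample above repeats every model `nr` times per sheet, which is why the merged output has 16 rows. As a follow-up sketch, assuming each sheet contributes exactly one row per model (an assumption, not something the answer states), the `validate` and `indicator` options of `merge` can confirm that the result keeps one row per model and show which models had no sheet:

```python
import pandas as pd

df_base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["some_date"] * 4,
    "Family": ["some"] * 4,
})

# One single-row frame per sheet; only models a and b have a sheet here.
sheets = [
    pd.DataFrame({"Model": [m], "Gate0": ["some_date_" + m], "Gate1": ["some_" + m]})
    for m in ["a", "b"]
]
df_sheets = pd.concat(sheets, ignore_index=True)

# validate raises MergeError if "Model" is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from.
ndf = df_base.merge(df_sheets, on="Model", how="left",
                    validate="one_to_one", indicator=True)
print(ndf)
```

A left join is used here instead of the inner join above so that models without a sheet (c and d) stay in the result with empty gate columns.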

I figured out how to do it. Please let me know if you spot any issues.

    import glob
    import pandas as pd
    import re
    import ntpath
    
    extension = 'xlsx'
    d='Final.xlsx'
    c = 'Roadmap.xlsx'
    dflist = []
    z=[]
    result = [i for i in glob.glob('*.{}'.format(extension))]
    
    for a in result:
    
        if a == c:
            base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
            ind1 = base_file.set_index('Model')
            ind1 = base_file[['Model', 'Roadmap', 'Family']]
            #print(ind1)
            ind1.to_excel('Final.xlsx')
        elif a != d:
            v=ntpath.basename(a)
            v = v[5:-5]
            gates = ['Gate 0 Complete','Gate 1 Complete', 'Gate 2 Complete'] 
            file1 = pd.read_excel('Final.xlsx')
            file1= file1.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)     
            #print(file1)
            file = pd.read_excel(a, sheet_name='Timeline')
            #print(file)
            models = pd.DataFrame([[v]], columns=['Model'])
            #print(models)
            for g in gates:      
                z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
                #print(z)
                #v = re.findall(r'Scrum(\w+)', a)
                #print(v)
                #df1=pd.DataFrame([[v,z]], columns = ['Model',g])
                mod = pd.DataFrame([[v,z]], columns =['Model',g])
                models=pd.merge(models, mod, how='left', on='Model')
                #print(models)
            dflist.append(models)
            #print(dflist)
    file1 = pd.merge(file1,pd.concat(dflist), how='left',on='Model')
    file1.to_excel('new.xlsx')
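The per-file part of the script above can also be written without the repeated one-column merges. The following is only a sketch under stated assumptions: `gates_row`, `timeline_a` and `timeline_b` are hypothetical names, and small in-memory frames stand in for the `Timeline` sheets that the real script reads with `pd.read_excel`:

```python
import pandas as pd

GATES = ["Gate 0 Complete", "Gate 1 Complete"]

def gates_row(model, timeline, gates=GATES):
    """Build a one-row DataFrame: the model name plus one column per gate."""
    row = {"Model": model}
    for g in gates:
        match = timeline.loc[timeline["Task"] == g, "Complete"]
        # A gate missing from the sheet becomes None instead of an IndexError.
        row[g] = match.iloc[0] if not match.empty else None
    return pd.DataFrame([row])

# Stand-ins for two Timeline sheets (hypothetical sample data).
timeline_a = pd.DataFrame({"Task": ["Gate 0 Complete", "Gate 1 Complete"],
                           "Complete": ["02/01/18", "03/01/18"]})
timeline_b = pd.DataFrame({"Task": ["Gate 0 Complete"],
                           "Complete": ["04/23/18"]})

base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["08/12/17", "08/14/17", "08/15/17", "08/27/17"],
    "Family": ["ROW", "MACRO", "CONN", "MACRO"],
})

# Collect one row per model, stack them, and merge into the base once.
rows = [gates_row("a", timeline_a), gates_row("b", timeline_b)]
final = pd.merge(base, pd.concat(rows, ignore_index=True), how="left", on="Model")
print(final)
```

Building each row as a dict keeps the gate handling in one place, and the base frame stays in memory instead of being written to and re-read from `Final.xlsx` on every iteration.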
    

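Are the column names exactly the same? You may want to add some sample data from the actual files and remove unnecessary parts of the code (such as what looks like the regex) to help others spot the error faster.

@MaartenFabré's comment happens more often than you would think; try stripping leading/trailing spaces.

@MaartenFabré They are the same. When I merge the first file it works fine, but when the second file (DataFrame) comes in, it adds the Roadmap and Family columns again, and then it adds the two gates with a _y suffix, and the first 2 gates with a _x suffix. I fixed the Roadmap and Family columns being added again by merging on those columns. Please see the edited program. @MattR is that better?

Is this a problem because the merge does not specify a join on those gate columns? So file3 and file1 have the same structure, except that file1 has the new Gate 1 and Gate 0 columns?

But you are considering the case where df1 and df2 exist at the same time, right? I first create df1 and merge it with the base df, then create df2 and merge it with the base df. I don't think that can be done with your code, can it?

Do this: create all the dfs in one process (no merging/appending), then run this process, which reads all the dfs into a list and runs the list comprehension approach. You can even reverse the order, concatenating the list of dfs first and then merging once: pd.merge(base_df, pd.concat(dfList), how='left', on='Model'). No need to even output to Excel!

You mean create the DataFrames and insert them into a list?

Got it... I want to capture the final dataset in Excel.

Convert the DataFrame in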