Split an Excel file into DataFrames, then create two new files from them in Python


python pandas dataframe


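Apologies for the poorly worded title. In this project I receive multiple Excel files that need to be processed and then sent back as multiple CSV files (which eventually go into BigQuery). The things I am trying to do are remove the last 6 rows (an unwanted watermark) and then create two separate CSV files. The Excel file looks like this:

I drop the last 6 rows with skipfooter, then create the first DataFrame (columns 1-181) and the second DataFrame (columns 182-225). I can split them, but I run into problems when using append or merge (I may be doing it incorrectly). What I want is to insert the PID into the second CSV as a new first column, filled down every row, something like this:

My main question is how to correctly attach (append) the PID to every row that needs it, and how to loop over the hundreds of Excel files I receive so that the correct PID ends up on the correct test records. At the moment I am working with a single file to see whether it works. In the code below, my append attaches index_df to second_df, but I am not sure how to fill the remaining rows with the same PID.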

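import pandas as pd

# Raw string so the backslashes in the UNC path are not treated as escapes
raw_data_frame = pd.read_excel(r'\\file01\incoming\mat\ID5.xlsx', skipfooter=6)
first_df = raw_data_frame.iloc[:, 1:182]
second_df = raw_data_frame.iloc[:, 182:225]
# Single cell that is intended to hold the PID
index_df = raw_data_frame.iloc[0:1, 4:5]
# DataFrame.append was removed in pandas 2.0; pd.concat stacks rows the same way
df_combine = pd.concat([index_df, second_df])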

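For the part you said is already working, you could use append and merge, but I think second_df.insert(0, column="PID", value=df["PID"]) followed by ffill works better in this case. To iterate over the xls files, you can use a for loop that finds all documents in a predefined folder. How the output files are generated has to be adapted to your problem; here I chose to put each pair of csv files in a new folder named after the corresponding PID number.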

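import pandas as pd
import glob
import os

INPUT_FOLDER = "input_xls"
OUTPUT_FOLDER = "output_xls"

# Index of the first column that belongs to the second file; change to 182 here
COL_SPLIT = 5

# Adjust the pattern (e.g. '*.xlsx') to match your files
for excel_file in glob.glob(os.path.join(INPUT_FOLDER, '*.xls')):

    # skipfooter drops the 6 watermark rows; dtype=str keeps IDs as text
    df = pd.read_excel(excel_file, skipfooter=6, dtype=str)
    print(df)

    # Left part: drop the rows that are completely empty
    first_df = df.iloc[:, :COL_SPLIT]
    first_df = first_df.dropna(how="all")

    # Right part: add PID as the new first column, then fill it down
    second_df = df.iloc[:, COL_SPLIT:]
    second_df = second_df.dropna(how="all")
    second_df.insert(0, column="PID", value=df["PID"])
    # Assign the result back instead of using inplace=True on a column selection
    second_df["PID"] = second_df["PID"].ffill()

    print(first_df)
    print(second_df)

    # The PID on the first row names the output folder
    pid = first_df.loc[0, "PID"]

    out_path = os.path.join(OUTPUT_FOLDER, f'PID-{pid}')
    os.makedirs(out_path, exist_ok=True)
    first_df.to_csv(os.path.join(out_path, "first.csv"), index=False)
    second_df.to_csv(os.path.join(out_path, "second.csv"), index=False)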

First DataFrame (first_df):

   PID Last First Gender Age
0  111  Guy  Some      M  35

Second DataFrame (second_df):

   PID Record# test1 test2 test3
0  111     222   378    24   371
1  111     223   319    28   311
2  111     224   207    20   210
3  111     225   100    30   200
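
To try the insert-then-ffill step without any Excel files, here is a minimal sketch on an in-memory DataFrame. The sample values and the reduced column set are made up for illustration, and it assumes the layout described in the question, where the PID appears only on the first row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: PID and personal details are filled only on row 0,
# while the test records occupy every row.
df = pd.DataFrame({
    "PID": ["111", np.nan, np.nan, np.nan],
    "Last": ["Guy", np.nan, np.nan, np.nan],
    "Record#": ["222", "223", "224", "225"],
    "test1": ["378", "319", "207", "100"],
})

COL_SPLIT = 2  # columns before this index go to first_df

first_df = df.iloc[:, :COL_SPLIT].dropna(how="all").copy()
second_df = df.iloc[:, COL_SPLIT:].dropna(how="all").copy()

# insert() aligns on the row index, so only row 0 has a PID at this point
second_df.insert(0, column="PID", value=df["PID"])
# Forward-fill copies the last seen PID down into the empty rows
second_df["PID"] = second_df["PID"].ffill()

print(first_df)
print(second_df)
```

Assigning the result of ffill() back to the column avoids relying on inplace=True on a column selection, which newer pandas versions warn about.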

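
If every file really contains a single PID, a possible alternative is to read that one value and pass it to insert() as a scalar, which pandas broadcasts to all rows so that no ffill is needed. The helper name split_with_pid and its parameters are hypothetical, and the one-PID-per-file assumption comes from the question rather than from anything verified here.

```python
import pandas as pd


def split_with_pid(df, col_split, pid_col="PID"):
    """Split df at col_split and put the file's single PID on every row of the second part."""
    first_df = df.iloc[:, :col_split].dropna(how="all").copy()
    second_df = df.iloc[:, col_split:].dropna(how="all").copy()
    # Assumption: one PID per file, taken from the first non-empty cell
    pid = df[pid_col].dropna().iloc[0]
    # A scalar value is broadcast to every row of the new column
    second_df.insert(0, column=pid_col, value=pid)
    return first_df, second_df, pid


# Hypothetical sample data standing in for one Excel file
sample = pd.DataFrame({
    "PID": ["111", None, None],
    "Last": ["Guy", None, None],
    "Record#": ["222", "223", "224"],
    "test1": ["378", "319", "207"],
})

first_df, second_df, pid = split_with_pid(sample, col_split=2)
print(pid)
print(second_df)
```

Inside the file loop, the returned pid can be used to name the output folder in the same way as in the answer's code.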