How can I organize information from multiple .txt files into a database in Python?

Tags: python, python-3.x, excel, pandas, dataframe

The link above is to a result file named 19031783_result.txt. Each .txt file contains statistical results that I want to organize into a database.

The database output for all the result files should look like this:

So I have hundreds of result files that need to be merged into one database. The last three columns are based on the defect-count limit for each bin: for example, bin 1's limit is 10, bin 2's limit is 5, bin 3's limit is 3, and bin 4's limit is 0. So perfect means there are no defects at all, good means everything is within spec, and fail means something is above its limit.
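
A minimal sketch of that classification rule, assuming each file yields four bin counts in order (the function name and limits below are illustrative, not part of the original post):

bin_limits = [10, 5, 3, 0]

def classify(bins):
    # perfect: no defects in any bin
    if sum(bins) == 0:
        return 'perfect'
    # fail: at least one bin is above its limit
    if any(count > limit for count, limit in zip(bins, bin_limits)):
        return 'fail'
    # good: there are defects, but every bin is within spec
    return 'good'

print(classify([0, 0, 0, 0]))   # perfect
print(classify([4, 2, 1, 0]))   # good
print(classify([12, 2, 1, 0]))  # fail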

I don't have much experience with Python and need some guidance on how to create this database from the .txt files. Python would be the better tool here because it can handle large amounts of data and is faster.

import os
import pandas as pd
from glob import glob

stock_files = sorted(glob('*result.txt'))

stock_files

df = pd.concat([pd.read_csv(file, sep="\t").assign(filename = file) for file in stock_files], ignore_index = True)

df = pd.DataFrame() #this is the bit I am stuck on
This is my current output, which I need to clean up and convert into a database. I have a screenshot of the Excel spreadsheet:

Start simple - Python may not be your friend for basic data-hoovering and reshaping activities.

I'll demonstrate how to merge all the files into a single file with basic OS commands, in both bash and Windows cmd.

For bash, install WSL and your preferred Linux distribution. Without assuming too much, the bash script is almost a one-liner: read the list of result files, cat each one, prepend the file name to each of its lines, and store everything in a single file, AllResultsFiles.txt.

ls -1 *_result.txt | while read fname; do cat "$fname" | while read _line; do echo "$fname:$_line"; done; done > AllResultsFiles.txt
For Windows it's a little more involved, but the idea is the same. Store this in a file C:\users\me\data\mergers.cmd:
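
A minimal sketch of what mergers.cmd might contain, mirroring the bash one-liner above (the redirection is written first so lines ending in a digit aren't misread as a file-handle redirect):

@echo off
del AllResultsFiles.txt 2>nul
rem prepend each file's name to each of its lines, colon-separated
for %%f in (*_result.txt) do (
    for /f "usebackq delims=" %%l in ("%%f") do >> AllResultsFiles.txt echo %%f:%%l
)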

Then run it as follows:

C:\users\me\data\> mergers.cmd
This creates a single file containing the contents of all the files, with the original file's name as the first column of every line, separated by a colon :.

You can then easily import it, using the colon as the delimiter, into a spreadsheet or into a database table with three columns:

create table imported_results_statistics 
(Orig_filename varchar(100),
Metric varchar(200),
Value int
)
Once it's imported into a database table, you can use SQL to create a new table, pivoting each group of records by file name.

sqlite is simple to use, but it takes a few more steps.
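
If you'd rather stay in Python for the import step, here is a minimal sqlite3 sketch of loading AllResultsFiles.txt into the table above (it assumes each merged line looks like filename:Metric: Value; sqlite's dynamic typing happily stores the decimal percent value despite the int declaration):

import sqlite3

conn = sqlite3.connect('results.db')
conn.execute("""create table if not exists imported_results_statistics
                (Orig_filename varchar(100), Metric varchar(200), Value int)""")

rows = []
with open('AllResultsFiles.txt') as fh:
    for line in fh:
        # split on the first two colons only: filename, metric, value
        fname, metric, value = line.rstrip('\n').split(':', 2)
        rows.append((fname, metric.strip(), float(value)))

conn.executemany('insert into imported_results_statistics values (?, ?, ?)', rows)
conn.commit()
conn.close()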

If you have a more powerful database toolkit, you can get a condensed result like this:

with cte_allresults as (
select orig_filename
     , max(case when substring(Metric,1,4) = 'Dela' then Value end) as percent_area
     , max(case when substring(Metric,1,4) = 'Bin1' then Value end) as Bin1
     , max(case when substring(Metric,1,4) = 'Bin2' then Value end) as Bin2
     , max(case when substring(Metric,1,4) = 'Bin3' then Value end) as Bin3
     , max(case when substring(Metric,1,4) = 'Bin4' then Value end) as Bin4
  from imported_results_statistics
group by orig_filename
)
select orig_filename as scribe_no /* I assume this is meant to reflect the file name */
     , percent_area
     , Bin1 as LessThan75UM
     , Bin2 as From75To300UM
     , Bin3 as From300To1MM
     , Bin4 as MoreThan1MM
     , case when Bin1 + Bin2 + Bin3 + Bin4 = 0 then 1 else 0 end as perfect /* apply your own specific ranges and rules here */
/*    ...  */
  from cte_allresults
As for perfect/good/fail, you can define those in your Python script.

As someone who has been addicted to SQL and shell scripting for more years than I can remember, and who only learned Python recently, I can assure you this approach will serve you faster. Either way, get the single merged file into your Python script; you can even generate the bash/cmd script as a text variable inside the Python script.
If you are going to do this regularly, run it with Python's subprocess.call. Do you think this approach would work?
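
A minimal sketch of that idea, with the bash one-liner kept as a text variable and run via subprocess (it assumes bash is available, e.g. under WSL; the script file name is illustrative):

import subprocess

merge_script = (
    'ls -1 *_result.txt | while read fname; do '
    'cat "$fname" | while read _line; do echo "$fname:$_line"; done; '
    'done > AllResultsFiles.txt'
)

with open('merge.sh', 'w') as fh:
    fh.write(merge_script + '\n')

subprocess.call(['bash', 'merge.sh'])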

import pandas as pd
import numpy as np
from glob import glob

stock_files = sorted(glob('*result.txt'))

# first, create an empty dataframe with the column names you want
df = pd.DataFrame(columns = ['percent_area', 'less_than_75um', '75_to_300um', '301_to_1mm', 'more_than_1mm', 'perfect', 'good', 'fail'])

# create auxiliary functions to evaluate the quality of the bins
bin_limits = [10, 5, 3, 0]
check_perfect = lambda x: 1 if sum(x) == 0 else 0
check_good = lambda x: 1 if all(np.array(x) <= bin_limits) else 0
check_fail = lambda x: 1 if any(np.array(x) > bin_limits) else 0  # fail if any bin exceeds its limit

# loop through each file
for file in stock_files:
    # read each line, keep the part after the ':' delimiter and convert it to float;
    # this stores the 5 values (percent_area plus the 4 bins) in the 'values' list
    values = (
        pd.read_csv(file, header = None)[0]
        .str.split(':').str[1]
        .astype(float)
        .tolist()
    )

    # check the quality of the bins and store the 3 flags in the 'bins_quality' list
    bins_quality = [check_perfect(values[1:]), check_good(values[1:]), check_fail(values[1:])]

    # add a row to the DataFrame, with the file name as the index and the values
    # above matching the columns defined earlier
    df.loc[file] = values + bins_quality
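
To turn the assembled DataFrame into an actual file on disk, one option is pandas' built-in writers, either a CSV that Excel can open or a sqlite table (the file and table names here are illustrative):

import sqlite3

# a spreadsheet-friendly CSV, with the file names preserved as a column
df.to_csv('all_results.csv', index_label = 'filename')

# ...or push it into a sqlite database table
with sqlite3.connect('results.db') as conn:
    df.to_sql('results', conn, if_exists = 'replace', index_label = 'filename')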

As far as I know there is no module that does this easily, so it's best to do it in a few steps. Do you have any existing code that you've already tried to get this working? I have added the starter modules.