How can I organize information from multiple .txt files into a database in Python?

Tags: python, python-3.x, excel, pandas, dataframe

The link above is to a result file named 19031783_result.txt. Each .txt file contains statistical results that I want to organize into a database.

The database output for all the result files should look like this:

So I have hundreds of result files that need to be merged into one database. The last three columns are based on the defect-count limit for each bin: for example, bin 1's limit is 10, bin 2's limit is 5, bin 3's limit is 3, and bin 4's limit is 0. So perfect means there are no defects at all, good means everything is within spec, and fail means something is above its limit.
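
A minimal sketch of that classification rule, assuming each file yields four bin counts in order (the function name and limits below are illustrative, not part of the original post):

bin_limits = [10, 5, 3, 0]

def classify(bins):
    # perfect: no defects in any bin
    if sum(bins) == 0:
        return 'perfect'
    # fail: at least one bin is above its limit
    if any(count > limit for count, limit in zip(bins, bin_limits)):
        return 'fail'
    # good: there are defects, but every bin is within spec
    return 'good'

print(classify([0, 0, 0, 0]))   # perfect
print(classify([4, 2, 1, 0]))   # good
print(classify([12, 2, 1, 0]))  # fail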

I don't have much experience with Python and need some guidance on how to create this database from the .txt files. Python would be the better tool here because it can handle large amounts of data and is faster.

import os
import pandas as pd
from glob import glob

stock_files = sorted(glob('*result.txt'))

stock_files

df = pd.concat([pd.read_csv(file, sep="\t").assign(filename = file) for file in stock_files], ignore_index = True)

df = pd.DataFrame() #this is the bit I am stuck on
This is my current output, which I need to clean up and convert into a database. I have a screenshot of the Excel spreadsheet:

Start simple - Python may not be your friend for basic data-hoovering and reshaping activities.

I'll demonstrate how to merge all the files into a single file with basic OS commands, in both bash and Windows cmd.

For bash, install WSL and your preferred Linux distribution. Without assuming too much, the bash script is almost a one-liner: read the list of result files, cat each one, prepend the file name to each of its lines, and store everything in a single file, AllResultsFiles.txt.

ls -1 *_result.txt | while read fname; do cat "$fname" | while read _line; do echo "$fname:$_line"; done; done > AllResultsFiles.txt
For Windows it's a little more involved, but the idea is the same. Store this in a file C:\users\me\data\mergers.cmd:
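
A minimal sketch of what mergers.cmd might contain, mirroring the bash one-liner above (the redirection is written first so lines ending in a digit aren't misread as a file-handle redirect):

@echo off
del AllResultsFiles.txt 2>nul
rem prepend each file's name to each of its lines, colon-separated
for %%f in (*_result.txt) do (
    for /f "usebackq delims=" %%l in ("%%f") do >> AllResultsFiles.txt echo %%f:%%l
)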

Then run it as follows:

C:\users\me\data\> mergers.cmd
This creates a single file containing the contents of all the files, with the original file's name as the first column of every line, separated by a colon :.

You can then easily import it, using the colon as the delimiter, into a spreadsheet or into a database table with three columns:

create table imported_results_statistics 
(Orig_filename varchar(100),
Metric varchar(200),
Value int
)
Once it's imported into a database table, you can use SQL to create a new table, pivoting each group of records by file name.

sqlite is simple to use, but it takes a few more steps.
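
If you'd rather stay in Python for the import step, here is a minimal sqlite3 sketch of loading AllResultsFiles.txt into the table above (it assumes each merged line looks like filename:Metric: Value; sqlite's dynamic typing happily stores the decimal percent value despite the int declaration):

import sqlite3

conn = sqlite3.connect('results.db')
conn.execute("""create table if not exists imported_results_statistics
                (Orig_filename varchar(100), Metric varchar(200), Value int)""")

rows = []
with open('AllResultsFiles.txt') as fh:
    for line in fh:
        # split on the first two colons only: filename, metric, value
        fname, metric, value = line.rstrip('\n').split(':', 2)
        rows.append((fname, metric.strip(), float(value)))

conn.executemany('insert into imported_results_statistics values (?, ?, ?)', rows)
conn.commit()
conn.close()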

If you have a more powerful database toolkit, you can get a condensed result like this:

with cte_allresults as (
select orig_filename
     , max(case when substring(Metric,1,4) = 'Dela' then Value end) as percent_area
     , max(case when substring(Metric,1,4) = 'Bin1' then Value end) as Bin1
     , max(case when substring(Metric,1,4) = 'Bin2' then Value end) as Bin2
     , max(case when substring(Metric,1,4) = 'Bin3' then Value end) as Bin3
     , max(case when substring(Metric,1,4) = 'Bin4' then Value end) as Bin4
  from imported_results_statistics
group by orig_filename
)
select orig_filename as scribe_no /* I assume this is meant to reflect the file name */
     , percent_area
     , Bin1 as LessThan75UM
     , Bin2 as From75To300UM
     , Bin3 as From300To1MM
     , Bin4 as MoreThan1MM
     , case when Bin1 + Bin2 + Bin3 + Bin4 = 0 then 1 else 0 end as perfect /* apply your own specific ranges and rules here */
/*    ...  */
  from cte_allresults
As for perfect/good/fail, you can define those in your Python script.

As someone who has been addicted to SQL and shell scripting for more years than I can remember, and who only learned Python recently, I can assure you this approach will serve you faster. Either way, get the single merged file into your Python script; you can even generate the bash/cmd script as a text variable inside the Python script.
If you are going to do this regularly, run it with Python's subprocess.call. Do you think this approach would work?
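
A minimal sketch of that idea, with the bash one-liner kept as a text variable and run via subprocess (it assumes bash is available, e.g. under WSL; the script file name is illustrative):

import subprocess

merge_script = (
    'ls -1 *_result.txt | while read fname; do '
    'cat "$fname" | while read _line; do echo "$fname:$_line"; done; '
    'done > AllResultsFiles.txt'
)

with open('merge.sh', 'w') as fh:
    fh.write(merge_script + '\n')

subprocess.call(['bash', 'merge.sh'])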

import pandas as pd
import numpy as np
from glob import glob

stock_files = sorted(glob('*result.txt'))

# first, create an empty dataframe with the column names you want
df = pd.DataFrame(columns = ['percent_area', 'less_than_75um', '75_to_300um', '301_to_1mm', 'more_than_1mm', 'perfect', 'good', 'fail'])

# create auxiliary functions to evaluate the quality of the bins
bin_limits = [10, 5, 3, 0]
check_perfect = lambda x: 1 if sum(x) == 0 else 0
check_good = lambda x: 1 if all(np.array(x) <= bin_limits) else 0
check_fail = lambda x: 1 if any(np.array(x) > bin_limits) else 0  # fail if any bin exceeds its limit

# loop through each file
for file in stock_files:
    # read each line, keep the part after the ':' delimiter and convert it to float;
    # this stores the 5 values (percent_area plus the 4 bins) in the 'values' list
    values = (
        pd.read_csv(file, header = None)[0]
        .str.split(':').str[1]
        .astype(float)
        .tolist()
    )

    # check the quality of the bins and store the 3 flags in the 'bins_quality' list
    bins_quality = [check_perfect(values[1:]), check_good(values[1:]), check_fail(values[1:])]

    # add a row to the DataFrame, with the file name as the index and the values
    # above matching the columns defined earlier
    df.loc[file] = values + bins_quality
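
To turn the assembled DataFrame into an actual file on disk, one option is pandas' built-in writers, either a CSV that Excel can open or a sqlite table (the file and table names here are illustrative):

import sqlite3

# a spreadsheet-friendly CSV, with the file names preserved as a column
df.to_csv('all_results.csv', index_label = 'filename')

# ...or push it into a sqlite database table
with sqlite3.connect('results.db') as conn:
    df.to_sql('results', conn, if_exists = 'replace', index_label = 'filename')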

As far as I know there is no module that does this easily, so it's best to do it in a few steps. Do you have any existing code that you've already tried to get this working? I have added the starter modules.