How do I organize information from multiple .txt files into a database in Python?
Linked above is a result file named 19031783_result.txt. Each .txt file contains statistical results that I want to organize into a database. The database output for all the result files should look like this:
I have hundreds of result files that need to be merged into one database. The last three columns are the limits on the defect count for each bin: for example, bin 1 has a limit of 10, bin 2 a limit of 5, bin 3 a limit of 3, and bin 4 a limit of 0. So "perfect" means no defects, "good" means the counts are within spec, and "fail" means a count is above its limit.
I don't have much experience with Python, and I need some guidance on how to build this database from the .txt files. Python is the better tool here because it can handle large amounts of data and is faster.
import os
import pandas as pd
from glob import glob
stock_files = sorted(glob('*result.txt'))
stock_files
df = pd.concat([pd.read_csv(file, sep="\t").assign(filename=file) for file in stock_files], ignore_index=True)
# df = pd.DataFrame()  # this is the bit I am stuck on - note this line would overwrite the concat result above
This is my current output, which I need to clean up and convert into a database; I have a screenshot of an excel spreadsheet:
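The perfect/good/fail rule described in the question can be sketched as a small helper; the bin limits come straight from the question, while the function name is just illustrative:

```python
BIN_LIMITS = [10, 5, 3, 0]  # limits for Bin1..Bin4, as stated in the question

def classify(bins):
    """Return 'perfect', 'good' or 'fail' for a list of four bin counts."""
    if sum(bins) == 0:
        return "perfect"          # no defects at all
    if all(count <= limit for count, limit in zip(bins, BIN_LIMITS)):
        return "good"             # within spec on every bin
    return "fail"                 # at least one bin above its limit
```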
Start simple - Python may not be your friend for this kind of basic data hoovering and reshaping.
I will demonstrate, with bash and Windows cmd, how to merge all the files into one using basic OS commands.
For bash - install WSL and your preferred Linux distribution.
Without too many assumptions, the Bash script is almost a one-liner:
read the list of result files - cat each file, inserting its file name at the start of every line - and store it all into a single file, AllResultsFiles.txt:
ls -1 *_result.txt | while read fname; do cat "$fname" | while read _line; do echo "$fname:$_line"; done; done > AllResultsFiles.txt
For Windows it is slightly more involved - but it is the same idea - store it into a file C:\users\me\data\merges.cmd:
Then run it as follows:
C:\users\me\data\> merges.cmd
This creates a single file with the contents of all the files, with the name of the originating file in the first column and a colon : as the delimiter - every line carries its original file name.
It can then easily be imported, using the colon as the delimiter, into a spreadsheet or into a database table with three columns:
create table imported_results_statistics
(Orig_filename varchar(100),
Metric varchar(200),
Value real
)
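As a sketch, the merged AllResultsFiles.txt can be loaded into that table with Python's built-in sqlite3 module. The sample file contents, the in-memory connection, and declaring Value as real (so a fractional percent-area value survives) are all assumptions for the demo:

```python
import sqlite3

# Hedged sketch: load the colon-delimited AllResultsFiles.txt into the
# imported_results_statistics table. The sample contents below are made up.
sample = (
    "19031783_result.txt:Delaminated area: 1.53\n"
    "19031783_result.txt:Bin1: 4\n"
    "19031783_result.txt:Bin2: 2\n"
    "19031783_result.txt:Bin3: 1\n"
    "19031783_result.txt:Bin4: 0\n"
)
with open("AllResultsFiles.txt", "w") as fh:
    fh.write(sample)

conn = sqlite3.connect(":memory:")  # use a file path such as 'results.db' to persist
conn.execute(
    "create table imported_results_statistics"
    " (Orig_filename varchar(100), Metric varchar(200), Value real)"
)
rows = []
with open("AllResultsFiles.txt") as fh:
    for line in fh:
        # each line is <filename>:<Metric>: <value>
        fname, metric, value = line.rstrip("\n").split(":", 2)
        rows.append((fname, metric.strip(), float(value)))
conn.executemany("insert into imported_results_statistics values (?, ?, ?)", rows)
conn.commit()
count, = conn.execute("select count(*) from imported_results_statistics").fetchone()
```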
Once it is imported into a database table, you can use SQL to create a new table - pivoting each group of records according to the file name.
sqlite is simple, but needs a few more steps.
If you have a more capable database toolkit, you can get a condensed result like this:
with cte_allresults as (
select orig_filename
, max(case when substring(Metric,1,4) = 'Dela' then Value end) as percent_area
, max(case when substring(Metric,1,4) in ('Bin1') then Value end) as Bin1
, max(case when substring(Metric,1,4) in ('Bin2') then Value end) as Bin2
, max(case when substring(Metric,1,4) in ('Bin3') then Value end) as Bin3
, max(case when substring(Metric,1,4) in ('Bin4') then Value end) as Bin4
from imported_results_statistics
group by orig_filename
)
select orig_filename as scribe_no /* I assume this is meant to reflect the file name? */
, percent_area
, Bin1 as LessThan75UM , Bin2 as From75To300UM
, Bin3 as From300To1MM , Bin4 as MoreThan1MM
, case when Bin1 + Bin2 + Bin3 + Bin4 = 0 then 1 else 0 end as perfect /* (apply your own specific ranges and rules here) */
/* ... */
from cte_allresults
As for perfect/good/fail, you can define those in the python script.
As someone who has been obsessed with SQL and shell scripting for more years than I can remember - and who only recently learned Python - I can promise this approach will serve you faster. Either way, get that single merged file into your python script -
generate the bash/cmd script as a text variable inside the python script,
and if you are going to do this regularly, run it with python's subprocess.call. Do you think that would work?
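That last suggestion can be sketched like this - the merge script lives in a Python string and is run with subprocess.call. The file names and contents below are invented for the demo, and bash must be available on the machine:

```python
import os
import subprocess
import tempfile

# Hedged sketch: keep the bash merge script as a text variable and run it
# with subprocess.call, as suggested above.
merge_script = (
    'ls -1 *_result.txt | while read fname; do '
    'while read _line; do echo "$fname:$_line"; done < "$fname"; '
    'done > AllResultsFiles.txt'
)

# made-up demo files standing in for the real *_result.txt files
workdir = tempfile.mkdtemp()
for name, body in [("a_result.txt", "Bin1: 4\n"), ("b_result.txt", "Bin1: 0\n")]:
    with open(os.path.join(workdir, name), "w") as fh:
        fh.write(body)

# subprocess.call runs the command and returns its exit status (0 = success)
status = subprocess.call(["bash", "-c", merge_script], cwd=workdir)

with open(os.path.join(workdir, "AllResultsFiles.txt")) as fh:
    merged = fh.read()
```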
import os
import pandas as pd
import numpy as np
from glob import glob
stock_files = sorted(glob('*result.txt'))
# first, create an empty dataframe with the columns names you want
df = pd.DataFrame(columns = ['percent_area', 'less_than_75um', '75_to_300um', '301_to_1mm', 'more_than_1mm', 'perfect', 'good', 'fail'])
# create auxiliary functions to evaluate the quality of the bins
bin_limits = [10, 5, 3, 0]
check_perfect = lambda x: 1 if sum(x) == 0 else 0                  # no defects at all
check_good = lambda x: 1 if all(np.array(x) <= bin_limits) else 0  # every bin within its limit
check_fail = lambda x: 1 if any(np.array(x) > bin_limits) else 0   # any bin above its limit
# loop through each file
for file in stock_files:
    # read each file, splitting every line on the ':' delimiter and converting
    # the value part to float - this gives 5 values: percent_area + the 4 bin counts
    with open(file) as fh:
        values = [float(line.split(':')[1]) for line in fh if ':' in line]
    # check the quality of the bins and store the 3 flags in 'bins_quality'
    bins_quality = [check_perfect(values[1:]), check_good(values[1:]), check_fail(values[1:])]
    # add a row to the DataFrame with the file name as the index; the values
    # line up with the columns defined above
    df.loc[file] = values + bins_quality
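Once the loop has filled the DataFrame, it can be pushed into an actual database with DataFrame.to_sql. A minimal sketch with sqlite3 follows; the table name "results" and the single demo row are assumptions for illustration only:

```python
import sqlite3
import pandas as pd

# Hedged sketch: persist the assembled DataFrame into a sqlite database.
# The demo row below stands in for the rows built by the loop above.
df = pd.DataFrame(
    {"percent_area": [1.53], "less_than_75um": [4], "75_to_300um": [2],
     "301_to_1mm": [1], "more_than_1mm": [0],
     "perfect": [0], "good": [1], "fail": [0]},
    index=["19031783_result.txt"],
)
df.index.name = "scribe_no"

conn = sqlite3.connect(":memory:")  # use a file path to persist on disk
df.to_sql("results", conn)         # the index becomes the scribe_no column
stored = pd.read_sql("select * from results", conn, index_col="scribe_no")
```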
As far as I know there is no module that does this out of the box, so it is best done in a few steps. Do you have any existing code that you have already tried to get this working? I have added the starter modules.