Looping through multiple text files in Python for data extraction

Tags: python, database-design, data-manipulation

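In order to extract data from roughly a million text files, I have been teaching myself Python since October. I have been trying to approach this in small, careful steps so that I am not overwhelmed by trying to get all the code I want at once.

For my first block, I want to extract addresses from the text files. So far I have managed to get my code working on one file at a time, but since I have more than a million files to process, doing this by hand is not going to work.

For brevity I have included only the first part of my code, since the rest is basically the same thing but looking for different keywords.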

######
#Importing/creating modules
######
import os
import re
regex = re.compile('\d+')
######
#Creating N/A text for non-existing entries
######
CV2 = """N/A
N/A
"""


######
#Opening Database file 
######
target = open('C:/project/database.csv', 'a')



####
#Opening the source documents
####
big_file = open('C:/project/0000068100-99-000018.txt', 'r')




####
#Looking for central key
####
for line in big_file:
    if 'CENTRAL INDEX KEY:' in line:
        key = regex.findall(line)


####
#Looking for Conforming Period
####
big_file = open('C:/project/0000068100-99-000018.txt', 'r') 
for line in big_file:
    if 'CONFORMED PERIOD OF REPORT:' in line:
        conform = regex.findall(line)


#####
#Looking for File Type
#####       
big_file = open('C:/project/0000068100-99-000018.txt', 'r') 
for line in big_file:
    if 'CONFORMED SUBMISSION TYPE:' in line:
        type_temp = re.split('\s+',line) 
        type1 = type_temp [3]
        type = type1.split()



#####
#Looking for company name
#####   
big_file = open('C:/project/0000068100-99-000018.txt', 'r')
for line in big_file:
    if 'COMPANY CONFORMED NAME:' in line:
        name_temp = re.split('\:+',line) 
        name1 = name_temp [1]
        name2 = name1.split()
        name3 = ' '.join(name2)
        name = re.split('\::+',name3) 





####
#Looking for Street and Mail Addresses
####

big_file = open('C:/project/0000068100-99-000018.txt', 'r')
f = open('C:/project/address1.txt', 'w+')
for line in big_file:
    if 'STREET 1:' in line:
        f.write(line)
f.close()  # flush the matched lines to disk before reading them back

# Reopen for reading; opening with 'w+' here would truncate the file just written
f = open('C:/project/address1.txt', 'r')
a = open('C:/project/address1a.txt', 'w+')
b = os.path.getsize('C:/project/address1.txt')
########
#If empty, write N/A; if not, clean unnecessary tabs and formatting from each line
########
if b == 0:
    a.write(CV2)        
else:
    for line in f:
        type_temp = re.split('\:+',line) 
        add_temp = type_temp[1]
        add_temp_temp = ''.join(add_temp)
        add = re.split('\t+',add_temp_temp)
        tempo = "".join(add)
        a.write(tempo)
        #print tempo


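a.close()  # flush the cleaned addresses before reading them back
a = open('C:/project/address1a.txt', 'r')  # 'w+' here would truncate the file
lines=a.readlines()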
bus1 = lines[0]
bus2 = re.split('\n+',bus1)
bus3 = bus2 [0]
business1 = re.split('\::+',bus3)
########
#Prepping for inclusion in the Database entry
########
mail1 = lines[1]
mail2 = re.split('\n+',mail1)
mail3 = mail2 [0]
mailad1 = re.split('\::+',mail3)


#####
#Formatting data tags
#####
company_name_temp = "Company Name"
company_name = re.split('\::+',company_name_temp)
bus_street1_temp = "Business Street 1"
bus_street1 = re.split('\::+',bus_street1_temp)
mail_street1_temp = "Mail Street 1"  # label assumed by analogy with "Business Street 1"
mail_street1 = re.split('\::+',mail_street1_temp)




######
#Prepping for database
######
name_temp1 = key + conform + type + company_name + name
co_name = ','.join(name_temp1)

bstreet1_temp = key + conform + type + bus_street1 + business1
bstreet1 = ','.join(bstreet1_temp)

mstreet1_temp = key + conform + type + mail_street1 + mailad1
mstreet1 = ','.join(mstreet1_temp)



######
#Writing to database
######
target.write(co_name)
target.write("\n")
target.write(bstreet1)
target.write("\n")
target.write(mstreet1)
target.write("\n")
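I tried opening the file once at the top and reusing that variable several times, but it did not work. I assume the for loop would look something like this, but I don't know how to make it work: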

for filename in os.listdir('C:/project'):
    bigfile = filename

Thanks

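This is because the file objects returned by open are effectively streams: once a loop has read one to the end, a second loop over the same object finds nothing left, so you can really only pass over it once. What you want to do here is more like this: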

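import os
folder = 'C:/project'
for filename in os.listdir(folder):
    # os.listdir returns bare names, so join each one with the folder
    with open(os.path.join(folder, filename), 'r') as f:
        bigfile = f.read()
    # Now the file contents are saved within bigfile, and you can do as you please,
    #  accessing multiple times.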
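If the files are large this is not recommended, because reading and holding an entire file in memory costs both time and memory. That is exactly why open streams the file's data: you can read as much or as little as you need at a time, rather than being forced to take the whole thing in one chunk.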
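To keep the streaming behaviour and still avoid reopening the file for every keyword, one option is to check each line against all the labels in a single pass. The following is only a sketch under some assumptions: the function names `extract_fields` and `extract_folder` and the dictionary keys are made up for illustration, the label strings are the ones from the question, and it assumes the first `STREET 1:` line is the business address and the second is the mail address.

```python
import os

# Header labels copied from the question; the dictionary keys are illustrative names.
LABELS = {
    'key': 'CENTRAL INDEX KEY:',
    'period': 'CONFORMED PERIOD OF REPORT:',
    'form_type': 'CONFORMED SUBMISSION TYPE:',
    'name': 'COMPANY CONFORMED NAME:',
}


def extract_fields(path):
    """Read one file line by line, once, and collect every labelled field."""
    fields = {name: 'N/A' for name in LABELS}
    streets = []
    with open(path, 'r') as handle:
        for line in handle:
            for name, label in LABELS.items():
                if label in line:
                    # Keep the text after the label, with tabs and spaces collapsed.
                    fields[name] = ' '.join(line.split(label, 1)[1].split())
            if 'STREET 1:' in line:
                streets.append(' '.join(line.split('STREET 1:', 1)[1].split()))
    # Assumption: first STREET 1 is the business address, second is the mail address.
    fields['business_street1'] = streets[0] if len(streets) > 0 else 'N/A'
    fields['mail_street1'] = streets[1] if len(streets) > 1 else 'N/A'
    return fields


def extract_folder(folder):
    """Yield (filename, fields) for every .txt file in the folder."""
    for filename in sorted(os.listdir(folder)):
        if filename.endswith('.txt'):
            yield filename, extract_fields(os.path.join(folder, filename))
```

Calling `extract_folder('C:/project')` would then walk every `.txt` file in that folder, and because each file is opened in a `with` block it is closed again before the next one is opened, which matters when there are a million of them.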


By the way, you should look at the csv module.
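As a rough illustration of what the csv module buys you over joining strings with commas by hand: it quotes any value that itself contains a comma, so a company name such as "EXAMPLE CO, INC" stays in one column. The helper name `write_rows` and all sample values below are invented for the example (only the key is borrowed from the file name in the question), and the row layout simply mirrors the key, period, type, label, value order used in the question.

```python
import csv
import io


def write_rows(stream, key, period, form_type, fields):
    """Write one row per field: key, period, form type, field label, value."""
    writer = csv.writer(stream, lineterminator='\n')
    for label, value in fields:
        writer.writerow([key, period, form_type, label, value])


# In-memory demonstration with made-up values.
# For a real file, pass open('C:/project/database.csv', 'a', newline='') instead.
buffer = io.StringIO()
write_rows(buffer, '0000068100', '19981231', '10-K',
           [('Company Name', 'EXAMPLE CO, INC'),
            ('Business Street 1', '1 MAIN ST')])
print(buffer.getvalue())
```

The first row comes out as `0000068100,19981231,10-K,Company Name,"EXAMPLE CO, INC"`, with the quotes added automatically.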

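Thanks for the help so far, but I think I am missing something. I tried implementing your suggestion (again starting small, just searching for the ID key), but I get nothing under key. I tried the code you suggested, but when I print bigfile I get an empty line, and when I print filename I get a list of files.

Oops, I forgot to reopen my file in each sub-block. It is working now.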
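For anyone hitting the same empty result: it is what an exhausted file object gives you. The small demonstration below uses `io.StringIO` as a stand-in for a file opened with open (an assumption made so that it runs without any files on disk; the sample lines are made up) and shows that `seek(0)` is an alternative to reopening the file in each sub-block.

```python
import io

# io.StringIO stands in for a file opened with open(); both are iterated the same way.
big_file = io.StringIO('CENTRAL INDEX KEY:\t0000068100\nSTREET 1:\t1 MAIN ST\n')

first_pass = [line for line in big_file if 'CENTRAL INDEX KEY:' in line]

# The stream is now at its end, so a second loop sees no lines at all.
second_pass = [line for line in big_file if 'STREET 1:' in line]

big_file.seek(0)  # rewind to the start instead of reopening
third_pass = [line for line in big_file if 'STREET 1:' in line]

print(len(first_pass), len(second_pass), len(third_pass))  # prints: 1 0 1
```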