
Finding and extracting strings from multiple text files in Python


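I'm just learning Python. For work I go through a lot of PDFs, so I found the PDFMiner tool, which can convert a directory of PDFs into text files. I then wrote the code below, which tells me whether a PDF is an approved claim or a denied claim. What I can't figure out is how to find the string that starts with "Tracking Identification Number...", take the 18 characters that follow it, and put them into an array.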

import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():

    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)


iterate()
Any help would be greatly appreciated. This is what the text file looks like:

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------
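Update: Many helpful answers. This is the route I took, and it works quite well, if I do say so myself. It will save a lot of time! Here is my full code for future viewers.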

import csv
import glob
import os

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print('current file is:' + infile)
        check(infile)

def check(filename):
    # Read the file once and reuse the text for every test
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        print("This claim was Approved")

        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]

        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()
# Write each list to its own CSV file, one value per row
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)

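Update: Added the rest of my finished code, which writes the lists to CSV files. From there I can run something like =LEFT() in the spreadsheet, and within minutes I had 1,000 tracking numbers. This is why programming is great.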
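If splitting the combined "tracking - claim" string in the spreadsheet becomes tedious, one alternative is to write the two values as separate CSV columns. The following is only a sketch under assumptions: the function name write_claims, the column headers, and the placeholder claim number are all hypothetical and do not appear in the code above.

```python
import csv
import os
import tempfile

def write_claims(path, rows):
    """Write (tracking_number, claim_number) pairs as two CSV columns."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as output:
        writer = csv.writer(output)
        writer.writerow(["tracking_number", "claim_number"])  # hypothetical headers
        writer.writerows(rows)

# Placeholder values for illustration only
approved = [("1Z000000YW00000000", "00000000000")]

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "Approved.csv")
    write_claims(target, approved)
    with open(target, newline="") as check_file:
        print(check_file.read())
```

With the values in separate columns, no =LEFT() step should be needed on the spreadsheet side.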

I think this solves your problem; you just need to turn it into a function.

import re

string = 'Tracking Identification Number...1Z000000YW00000000'

no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string

# Matches anything after "Tracking Identification Number"
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)

if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")

For details, see the Python documentation for the re module.
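As a rough illustration of "turning it into a function", here is a minimal sketch. The function name extract_tracking_number is hypothetical. Unlike the snippet above, it drops the ^ anchor, because in the sample file the label appears in the middle of a line, and it lets the pattern consume the dots instead of removing them first. The 18-character length is taken from the question.

```python
import re

# Label followed by any number of dots, then exactly 18 letters or digits
TRACKING_RE = re.compile(r"Tracking Identification Number\.*([A-Za-z0-9]{18})")

def extract_tracking_number(text):
    """Return the 18-character tracking number found in text, or None."""
    match = TRACKING_RE.search(text)
    return match.group(1) if match else None

line = ("Shipper Invoice Number..............30057010"
        "Tracking Identification Number...1Z000000YW00000000")
print(extract_tracking_number(line))      # 1Z000000YW00000000
print(extract_tracking_number("no label here"))  # None
```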

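If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to its end, and slice from that point through the next 18 characters.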

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You can also change the append line to
arrayDenied.append(myText+''+myNumber)
or something similar.
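Note that str.index raises ValueError when the label is missing from a file. A slightly more defensive variant of the slicing approach might look like the sketch below; the helper name tracking_number_after_label and the length parameter are assumptions made for illustration.

```python
LABEL = "Tracking Identification Number..."

def tracking_number_after_label(text, length=18):
    """Return the `length` characters that follow LABEL, or None if LABEL is absent."""
    pos = text.find(LABEL)  # find() returns -1 instead of raising ValueError
    if pos == -1:
        return None
    start = pos + len(LABEL)
    return text[start:start + length]

sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n")
print(tracking_number_after_label(sample))            # 1Z000000YW00000000
print(tracking_number_after_label("Dear Customer:"))  # None
```

Returning None leaves it to the caller to decide whether a missing label should be skipped or reported.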

Regular expressions are the way to go for this task. Here is one way to modify your code to search for the pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    # Open the file once and reuse its contents for every test
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print("This claim was Denied")
        print(isDenied)
        # Search the text that was just read
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            # The match still includes the leading dots, so strip them
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print(isDenied)
    else:
        print("I don't know if this is approved or denied")
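If a text file can hold more than one record, a capture group with re.findall returns the numbers directly, without stripping dots afterwards. This is a sketch only: the name find_tracking_numbers, the second tracking number, and the trailing "...extra" text are made up for illustration.

```python
import re

# One or more dots after the label, then exactly 18 letters or digits (captured)
TRACKING_PATTERN = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

def find_tracking_numbers(text):
    """Return every tracking number in text, in order of appearance."""
    return TRACKING_PATTERN.findall(text)

two_pages = (
    "Tracking Identification Number...1Z000000YW00000000\n"
    "----------------Page (1) Break----------------\n"
    "Tracking Identification Number...1Z000000YW00000001...extra\n"
)
print(find_tracking_numbers(two_pages))
# ['1Z000000YW00000000', '1Z000000YW00000001']
```

Because the quantifier {18} is fixed, anything that follows the number on the same line is not included in the result.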
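Are the dots actually present in the file? Is the tracking number always 18 characters starting with 1Z?

Yes. I have 1,000 PDFs to process, and normally I would copy and paste them into an Excel sheet, so I'm trying to automate this painful process. The approval PDFs are a bit different, but yes, basically they are all structured the same way.

The syntax was off. Please take a look at my answer and let me know if it solves the problem.

@Bluestreak22 You should also generally avoid opening files manually, e.g. open(filename).read(). You can open the file once using with open(), then do the checks and all other operations inside it. I covered this in my answer.

What if there is extra content after the tracking number, such as s = "Tracking Identification Number...1Z000000YW00000000...extra content"?

@pault The file he showed has a newline at the end of that number, so it should stop there.

Not gonna lie, whenever I see something like "(?:(\.+))[A-Z-a-z0-9]{18}" I get the heebie-jeebies and it looks like gibberish to me, haha. I'll try this answer and the others, just to know two ways of doing something.

@Bluestreak22 I'm by no means a regex expert, but I've found online regex testers very useful for testing patterns. Paste your text into one, select your programming language, and try building your own pattern.

I just took a swing at this one. I got a traceback error saying that Tracking Identification Number... is not in list. I think that's because it isn't reading the text file correctly, or maybe because the original text file has no spaces and the string is all bunched together?

Actually, all I did was remove .splitlines() and it worked :)

@Bluestreak22 Oh, great, that's right! Glad it worked! :) Edited the answer accordingly.

Can you explain what the index of a string is? To me, an index is a position in an array, but as far as I know a string or a text file isn't an array?

The index is the starting position of the substring within the string. Say your string is myText = "helloabc1234hello"; then start = myText.index("abc") gives you 5, because abc starts at index 5 of myText. Then you add the length of abc to reach its end, which is index 8. That is where the 1234 you're interested in begins, so with start moved to that position, myText[start:start+4] gets you those 4 characters.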
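The arithmetic in that last comment can be checked directly. The short sketch below reuses the comment's own example string and only adds print calls to show each intermediate value.

```python
myText = "helloabc1234hello"

start = myText.index("abc")     # position where "abc" begins
print(start)                    # 5
start += len("abc")             # move past "abc" to the next character
print(start)                    # 8
print(myText[start:start + 4])  # 1234
```

Strings in Python are sequences, so they support the same indexing and slicing syntax as lists.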