Help parsing a text file in Python

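I have been struggling with this for a while now. I have many text files in a specific format, and I need to extract all the data from them and file it into different fields of a database. The struggle is tweaking the parsing parameters so that I get all the information correctly.

The format is shown below: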

WHITESPACE HERE of unknown length.
K       PA   DETAILS 
2 4565434   i need this sentace as one DB record
2 4456788   and this one 
5 4879870   as well as this one, content will vary! 

X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.

The hardest parts are 1) getting rid of the whitespace and 2) delimiting the fields from each other. See my best attempt below:

dict = {}
XX = (open("XX.txt", "r")).readlines()

for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))

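At the moment I can pull the K and PA fields with no problem, but my details field only picks up a single word. I need the whole sentence, or at least 25 characters.

Thanks very much for reading, I hope you can help! :)

If I understand your file format correctly, you can try this script: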

filename = 'bug.txt'

foundHeaders = False
records = []

with open(filename, 'r') as f:
    for rawline in f:
        line = rawline.strip()

        if not foundHeaders:
            # Skip everything up to and including the header row
            tokens = line.split()
            if tokens == ['K', 'PA', 'DETAILS']:
                foundHeaders = True
            continue

        # Split into at most three fields so DETAILS keeps its spaces
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break

        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break

        records.append((K, PA, tokens[2]))

for r in records:
    print(r)  # replace this by your DB insertion code

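This starts reading records once it encounters the header line, and stops as soon as a line no longer has the (K, PA, description) format.

Hope this helps.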
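As a rough sketch of the same header-then-records logic, here it is wrapped in a generator and run on the sample from the question. The function name `iter_records` and the in-memory `sample` list are made up for illustration; they are not part of the answer above.

```python
def iter_records(lines):
    """Yield (k, pa, details) tuples found after the 'K PA DETAILS' header."""
    found_header = False
    for raw in lines:
        if not found_header:
            # Skip everything until the header row appears.
            found_header = raw.split() == ["K", "PA", "DETAILS"]
            continue
        tokens = raw.split(None, 2)
        # Stop at the first line that is not "<digits> <digits> <text>".
        if len(tokens) != 3 or not (tokens[0].isdigit() and tokens[1].isdigit()):
            break
        yield int(tokens[0]), int(tokens[1]), tokens[2].strip()


# Hypothetical stand-in for the lines of one input file.
sample = [
    "    ",
    "K       PA   DETAILS",
    "2 4565434   i need this sentace as one DB record",
    "2 4456788   and this one",
    "5 4879870   as well as this one, content will vary!",
    "",
    "X Max - there sometimes is a line beginning with 'Max' here",
    "There is a Line here that i do not need!",
]

records = list(iter_records(sample))
for record in records:
    print(record)
```

Because it is a generator, the same function could be fed an open file object directly instead of a list.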

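You are splitting the whole sentence into individual words. You need to split the line into the first word, the second word, and everything else. Like

line.split(None, 2)

It would probably help to use a regular expression. Use the opposite logic: if the line starts with a digit from 1 to 5, use it, otherwise skip it. Like: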

import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
with open('XX.txt', 'r') as f:  # No calling readlines; lazy iteration is better
    for line in f:
        m = pattern.match(line)
        if m:
            cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                (m.group(2), m.group(1), m.group(3)))

Oh, and of course you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
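One way this could look, assuming the database is SQLite accessed through the standard `sqlite3` module (the thread never says which database `cu` belongs to, so the in-memory connection and the column types below are assumptions). The `?` placeholders keep the SQL text constant, and `executemany` reuses that one statement for every row:

```python
import re
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")
cu = conn.cursor()
cu.execute("CREATE TABLE bugInfo2 (pa INTEGER, k INTEGER, details TEXT)")

pattern = re.compile(r"([1-5])\s+(\d+)\s+(.*\S)")
lines = [
    "K       PA   DETAILS",
    "2 4565434   i need this sentace as one DB record",
    "2 4456788   and this one ",
    "X Max - not a record",
]

# Collect (pa, k, details) tuples, then insert them with one parameterized statement.
rows = [(int(m.group(2)), int(m.group(1)), m.group(3))
        for m in map(pattern.match, lines) if m]
cu.executemany("INSERT INTO bugInfo2 (pa, k, details) VALUES (?, ?, ?)", rows)
conn.commit()

stored = cu.execute("SELECT k, pa, details FROM bugInfo2 ORDER BY pa").fetchall()
print(stored)
```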

Here is my attempt using re:

import re

stuff = open("source", "r").readlines()

whitey = re.compile(r"^[\s]+$")                 # whitespace-only lines
header = re.compile(r"K       PA   DETAILS")    # the column header row
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))

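Using re, I can extract the data and manipulate it however I like. If you only need the juicy info lines, you can actually drop all the other checks, which makes this a really concise script: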

import re

stuff = open("source", "r").readlines()

# Create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with the data here, just use the tags we declared earlier
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
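A short sketch of how the named groups can be pulled out in one go with `groupdict()`, using the same pattern on one sample line. One thing it illustrates: because the pattern uses a single `\s` between fields, any extra padding ends up at the start of the `third` group, so a `strip()` is probably wanted before storing it.

```python
import re

juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

line = "2 4565434   i need this sentace as one DB record"
fields = juicy_info.search(line).groupdict()

# The single \s consumed only one of the three spaces after the PA number.
print(repr(fields["third"]))

details = fields["third"].strip()
print(fields["first"], fields["second"], details)
```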

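I prefer to use [ \t] rather than \s, because \s matches all of the following characters:
space, '\f', '\n', '\r', '\t', '\v'
I see no reason to use a symbol that stands for more than what needs to be matched, and there is a risk of matching stray newlines in places where they should not be matched.

Edit: doing this is enough: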
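To make the newline risk concrete, here is a small sketch comparing the two character classes under `re.MULTILINE`. The input text is a made-up malformed fragment, not taken from the question's files:

```python
import re

# Hypothetical broken input: a lone "2" on one line, then an unrelated line.
text = "2\n4565434 orphaned line\n"

loose = re.compile(r"^([1-5])\s+(\d+)\s+(.+)$", re.MULTILINE)
strict = re.compile(r"^([1-5])[ \t]+(\d+)[ \t]+(.+)$", re.MULTILINE)

# \s+ happily matches the newline and glues the two lines into one "record".
loose_hits = loose.findall(text)

# [ \t]+ cannot cross the line break, so nothing matches.
strict_hits = strict.findall(text)

print(loose_hits)
print(strict_hits)
```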

import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)

with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
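As a quick check of that pattern, the sketch below runs it over the sample from the question held in a string (the `text` variable is only a stand-in for the file contents) and collects the tuples that would be passed to the database:

```python
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)

# Stand-in for f.read() on one of the input files.
text = "\n".join([
    "    ",
    "K       PA   DETAILS ",
    "2 4565434   i need this sentace as one DB record",
    "2 4456788   and this one ",
    "5 4879870   as well as this one, content will vary! ",
    "",
    "X Max - there sometimes is a line beginning with 'Max' here",
    "There is a Line here that i do not need!",
    "    ",
])

# group(2, 1, 3) returns a (pa, k, details) tuple, ready for the INSERT.
rows = [mat.group(2, 1, 3) for mat in reg.finditer(text)]
for row in rows:
    print(row)
```

The lazy `+?` followed by `[ \t]*$` is what trims the trailing padding from the details field.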

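word == "1" or word == "2" or word == "3" or word == "4" or word == "5"

can be rewritten as

word in ['1', '2', '3', '4', '5']

Basically, you have a line from which you want to extract the first two numbers. You can either use a regular expression match, or simply put the first item of the split into tmp_K, the second into tmp_PA, and join the rest into tmp_DETAILS. (By the way, I think you will run into trouble with a line like

4467493…

) Doh, people ninja'd me while I was writing my script.

I don't think the OP means there is a fixed number of entries, so your original solution was wrong. The updated one should work.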
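The split-and-join suggestion, and the kind of trouble a leading 4 can cause, might be sketched like this. Both function names and the input line are hypothetical; `parse_by_guessing` only mimics the classification loop from the question.

```python
def parse_by_guessing(line):
    """Mimic the question's loop: classify each word by what it looks like."""
    k = pa = details = None
    for word in line.split():
        if word.startswith("4"):
            pa = word
        elif word in ["1", "2", "3", "4", "5"]:
            k = word
        else:
            details = word  # overwritten on every word, so only the last survives
    return k, pa, details


def parse_by_position(line):
    """Take the fields by position and join the remainder back together."""
    k, pa, *rest = line.split()
    return k, pa, " ".join(rest)


# Hypothetical record whose K value is 4.
line = "4 4879870 as well as this"

guessed = parse_by_guessing(line)
positional = parse_by_position(line)
print(guessed)
print(positional)
```

With a K of 4, the guessing version treats the K value as a PA candidate and never sets `k`, while the positional version is unaffected.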
