Python: importing a large amount of data using regular expressions

python regex

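I need to import a large number of building codes from a text file into a SQL database. So far I have written the code below, which successfully returns the code number and title. How can I also match the text that follows the code title, up to the start of the next code?

Test.txt:

101.1 Title. This is an example code.

101.1.2 Local Fees. The local jurisdiction may charge fees for building permit violations per Section 300.1.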

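import re

# Read the whole file; the with-block closes it afterwards
with open(r'C:\Test.txt', 'r') as f:
    text = f.read()

# Group 1: code number; group 2: title up to and including its period
codes = re.findall(r'(\d{3,4}.[\d.]+?){1}\s([\w\s]+[.]){1}', text)
for code in codes:
    print(code[0], code[1])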

This results in:

101.1 Title.

101.1.2 Local Fees.

I would like code[3] to print "This is an example code."

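It looks to me like the title of each record is the text between the last digit of the code number (scanning the string from left to right) and the first period that is neither preceded nor followed by a digit.

Using this rule, you can locate, in each line, the period that separates the two parts of the record and split on it. Then either cache the two strings as a pair or insert them directly into the database.

If your file contains only the building codes you want to extract and no other extraneous data, I would suggest dropping the regular expression and using this approach.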
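The rule described above can be sketched with plain string operations. This is only an illustration, not the answerer's code: the function name split_record is hypothetical, and it assumes one record per line with a single space after the code number.

```python
def split_record(line):
    """Split '101.1 Title. Body text' into (number, title, body)."""
    # Assumption: the code number is everything before the first space.
    number, _, rest = line.strip().partition(" ")
    for i, ch in enumerate(rest):
        if ch != ".":
            continue
        before = rest[i - 1] if i > 0 else ""
        after = rest[i + 1] if i + 1 < len(rest) else ""
        # The separating period has no digit on either side,
        # so periods inside numbers such as 300.1 are skipped.
        if not before.isdigit() and not after.isdigit():
            return number, rest[:i].strip(), rest[i + 1:].strip()
    # No separating period found: treat the remainder as the title.
    return number, rest.strip(), ""


print(split_record("101.1 Title. This is an example code."))
# ('101.1', 'Title', 'This is an example code.')
```

A record whose body wraps over several lines would need the lines joined first, which this sketch does not attempt.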

Use re.split instead of re.findall. In your case:

>>> re.split(r'(\d{3,4}.[\d.]+?){1}\s([\w\s]+[.]){1}', text)

['', '101.1', 'Title.', ' This is an example code.\n\n', '101.1.2', 'Local Fees.', ' The local jurisdiction may charge fees for building permit violations per Section 300.1.\n']
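Because the pattern has two capture groups, re.split returns a flat list: the text before the first match, followed by number, title, and body repeated for each record. A minimal sketch of regrouping that list into triples follows. The helper name split_codes is hypothetical, and the pattern is the asker's with the first dot escaped and the {1} quantifiers dropped, which is an assumption about the intent.

```python
import re

# Asker's pattern, with the literal dot escaped (assumed intent).
PATTERN = re.compile(r'(\d{3,4}\.[\d.]+?)\s([\w\s]+[.])')


def split_codes(text):
    """Return a list of (number, title, body) tuples."""
    parts = PATTERN.split(text)
    # parts[0] is the text before the first match; discard it.
    fields = parts[1:]
    return [
        (fields[i], fields[i + 1], fields[i + 2].strip())
        for i in range(0, len(fields) - 2, 3)
    ]


sample = (
    "101.1 Title. This is an example code.\n\n"
    "101.1.2 Local Fees. The local jurisdiction may charge fees "
    "for building permit violations per Section 300.1.\n"
)
for number, title, body in split_codes(sample):
    print(number, title, body)
```

Note that a body containing a section reference followed by whitespace and further words ending in a period could be mistaken for a new record, so the pattern may need anchoring to the start of a line for real data.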

Note that {1} is useless: a group already matches exactly once by default.

I would use the following code (untested):

import sys
import re

SECTION = r'\d{3,4}\.[\d.]*'
LABEL = r'[^.\r\n]*'
TITLE = re.compile(r'^({section})\s*({label})\W*'.format(section=SECTION, label=LABEL), re.MULTILINE)

def process(fname):
    with open(fname, 'r') as inf:
        txt = inf.read()
    res = TITLE.split(txt)
    it = iter(res)
    next(it)          # discard leading text (works on Python 2.6+ and 3)
    return list(zip(it, it, it))  # group into (section, label, body) triples

def main():
    args = sys.argv[1:] or ['c:/test.txt']
    for fname in args:
        res = process(fname)
        # do something with res

if __name__=="__main__":
    main()

Run on test.txt, this returns

[
    ('101.1', 'Title', 'This is an example code.\n\n'),
    ('101.1.2', 'Local Fees', 'The local jurisdiction may charge fees for building permit violations per Section 300.1.\n\n')
]
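Since the stated goal is to load the codes into a SQL database, triples like the ones above could be inserted in one batch. The question does not say which database is used, so this sketch assumes SQLite through the standard sqlite3 module; the table name building_codes, its column names, and the function name store_codes are all hypothetical.

```python
import sqlite3


def store_codes(conn, records):
    """Insert (section, title, body) triples into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS building_codes "
        "(section TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    # Parameterised statements avoid quoting problems in the code text.
    conn.executemany(
        "INSERT OR REPLACE INTO building_codes (section, title, body) "
        "VALUES (?, ?, ?)",
        [(section, title, body.strip()) for section, title, body in records],
    )
    conn.commit()


records = [
    ("101.1", "Title", "This is an example code.\n\n"),
    ("101.1.2", "Local Fees",
     "The local jurisdiction may charge fees for building permit "
     "violations per Section 300.1.\n\n"),
]
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration
store_codes(conn, records)
for row in conn.execute("SELECT section, title FROM building_codes ORDER BY section"):
    print(row)
```

For another database engine the same executemany pattern generally applies, although the placeholder style and the upsert syntax differ between drivers.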
