A simple way to extract values from a file using regex (Python)

python python-3.x regex


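Hi all, I'm currently learning regular expressions in Python and set myself a simple exercise. I have a file full of data lines, and I want to extract one value from every line that contains "outer".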

file.txt

ABC 134234ed6  outer +
deE  325353ed5 out +
ABC 133234ed0 outer +
deE  325353ed5 out +
ABC 135234ed0 outer +
deE 125353ed5  out +
ABC 455234ed0  outer +
deE 125353ed5  out +

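Here I only need the digit that follows "ed" on every line containing "outer" (6, 0, 0, 0). My code currently works, but I'd like to know whether there is a simpler way to do this using only the regex.

Here is my code: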

main.py

import re

with open('file.txt') as f:
    lines = f.readlines()

regex = re.compile(r'\d+ +(outer) \+$')
results = []

for line in lines:
    match = regex.search(line)
    if match:
        result = match.group()
        results.append(int(result.split(' ')[0]))  # this

print(results)

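It prints what I want: [6, 0, 0, 0]. However, the logic involves splitting the string and then taking the first item (the line marked "# this"). I believe this could be expressed directly in the regex, so that group() returns the value without any extra work.

I know similar questions already exist, but I think mine is specific enough. I just need help simplifying the logic. Thanks.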

You can refactor your code and remove the redundant split after the regex match:

import re

with open('file.txt') as f:
    lines = f.readlines()

# Capture the digits in group 1; " +" allows one or more spaces before "outer".
reg = re.compile(r'(\d+) +outer \+$')
results = []

for line in lines:
    m = reg.search(line)
    if m:
        results.append(int(m.group(1)))

print(results)

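Output:

[6, 0, 0, 0]

RegEx details:

`(\d+) +`: match 1+ digits and capture them in group #1, followed by 1+ spaces. You are only interested in grabbing this value, hence the capture group.
`outer \+`: match `outer` followed by a space and a `+` character
`$`: match the end of the line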
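To try the pattern without creating file.txt, here is a minimal sketch that runs it over the question's sample lines held in a string. The names `SAMPLE`, `PATTERN` and `extract_outer_digits` are made up for this illustration and are not part of the original answer.

```python
import re

# Sample data copied from file.txt in the question (spacing kept as shown).
SAMPLE = """\
ABC 134234ed6  outer +
deE  325353ed5 out +
ABC 133234ed0 outer +
deE  325353ed5 out +
ABC 135234ed0 outer +
deE 125353ed5  out +
ABC 455234ed0  outer +
deE 125353ed5  out +
"""

# Group 1 holds the digits that sit right before the spaces preceding "outer +".
PATTERN = re.compile(r'(\d+) +outer \+$')


def extract_outer_digits(lines):
    """Return the captured digits, as ints, for every line that matches."""
    matches = (PATTERN.search(line) for line in lines)
    return [int(m.group(1)) for m in matches if m]


print(extract_outer_digits(SAMPLE.splitlines()))  # [6, 0, 0, 0]
```

Because the digits are already isolated in group 1, no split() is needed after the match.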


Case 1: "outer", if present, must follow "edX"

In this case you can match the string with the regular expression

r'(?<=ed)\d+(?=.*\bouter\b)'

Python's regex engine performs the following operations:

(?<=ed)         : positive lookbehind asserts that current position
                  is immediately preceded by 'ed'
\d+             : match 1+ digits
(?=.*\bouter\b) : positive lookahead asserts that current match is
                  followed by 0+ characters other than line terminators,
                  followed by 'outer' with word boundaries

Case 2: "outer", if present, may come before or after "edX"

In this case you can match the string with the regular expression

r'^(?=.*\bouter\b).*ed(\d+)'

If there is a match, the digits following "ed" will be in capture group 1.

Python's regex engine performs the following operations:

^               : assert beginning of string
(?=.*\bouter\b) : positive lookahead asserts that the string
                  contains 'outer' with word boundaries
.*ed            : match 0+ characters other than line terminators,
                  followed by 'ed'
(\d+)           : match 1+ digits in capture group 1

The word boundaries (\b) are there to avoid matching words such as "router" and "accouterment".
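The difference between the two lookaround patterns is easiest to see on a few test lines. This is a small sketch; the helper names `case1` and `case2` and the second and third test lines are invented for illustration.

```python
import re

# Case 1: "outer" must come after the "edX" token.
after_only = re.compile(r'(?<=ed)\d+(?=.*\bouter\b)')
# Case 2: "outer" may appear anywhere on the line.
anywhere = re.compile(r'^(?=.*\bouter\b).*ed(\d+)')


def case1(line):
    """Return the whole match (the digits) or None."""
    m = after_only.search(line)
    return m.group() if m else None


def case2(line):
    """Return capture group 1 (the digits) or None."""
    m = anywhere.search(line)
    return m.group(1) if m else None


tests = [
    'ABC 134234ed6  outer +',   # "outer" after ed6
    'outer ABC 134234ed6 +',    # "outer" before ed6 (made-up line)
    'ABC 134234ed6  router +',  # only "router", no standalone "outer" (made-up line)
]

for line in tests:
    print(repr(line), case1(line), case2(line))
```

The first line matches both patterns, the second only the Case 2 pattern, and the third neither, because `\b` rejects the "outer" inside "router".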
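The basic point is that you should put grouping parentheses around the part of the regexp that you are interested in extracting. The simplest fix is to put the () around the \d+ rather than around outer, so that you can use match.group(1) (see anubhava's answer). Beyond that, since you are already reading the whole file into memory, there is no need to save memory by processing one line at a time. You can read the file in as a single string and then use re.finditer, which helps to simplify the code. For example: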
import re

with open('file.txt') as f:
    text = f.read()

# \d+ (rather than \d) so that multi-digit values are captured whole.
regex = re.compile(r'(\d+) +outer \+\n')

results = [int(match.group(1)) for match in regex.finditer(text)]

print(results)

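This gives:
[6, 0, 0, 0]

Note that the regular expression now has \n (newline) in place of the $ in the original regex: the outer \+ must be followed by a newline. As a consequence, the last line of the file is matched only if it ends with a newline.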
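An alternative to a literal \n, not part of the original answer, is to keep $ and compile with the re.MULTILINE flag, which makes $ match at the end of every line as well as at the end of the string. The sketch below uses a made-up in-memory string in place of f.read(), deliberately without a final newline, to show the difference.

```python
import re

# Stand-in for f.read(); note that the last line has no trailing newline.
text = "ABC 134234ed6  outer +\ndeE  325353ed5 out +\nABC 133234ed0 outer +"

# With re.MULTILINE, $ matches before each newline and at the end of the string.
multiline = re.compile(r'(\d+) +outer \+$', re.MULTILINE)
# The literal-newline variant needs an actual "\n" after the "+".
newline = re.compile(r'(\d+) +outer \+\n')

# With a single capture group, findall returns that group for each match.
results = [int(d) for d in multiline.findall(text)]
newline_results = [int(d) for d in newline.findall(text)]

print(results)          # [6, 0]
print(newline_results)  # [6]
```

The \n variant misses the final line here, while the MULTILINE variant picks it up.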

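Addendum

To answer the question of what to do if the file is really large: just as you cannot use f.readlines() when the file exceeds available memory, you cannot use f.read() either. Your best approach is probably the following (similar to anubhava's answer, but avoiding readlines). Note that the basic point about using a capture group in the regex remains the same.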
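import re

results = []
# Bind the compiled pattern's search method once; \d+ captures multi-digit values whole.
matcher = re.compile(r'(\d+) +outer \+$').search
with open('file.txt') as f:
    # Iterating over the file object reads one line at a time.
    for line in f:
        match = matcher(line)
        if match:
            results.append(int(match.group(1)))

print(results)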
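If even the list of results could grow large, the same loop can be written as a generator so that values are produced one at a time. This is only a sketch: `iter_outer_digits` is a hypothetical helper name, and io.StringIO stands in for an open file so the example runs without file.txt.

```python
import io
import re

PATTERN = re.compile(r'(\d+) +outer \+$')


def iter_outer_digits(lines):
    """Yield the captured number for each matching line, one line at a time."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield int(match.group(1))


# io.StringIO behaves like an open text file when iterated line by line.
fake_file = io.StringIO(
    "ABC 134234ed6  outer +\n"
    "deE 125353ed5  out +\n"
    "ABC 455234ed0  outer +\n"
)
print(list(iter_outer_digits(fake_file)))  # [6, 0]
```

With a real file, the call would be made inside the `with open('file.txt') as f:` block, passing `f` as the argument.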
