Python regex doesn't extract the whole id from a log file?

python regex

I have the following input in a log file. I want to capture the whole ID, but the regex returns only part of it:

id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750

I am using the following regex in Python 2.7:

I have edited sid to id:
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9._+]*))', re.U)

Edit: this is how I read the log file:

with open(cfg.log_file) as input_file: ...
     fields = line.strip().split(' ')

And a sample line from the log:

2015-11-30T23:58:13.760950+00:00 calxxx enexxxxce[10476]: INFO consume_essor: user:<<"ailxxxied">> callee_num:<<"+144442567413">> id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0 result:ok provider:sipovvvv1.yv.vs

I would appreciate any help fixing the regex.

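Three things need fixing:

- Match `id` rather than `sid`.
- Use `\d` rather than `0-9` so that, with `re.U`, non-ASCII digits are matched too.
- There is no need for an extra capturing group inside the `sid` named group.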

Fixed version:

id:(<<")?(?P<sid>[A-Za-z\d_.+]+)
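To illustrate the second point, here is a minimal sketch comparing the two character classes on one of the IDs from the question. It assumes Python 3, where `str` patterns are Unicode-aware by default (in Python 2.7 the same behaviour needs a `unicode` string and `re.U`). The names `ASCII_ONLY` and `UNICODE_DIGITS` are made up for this example.

```python
import re

# Assumption: Python 3, so str patterns treat \d as "any Unicode decimal digit".
ASCII_ONLY = re.compile(r'id:(<<")?(?P<sid>[A-Za-z0-9._+]+)')
UNICODE_DIGITS = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]+)')

# Fragment of the sample log line from the question.
line = 'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">> credits:0.0'

# [0-9] stops at the first Arabic-Indic digit, so only the Latin prefix is captured.
partial = ASCII_ONLY.search(line).group('sid')

# \d also accepts the Arabic-Indic digits, so the whole ID is captured.
full = UNICODE_DIGITS.search(line).group('sid')

print(partial)  # A2uhasan30hamwix
print(full)
```

The only difference between the two patterns is `0-9` versus `\d`, which is what decides whether the trailing digits are included.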

Posting the solution based on what we discussed in chat:
import codecs
import re

# With re.U, \d also matches non-ASCII digits such as the Arabic-Indic ones
RE_SID = re.compile(ur'id:(<<")?(?P<sid>[A-Za-z\d._+]*)', re.U)

# Read the file as UTF-8 so that every line is a unicode object (Python 2.7)
with codecs.open(cfg.log_file, encoding='utf-8') as input_file:
    for line in input_file:
        fields = line.strip().split(u' ')  # u prefix is important!
        if len(fields) >= 11:
            # ......
            match = RE_SID.search(fields[7])  # check that there is a match first
            if match:
                sid = match.group('sid')
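The reason decoding matters can be sketched as follows. This example assumes Python 3, where the distinction shows up as bytes patterns versus `str` patterns. In Python 2.7 the corresponding distinction is `str` versus `unicode`. The names `raw`, `BYTES_RE` and `TEXT_RE` are hypothetical, and the sample value is taken from the log line in the question.

```python
import re

# The ID as it would sit on disk: UTF-8 bytes, two bytes per Arabic-Indic digit.
raw = 'id:<<"A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧">>'.encode('utf-8')

# Bytes pattern: \d only matches ASCII 0-9, so the match stops before the
# first multi-byte digit.
BYTES_RE = re.compile(rb'id:(<<")?(?P<sid>[A-Za-z\d._+]*)')

# Text pattern: \d matches any Unicode decimal digit.
TEXT_RE = re.compile(r'id:(<<")?(?P<sid>[A-Za-z\d._+]*)')

from_bytes = BYTES_RE.search(raw).group('sid')
from_text = TEXT_RE.search(raw.decode('utf-8')).group('sid')

print(from_bytes)      # b'A2uhasan30hamwix'
print(len(from_text))  # 31 (16 Latin characters + 15 Arabic-Indic digits)
```

Matching against undecoded bytes gives the truncated ID described in the question, while decoding first lets the same pattern capture all of it.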

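Do you want the regex to capture the Arabic digits as well? Please try. Note that there is no sid: in your input.
@WiktorStribiżew I have changed it to RE_SID = re.compile(r'id:(...
What version of Python is this? How do you obtain the input string? Please post all relevant details in the question. You also need to use u next to the r prefix.
@Wiktor Stribiżew It is exactly as I have shown above: only the Latin letters. The Python version is 2.7. I have added ur as the prefix and nothing changes!
I will.
Updated with more!
I have changed it to what you wrote there, with the same result and no Arabic digits: RE_SID = re.compile(r'id:(...
I want to add that I can use with codecs.open(cfg.log_file, encoding='utf-8') as input_file: and it works perfectly too. Interestingly, in that case I do not even need to write the u in ...split(u' '), and fields = line.strip().split(" ") works. @Wiktor Stribiżew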
string = '''
id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ 
id:A2uhasan30hamwix160212145302428 
id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ 
id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ 
id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ 
id:A2uhasan30hamwix160207145023750
'''
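import re
reObj = re.compile(r'id:.*')
# findall() on a compiled pattern takes (string, pos, endpos), not flags.
# Passing re.DOTALL (16) here would start the search at index 16 and skip the first id.
ans = reObj.findall(string)

print(ans)

Output:

['id:A2uhasan30hamwix١٦٠٢٢٧١٣٣٣١١٣٥٤ ', 
 'id:A2uhasan30hamwix160212145302428 ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٣٠١٥٠٠١١ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٩١٦٤٧٣٩٧٣٢ ', 
 'id:A2uhasan30hamwix١٦٠٢٠٨١٩٢٨٠١٩٠٧ ', 
 'id:A2uhasan30hamwix160207145023750']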
