Python: extracting information from a string


The following information is given as a string:

[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2 coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6 coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7 coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8 coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9 coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10 coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11 coord="527,29,885,18":]475[:/T102684-11:]

How can I extract the tab ID (here: T102684), the token ID (the number after the "-"), the coordinates (107,20,885,18), and the token itself ("27")? I tried a simple find-based approach, but it does not work:

for tok in ele.text.split():
        print tok.find("[:T")
        print tok.rfind(":]")
        print tok[(tok.find("[:T")+2):tok.rfind("-")]

Thanks for your help.

You can use a regular expression:

>>> import re
>>> s = '[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2 coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6 coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7 coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8 coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9 coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10 coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11 coord="527,29,885,18":]475[:/T102684-11:]'
>>> r = re.compile(r'''\[:/?T(?P<token_id>\d+)-(?P<id>\d+)\s+coord="
                    (?P<coord>(\d+,\d+,\d+,\d+))":\](?P<token>\w+)''', flags=re.VERBOSE)
>>> for m in r.finditer(s):
        print m.groupdict()


{'token_id': '102684', 'token': '27', 'id': '1', 'coord': '107,20,885,18'}
{'token_id': '102684', 'token': 'A', 'id': '2', 'coord': '140,16,885,18'}
{'token_id': '102684', 'token': 'Francke', 'id': '3', 'coord': '162,57,885,18'}
{'token_id': '102684', 'token': 'Die', 'id': '5', 'coord': '240,27,885,18'}
{'token_id': '102684', 'token': 'alpine', 'id': '6', 'coord': '274,42,885,18'}
{'token_id': '102684', 'token': 'Literatur', 'id': '7', 'coord': '325,64,885,18'}
{'token_id': '102684', 'token': 'des', 'id': '8', 'coord': '398,25,885,18'}
{'token_id': '102684', 'token': 'Jahres', 'id': '9', 'coord': '427,46,885,18'}
{'token_id': '102684', 'token': '1888', 'id': '10', 'coord': '480,33,885,18'}
{'token_id': '102684', 'token': '475', 'id': '11', 'coord': '527,29,885,18'}
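
Note that `\w+` stops at punctuation, so the output above reports the token `27.` as `27` and skips the `:` token (id 4) entirely. If punctuation should be kept, one option is to capture everything up to the matching closing tag. The following is a minimal Python 3 sketch, not part of the original answer. It assumes every token is closed by a `[:/T<tab>-<id>:]` tag with the same numbers as its opening tag. The names `TOKEN_RE` and `parse_tokens` are illustrative, and the groups are named `tab_id` and `token_id` after the question's wording rather than the answer's `token_id` and `id`.

```python
import re

# Assumed format: [:T<tab>-<id> coord="x,y,w,h":]<token>[:/T<tab>-<id>:]
TOKEN_RE = re.compile(
    r'\[:T(?P<tab_id>\d+)-(?P<token_id>\d+)\s+coord="(?P<coord>[\d,]+)":\]'
    r'(?P<token>.*?)'                      # token text, non-greedy
    r'\[:/T(?P=tab_id)-(?P=token_id):\]',  # closing tag must match the opening one
    re.DOTALL,
)


def parse_tokens(text):
    """Return one dict per token, with the coordinates as a tuple of ints."""
    rows = []
    for m in TOKEN_RE.finditer(text):
        row = m.groupdict()
        row["coord"] = tuple(int(n) for n in row["coord"].split(","))
        rows.append(row)
    return rows


sample = ('[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] '
          '[:T102684-4 coord="228,5,885,18":]:[:/T102684-4:]')
for row in parse_tokens(sample):
    print(row)
```

The backreferences `(?P=tab_id)` and `(?P=token_id)` tie each closing tag to its opening tag, so the non-greedy `.*?` cannot run past the end of the current token.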

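Why doesn't it work? What is wrong with the code you've shown? Can you add some more sample lines?

Done :) It doesn't work because I don't get the correct start and end of the information I need. The string is one paragraph. I did the following:

span_elements = data.find_all('span')
for ele in span_elements:
        print ele.text

Yes, this is exactly what I wanted. Thank you very much! :)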
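
One likely reason the find-based attempt does not locate the right start and end can be shown with a short Python 3 sketch. It assumes `ele.text` looks like the string in the question, and the variable names are illustrative. `split()` without arguments splits on whitespace, including the space inside each opening tag, so no single piece contains a complete tag.

```python
sample = '[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:]'

# The space between the tag name and coord= cuts the opening tag in two.
pieces = sample.split()

for piece in pieces:
    # find() and rfind() return -1 when the marker is absent from the piece
    print(repr(piece), piece.find("[:T"), piece.rfind(":]"))
```

The first piece holds `[:T` but no `:]`, and the second holds `:]` but no `[:T` (the closing tag starts with `[:/T`), so slicing between the two positions cannot work on either piece.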