Efficient string parsing in Python

I currently have the following code to parse what the IMDb API returns over HTTP:

text = 'unicode: {"Title":"The Fountain","Year":"2006","Rated":"R","Released":"22 Nov 2006","Genre":"Drama, Romance, Sci-Fi","Director":"Darren Aronofsky","Writer":"Darren Aronofsky, Darren Aronofsky","Actors":"Hugh Jackman, Rachel Weisz, Sean Patrick Thomas, Ellen Burstyn","Plot":"Spanning over one thousand years, and three parallel stories, The Fountain is a story of love, death, spirituality, and the fragility of our existence in this world.","Poster":"http://ia.media-imdb.com/images/M/MV5BMTU5OTczMTcxMV5BMl5BanBnXkFtZTcwNDg3MTEzMw@@._V1_SX320.jpg","Runtime":"1 hr 36 mins","Rating":"7.4","Votes":"100139","ID":"tt0414993","Response":"True"}'

def stripData(tag="Title"):
    tag_start = text.find(tag)
    data_start = tag_start + len(tag)+3
    data_end = text.find('"',data_start)
    data = text[data_start:data_end]
    return tag, data  

I'd like to know: is there a better way to do this that I'm missing?

After stripping the unnecessary leading and trailing characters, you can try converting all of the data into a dict:

>>> import ast
>>> ast.literal_eval(text.split(' ', 1)[1])
{'Plot': 'Spanning over one thousand years, and three parallel stories, The Fountain is a story of love, death, spirituality, and the fragility of our existence in this world.', 'Votes': '100139', 'Rated': 'R', 'Response': 'True', 'Title': 'The Fountain', 'Poster': 'http://ia.media-imdb.com/images/M/MV5BMTU5OTczMTcxMV5BMl5BanBnXkFtZTcwNDg3MTEzMw@@._V1_SX320.jpg', 'Writer': 'Darren Aronofsky, Darren Aronofsky', 'ID': 'tt0414993', 'Director': 'Darren Aronofsky', 'Released': '22 Nov 2006', 'Actors': 'Hugh Jackman, Rachel Weisz, Sean Patrick Thomas, Ellen Burstyn', 'Year': '2006', 'Genre': 'Drama, Romance, Sci-Fi', 'Runtime': '1 hr 36 mins', 'Rating': '7.4'}

>>> import json
>>> json.loads(text.split(' ', 1)[1])
{u'Plot': u'Spanning over one thousand years, and three parallel stories, The Fountain is a story of love, death, spirituality, and the fragility of our existence in this world.', u'Votes': u'100139', u'Rated': u'R', u'Response': u'True', u'Title': u'The Fountain', u'Poster': u'http://ia.media-imdb.com/images/M/MV5BMTU5OTczMTcxMV5BMl5BanBnXkFtZTcwNDg3MTEzMw@@._V1_SX320.jpg', u'Writer': u'Darren Aronofsky, Darren Aronofsky', u'ID': u'tt0414993', u'Director': u'Darren Aronofsky', u'Released': u'22 Nov 2006', u'Actors': u'Hugh Jackman, Rachel Weisz, Sean Patrick Thomas, Ellen Burstyn', u'Year': u'2006', u'Genre': u'Drama, Romance, Sci-Fi', u'Runtime': u'1 hr 36 mins', u'Rating': u'7.4'}
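
As a self-contained sketch of the json route: this assumes the payload is valid JSON preceded by a single space-separated label, as in the sample. The sample string is shortened here, and the helper name `parse_payload` is made up for illustration.

```python
import json

# Shortened stand-in for the sample response in the question (assumption:
# the real payload has the same 'label: {json}' shape).
text = 'unicode: {"Title":"The Fountain","Year":"2006","Response":"True"}'

def parse_payload(raw):
    """Drop the label before the first space and decode the rest as JSON."""
    _label, _sep, payload = raw.partition(' ')
    return json.loads(payload)

movie = parse_payload(text)
print(movie["Title"])
# Every value in this response is a JSON string, so convert explicitly
# where a number is needed.
print(int(movie["Year"]))
```

Note that "Response" comes back as the string "True", not a boolean, because the API quotes it.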

import re

line = 'unicode: {"Title":"The Fountain","Year":"2006","Rated":"R","Released":"22 Nov 2006","Genre":"Drama, Romance, Sci-Fi","Director":"Darren Aronofsky","Writer":"Darren Aronofsky, Darren Aronofsky","Actors":"Hugh Jackman, Rachel Weisz, Sean Patrick Thomas, Ellen Burstyn","Plot":"Spanning over one thousand years, and three parallel stories, The Fountain is a story of love, death, spirituality, and the fragility of our existence in this world.","Poster":"http://ia.media-imdb.com/images/M/MV5BMTU5OTczMTcxMV5BMl5BanBnXkFtZTcwNDg3MTEzMw@@._V1_SX320.jpg","Runtime":"1 hr 36 mins","Rating":"7.4","Votes":"100139","ID":"tt0414993","Response":"True"}'

def parser(text):
    match = re.search(r'\{\"([^}]+)\"\}', text)
    if match:
        return dict(x.split('":"') for x in match.group(1).split('","'))

newdict = parser(line)

for k, v in newdict.items():
    print(k, v)

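I used a regular expression, but it can easily be replaced by any method that strips everything up to the `{` and everything after the `}` in the retrieved string.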
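
For instance, the regex could be swapped for plain `str.find` / `str.rfind` slicing. The sketch below keeps the same splitting idea as the answer; the function names `between_braces` and `parser_no_regex` and the shortened sample are illustrative assumptions, not part of the original code.

```python
def between_braces(text):
    """Return the text between the first '{' and the last '}', or None."""
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end <= start:
        return None
    return text[start + 1:end]

def parser_no_regex(text):
    body = between_braces(text)
    # Expect at least one "key":"value" pair wrapped in double quotes.
    if body is None or len(body) < 2 or body[0] != '"' or body[-1] != '"':
        return None
    # Drop the outermost quotes, split pairs on '","',
    # then split each pair into key and value on '":"'.
    return dict(pair.split('":"', 1) for pair in body[1:-1].split('","'))

line = 'unicode: {"Title":"The Fountain","Year":"2006","Genre":"Drama, Romance, Sci-Fi"}'
print(parser_no_regex(line))
```

Like the regex version, this only works while no key or value contains `","`, `":"`, or an escaped quote; a real JSON parser does not have that limitation.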

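It seems to me everyone is trying too hard... If what you really have is something that looks like this string:

line = 'unicode: {"key1":"Value1", "key2":"value2", etc...}'

then strip the unicode: label from the front of the string:

newline = line[9:]

then evaluate the result directly into a dict:

data_dict = eval(newline)

and then access the data by key:

print(data_dict['Title'])

The data is already in exactly the right format for a Python dict, and you can read the values directly from that container.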
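
One caution about `eval`: it executes whatever expression the string contains, so it is only reasonable when the source is fully trusted. A minimal sketch of the same steps with `ast.literal_eval`, which accepts only Python literals, might look like this. The `PREFIX` constant and the shortened sample are assumptions for illustration.

```python
import ast

PREFIX = 'unicode: '  # assumed fixed label; its length is 9, matching line[9:]

line = 'unicode: {"Title":"The Fountain","Year":"2006","Rated":"R"}'

# Remove the label only if it is actually present.
if line.startswith(PREFIX):
    newline = line[len(PREFIX):]
else:
    newline = line

# literal_eval builds the dict without executing arbitrary code.
data_dict = ast.literal_eval(newline)
print(data_dict['Title'])
```

Measuring the prefix with `len(PREFIX)` instead of hard-coding 9 keeps the slice correct if the label ever changes.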

That is very true and helpful, but what I'm most curious about is a general solution for cases where it isn't possible to do what you describe.

It is hard to offer a general solution for data structured in such a specific way :-) All of these solutions depend on the data format...
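
For the somewhat more general case raised in these comments, one possible sketch is shown below. It assumes the embedded data is still valid JSON but may sit anywhere inside arbitrary surrounding text. It relies on `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and reports where it ended. The helper name `find_json_objects` and the sample string are hypothetical.

```python
import json

def find_json_objects(text):
    """Yield every JSON object that can be decoded starting at a '{' in text."""
    decoder = json.JSONDecoder()
    pos = text.find('{')
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find('{', pos + 1)
            continue
        yield obj
        # Continue scanning after the object that was just decoded.
        pos = text.find('{', end)

sample = 'unicode: {"Title":"The Fountain","Year":"2006"} trailing {"ID":"tt0414993"}'
for obj in find_json_objects(sample):
    print(obj)
```

This still depends on the data format in the sense that the payload must be JSON; it only removes the dependence on a fixed prefix or suffix.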