Python: extracting time-series tables from a string into a dictionary

Tags: python, regex, string, time-series

I have a text file that contains several time series, like this:

Elect Price 
(Jenkins 1989)

1960 6.64784
1961 6.95902
1962 6.8534
1963 6.95924
1964 6.77416
1965 6.96237
1966 6.94241
1967 6.50688
1968 5.72611
1969 5.45512
1970 5.2703
1971 5.75105
1972 5.26886
1973 5.06676
1975 6.14003
1976 5.44883
1977 6.49034
1978 7.17429
1979 7.87244
1980 9.20048
1981 7.35384
1982 6.44922
1983 5.44273
1984 4.3131
1985 5.27546
1986 4.99998
1987 5.78054
1988 5.65552

Hydro Electricity 
(Guyol 1969; Energy Information Administration 1995)

1958 5.74306e+009
1959 5.90702e+009
1960 6.40238e+009
1961 6.77396e+009
1962 7.12661e+009
1963 7.47073e+009
1964 7.72361e+009
1980 1.62e+010
1985 1.85e+010
1986 1.88e+010
1987 1.89e+010
1988 1.96e+010
1989 1.95e+010
1990 2.02e+010
1991 2.05e+010
1992 2.04e+010
1993 2.12e+010

Nuclear Electricity
(Guyol 1969; Energy Information Administration 1995)

1958 4.43664e+006
1959 1.34129e+007
1960 2.56183e+007
1961 4.09594e+007
1962 6.09336e+007
1963 1.09025e+008
1964 1.59522e+008
1980 6.40598e+009
1985 1.33e+010
1986 1.42e+010
1987 1.55e+010
1988 1.68e+010
1989 1.73e+010
1990 1.77e+010
1991 1.86e+010
1992 1.88e+010
1993 1.95e+010

I have loaded it as a single string, and I would like to know the best way to convert it into a dictionary of the following form:

{('Elect Price', '(Jenkins 1989)'): [(1960, 6.64784), (1961, 6.95902), (1962, 6.8534), ...], ...}

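My first instinct was to go through the string line by line, test each line against a few different regular expressions, and go from there. But I would also have to write the logic for what happens after a variable name is matched, then the citation, then the data, and so on.

Is there a better way? Is it possible to use some kind of template to extract the variable name, the citation and the data? I am sure this is a fairly common task, so I assume there are more standard methods or tools for it.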

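You can use the built-in string method split. First split the string on two consecutive newlines. Then iterate over the resulting list two items at a time (a header block followed by its data block) and format each one separately, using split again to break a block on single newlines. The specific formatting should be easy, but tedious.

Maybe something like this: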

def parse_input(s):
    # split into blocks on two consecutive newlines; strip() drops a
    # trailing newline that would otherwise corrupt the last block
    blocks = s.strip().split("\n\n")

    out = {}
    for i in range(0, len(blocks) - 1, 2):  # iterate in chunks of two
        # header block: split by newline, remove extra spaces, convert to tuple
        key = tuple(line.strip() for line in blocks[i].split("\n"))
        # data block: split by newline, split each line on whitespace, and
        # evaluate each piece of data with the builtin 'eval' function.
        # NOTE: eval runs arbitrary expressions, so use it only on trusted input.
        value = [tuple(map(eval, line.split())) for line in blocks[i + 1].split("\n")]
        out[key] = value
    return out

I am new to Stack Overflow, so please let me know how I can improve my answer.

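I eventually found a great website that helps with parsing data stored in a similar format. I was not sure how to parse multi-line data with regular expressions. I did not phrase the question that way because I did not want to restrict it to that approach, but this is what I came up with using that website: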

import re

# One compiled pattern per kind of line; _parse_line tries them in this order.
rx_dict = {'data': re.compile(r'^(\d+)\s'),
           'citation': re.compile(r'^(?P<citation>\(.+\))'),
           'variable': re.compile(r'^(?P<variable>[\w\s]+)$')}


def _parse_line(line):
    """
    Do a regex search against all defined regexes and
    return the key and match result of the first matching regex

    """

    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None


def parse_file(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file_object to be parsed

    Returns
    -------
    data : dict
        Parsed data

    """

    data = {}  # create an empty dict to collect the data
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        line = file_object.readline()
        while line:
            if not line.strip():
                # skip blank lines (possibly several in a row)
                line = file_object.readline()
                continue
            # at each line check for a match with a regex
            key, match = _parse_line(line)

            # extract variable name
            if key == 'variable':
                variable = match.group('variable').strip()

            # extract citation
            if key == 'citation':
                citation = match.group('citation').strip()

            # identify beginning of data
            if key == 'data':
                data[(variable, citation)] = [[], []]
                # read each line of the table until a blank line
                while line.strip():
                    # extract year and value; split() tolerates repeated spaces
                    fields = line.split()
                    year = int(fields[0])
                    value = float(fields[1])

                    data[(variable, citation)][0].append(year)
                    data[(variable, citation)][1].append(value)

                    line = file_object.readline()

            line = file_object.readline()

    return data


if __name__ == "__main__":
    filepath = "data_txt.txt"

    data = parse_file(filepath)

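This approach tests a set of regular expressions against each line to decide whether it contains a variable name, a citation, or data. Once data is found, each line is read and processed until a blank line is reached. This gives me something close to the expected result, except that I chose to store the data as a list of lists rather than a list of tuples.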
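
If the list-of-tuples form from the question is wanted after all, the two parallel lists can be paired up afterwards. The following is only a sketch: the helper name to_pairs and the sample dictionary are made up for illustration, and the sample merely imitates the shape that parse_file returns.

```python
# Hypothetical sample in the shape parse_file returns:
# {(variable, citation): [[years...], [values...]]}
data = {
    ("Elect Price", "(Jenkins 1989)"): [[1960, 1961, 1962],
                                        [6.64784, 6.95902, 6.8534]],
}


def to_pairs(parsed):
    """Convert each [years, values] entry into a list of (year, value) tuples."""
    return {key: list(zip(years, values))
            for key, (years, values) in parsed.items()}


pairs = to_pairs(data)
print(pairs)
```

zip stops at the shorter of the two lists, so a mismatch in length would silently drop entries; that should not happen with the parser above, since it appends to both lists together.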

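I am not quite sure why you used the eval function here, but I appreciate your answer. I eventually came up with something that works, but I am still interested to see what others might come up with. I just wanted a simple way to parse the data, where the first number is an int and the second number is a float.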
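
For what it is worth, the split-based idea can be written without eval by converting the two columns explicitly. This is a minimal sketch rather than a definitive solution: the function name parse_series and the shortened sample are assumptions, and it presumes that header blocks and data blocks strictly alternate and that every data line has exactly two fields.

```python
import re


def parse_series(text):
    """Parse alternating (header, data) blocks into a dict.

    Keys are tuples of header lines; values are lists of (int, float) tuples.
    """
    # Split on one or more blank lines (a blank line may contain spaces).
    blocks = re.split(r"\n\s*\n", text.strip())
    out = {}
    # Headers sit at even indexes, each followed by its data block.
    for header, table in zip(blocks[0::2], blocks[1::2]):
        key = tuple(line.strip() for line in header.splitlines())
        rows = []
        for line in table.splitlines():
            year, value = line.split()  # assumes exactly two fields per line
            rows.append((int(year), float(value)))
        out[key] = rows
    return out


# Shortened sample in the same layout as the file in the question.
sample = """Elect Price
(Jenkins 1989)

1960 6.64784
1961 6.95902

Hydro Electricity
(Guyol 1969; Energy Information Administration 1995)

1958 5.74306e+009
1959 5.90702e+009
"""

result = parse_series(sample)
print(result[("Elect Price", "(Jenkins 1989)")])
```

float accepts the exponent notation used in the file (for example 5.74306e+009), so no special handling is needed for the larger series. A line that does not have exactly two fields raises a ValueError instead of being evaluated as code.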