How to parse an ARFF file in Python without using external libraries

python parsing machine-learning

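I need to parse an ARFF file like the one below without using any external libraries. I am not sure how to associate the attributes with the values: for example, how can I tell that the first value in each row is the age and the second is the sex? Could you also point me to some Python code that parses a similar case?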

@relation cleveland-14-heart-disease
@attribute 'age' real
@attribute 'sex' { female, male}
@attribute 'cp' { typ_angina, asympt, non_anginal, atyp_angina}
@attribute 'trestbps' real
@attribute 'chol' real
@attribute 'fbs' { t, f}
@attribute 'restecg' { left_vent_hyper, normal, st_t_wave_abnormality}
@attribute 'thalach' real
@attribute 'exang' { no, yes}
@attribute 'oldpeak' real
@attribute 'slope' { up, flat, down}
@attribute 'ca' real
@attribute 'thal' { fixed_defect, normal, reversable_defect}
@attribute 'class' { negative, positive}
@data
63,male,typ_angina,145,233,t,left_vent_hyper,150,no,2.3,down,0,fixed_defect,negative
37,male,non_anginal,130,250,f,normal,187,no,3.5,down,0,normal,negative
41,female,atyp_angina,130,204,f,left_vent_hyper,172,no,1.4,up,0,normal,negative
56,male,atyp_angina,120,236,f,normal,178,no,0.8,up,0,normal,negative
57,female,asympt,120,354,f,normal,163,yes,0.6,up,0,normal,negative
57,male,asympt,140,192,f,normal,148,no,0.4,flat,0,fixed_defect,negative
56,female,atyp_angina,140,294,f,left_vent_hyper,153,no,1.3,flat,0,normal,negative
44,male,atyp_angina,120,263,f,normal,173,no,0,up,0,reversable_defect,negative
52,male,non_anginal,172,199,t,normal,162,no,0.5,up,0,reversable_defect,negative

Here is the sample code I wrote:

arr=[]
arff_file = open("heart_train.arff")
count=0
for line in arff_file:
        count+=1
        #line=line.strip("\n")
        #line=line.split(',')
        if not (line.startswith("@")):
                if not (line.startswith("%")):
                        line=line.strip("\n")
                        line=line.split(',')
                        arr.append(line)



print(arr[1:30])

However, the output is very different from what I expected:

[['37', 'male', 'non_anginal', '130', '250', 'f', 'normal', '187', 'no', '3.5', 'down', '0', 'normal', 'negative'], ['41', 'female', 'atyp_angina', '130', '204', 'f', 'left_vent_hyper', '172', 'no', '1.4', 'up', '0', 'normal', 'negative'], ['56', 'male', 'atyp_angina', '120', '236', 'f', 'normal', '178', 'no', '0.8', 'up', '0', 'normal', 'negative'], ['57', 'female', 'asympt', '120', '354', 'f', 'normal', '163', 'yes', '0.6', 'up', '0', 'normal', 'negative'], ['57', 'male', 'asympt', '140', '192', 'f', 'normal', '148', 'no', '0.4', 'flat', '0', 'fixed_defect', 'negative'], ['56', 'female', 'atyp_angina', '140', '294', 'f', 'left_vent_hyper', '153', 'no', '1.3', 'flat', '0', 'normal', 'negative'], ['44', 'male', 'atyp_angina', '120', '263', 'f', 'normal', '173', 'no', '0', 'up', '0', 'reversable_defect', 'negative'], ['52', 'male', 'non_anginal', '172', '199', 't', 'normal', '162', 'no', '0.5', 'up', '0', 'reversable_defect', 'negative'], ['57', 'male', 'non_anginal', '150', '168', 'f', 'normal', '174', 'no', '1.6', 'up', '0', 'normal', 'negative'], ['54', 'male', 'asympt', '140', '239', 'f', 'normal', '160', 'no', '1.2', 'up', '0', 'normal', 'negative'], ['48', 'female', 'non_anginal', '130', '275', 'f', 'normal', '139', 'no', '0.2', 'up', '0', 'normal', 'negative'], ['49', 'male', 'atyp_angina', '130', '266', 'f', 'normal', '171', 'no', '0.6', 'up', '0', 'normal', 'negative'], ['64', 'male', 'typ_angina', '110', '211', 'f', 'left_vent_hyper', '144', 'yes', '1.8', 'flat', '0', 'normal', 'negative'], ['58', 'female', 'typ_angina', '150', '283', 't', 'left_vent_hyper', '162', 'no', '1', 'up', '0', 'normal', 'negative'], ['50', 'female', 'non_anginal', '120', '219', 'f', 'normal', '158', 'no', '1.6', 'flat', '0', 'normal', 'negative'], ['58', 'female', 'non_anginal', '120', '340', 'f', 'normal', '172', 'no', '0', 'up', '0', 'normal', 'negative'], ['66', 'female', 'typ_angina', '150', '226', 'f', 'normal', '114', 'no', '2.6', 'down', '0', 'normal', 'negative'], ['43', 
'male', 'asympt', '150', '247', 'f', 'normal', '171', 'no', '1.5', 'up', '0', 'normal', 'negative'], ['69', 'female', 'typ_angina', '140', '239', 'f', 'normal', '151', 'no', '1.8', 'up', '2', 'normal', 'negative'], ['59', 'male', 'asympt', '135', '234', 'f', 'normal', '161', 'no', '0.5', 'flat', '0', 'reversable_defect', 'negative'], ['44', 'male', 'non_anginal', '130', '233', 'f', 'normal', '179', 'yes', '0.4', 'up', '0', 'normal', 'negative'], ['42', 'male', 'asympt', '140', '226', 'f', 'normal', '178', 'no', '0', 'up', '0', 'normal', 'negative'], ['61', 'male', 'non_anginal', '150', '243', 't', 'normal', '137', 'yes', '1', 'flat', '0', 'normal', 'negative'], ['40', 'male', 'typ_angina', '140', '199', 'f', 'normal', '178', 'yes', '1.4', 'up', '0', 'reversable_defect', 'negative'], ['71', 'female', 'atyp_angina', '160', '302', 'f', 'normal', '162', 'no', '0.4', 'up', '2', 'normal', 'negative'], ['59', 'male', 'non_anginal', '150', '212', 't', 'normal', '157', 'no', '1.6', 'up', '0', 'normal', 'negative'], ['51', 'male', 'non_anginal', '110', '175', 'f', 'normal', '123', 'no', '0.6', 'up', '0', 'normal', 'negative'], ['65', 'female', 'non_anginal', '140', '417', 't', 'left_vent_hyper', '157', 'no', '0.8', 'up', '1', 'normal', 'negative'], ['53', 'male', 'non_anginal', '130', '197', 't', 'left_vent_hyper', '152', 'no', '1.2', 'down', '0', 'normal', 'negative']]

Do you know how I can get output like the following, as produced by the arff library (from Weka)?

You say "no external libraries", but can you at least cut and paste into your own code? You may find this one useful (200 lines, about 5.6 KB).

Edit:

You may find this format reference very useful:

Edit2:

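Just for fun, I wrote my own .arff parser; it is almost as long as the WEKA code but should be more readable: just six functions, a dispatch table, and one very modular class. You can iterate over a class instance to get each data row as a namedtuple.

See what you think: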

from collections import namedtuple
from keyword import iskeyword
import re

def NotDone(msg):
    # NotImplemented is a constant, not an exception class; raise NotImplementedError instead
    raise NotImplementedError(msg)

def nominal(spec):
    """
    Create an ARFF nominal (enumerated) data type
    """
    spec = spec.lstrip("{ \t").rstrip("} \t")
    good_values = set(val.strip() for val in spec.split(","))

    def fn(s):
        s = s.strip()
        if s in good_values:
            return s
        else:
            raise ValueError("'{}' is not a recognized value".format(s))

    # patch docstring
    fn.__name__ = "nominal"
    fn.__doc__ = """
    ARFF nominal (enumerated) data type

    Legal values are {}
    """.format(sorted(good_values))
    return fn

def numeric(s):
    """
    Convert string to int or float
    """
    try:
        return int(s)
    except ValueError:
        return float(s)

field_maker = {
    "date":       (lambda spec: NotDone("date data type not implemented")),
    "integer":    (lambda spec: int),
    "nominal":    (lambda spec: nominal(spec)),
    "numeric":    (lambda spec: numeric),
    "string":     (lambda spec: str),
    "real":       (lambda spec: float),
    "relational": (lambda spec: NotDone("relational data type not implemented")),
}

def file_lines(fname):
    # lazy file reader; ensures file is closed when done,
    # returns lines without trailing spaces or newline
    with open(fname) as inf:
        for line in inf:
            yield line.rstrip()

def no_data_yet(*items):
    raise ValueError("ArffRow not fully defined (haven't seen a @data directive yet)")

def make_field_name(s):
    """
    Mangle string to make it a valid Python identifier
    """
    s = s.lower()                               # force to lowercase
    s = "_".join(re.findall("[a-z0-9]+", s))    # strip all invalid chars; join what's left with "_"
    if iskeyword(s) or re.match("[0-9]", s):    # if the result is a keyword or starts with a digit
        s = "f_"+s                              #   make it a safe field name
    return s  

class ArffReader:
    line_types = ["blank", "comment", "relation", "attribute", "data"]

    def __init__(self, fname):
        # get input file
        self.fname = fname
        self.lines = file_lines(fname)

        # prepare to read file header
        self.relation = '(not specified)'
        self.data_names = []
        self.data_types = []
        self.dtype = no_data_yet

        # read file header
        line_tests = [
            (getattr(self, "line_is_{}".format(item)), getattr(self, "line_do_{}".format(item)))
            for item in self.__class__.line_types
        ]
        for line in self.lines:
            done = False    # reset per line so an unrecognized line cannot hit an undefined or stale flag
            for is_, do in line_tests:
                if is_(line):
                    done = do(line)
                    break
            if done:
                break

        # use header fields to build data type (and make it print as requested)
        class ArffRow(namedtuple('ArffRow', self.data_names)):
            __slots__ = ()
            def __str__(self):
                items = (getattr(self, field) for field in self._fields)
                return "({})".format(", ".join(repr(it) for it in items))
        self.dtype = ArffRow

    #
    # figure out input-line type
    #

    def line_is_blank(self, line):
        return not line

    def line_is_comment(self, line):
        return line.lower().startswith('%')

    def line_is_relation(self, line):
        return line.lower().startswith('@relation')

    def line_is_attribute(self, line):
        return line.lower().startswith('@attribute')

    def line_is_data(self, line):
        return line.lower().startswith('@data')

    #    
    # handle input-line type
    #    

    def line_do_blank(self, line):
        pass

    def line_do_comment(self, line):
        pass

    def line_do_relation(self, line):
        self.relation = line[10:].strip()

    def line_do_attribute(self, line):
        m = re.match(                # raw strings keep \s and \w from being read as string escapes
            r"^@attribute"          #   line starts with '@attribute'
            r"\s+"                  #
            r"("                    # name is one of:
                r"(?:'[^']+')"      #   ' string in single-quotes '
                r"|(?:\"[^\"]+\")"  #   " string in double-quotes "
                r"|(?:[^ \t'\"]+)"  #   single_word_string (no spaces)
            r")"                    #
            r"\s+"                  #
            r"("                    # type is one of:
                r"(?:{[^}]+})"      #   { set, of, nominal, values }
                r"|(?:\w+)"         #   datatype
            r")"                    #
            r"\s*"                  #
            r"("                    # spec string
                r".*"               #   anything to end of line
            r")$",                  #
            line, flags=re.I)       #   case-insensitive
        if m:
            name, type_, spec = m.groups()
            self.data_names.append(make_field_name(name))
            if type_[0] == '{':
                type_, spec = 'nominal', type_
            self.data_types.append(field_maker[type_.lower()](spec))  # header keywords are case-insensitive
        else:
            raise ValueError("failed parsing attribute line '{}'".format(line))

    def line_do_data(self, line):
        return True  # flag end of header

    #
    # make the class iterable
    #

    def __iter__(self):
        return self

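    def __next__(self):
        """
        Return one data row at a time
        """
        line = next(self.lines)
        while not line or line.startswith('%'):   # skip blank and comment lines in the data section
            line = next(self.lines)
        data = line.split(',')
        return self.dtype(*(fn(dat) for fn,dat in zip(self.data_types, data)))

    next = __next__  # Python 2 spelling of the iterator method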

It can be used like this:

for row in ArffReader('mydata.arff'):
    print(row)

which results in

(63.0, 'male', 'typ_angina', 145.0, 233.0, 't', 'left_vent_hyper', 150.0, 'no', 2.3, 'down', 0.0, 'fixed_defect', 'negative')
(37.0, 'male', 'non_anginal', 130.0, 250.0, 'f', 'normal', 187.0, 'no', 3.5, 'down', 0.0, 'normal', 'negative')
(41.0, 'female', 'atyp_angina', 130.0, 204.0, 'f', 'left_vent_hyper', 172.0, 'no', 1.4, 'up', 0.0, 'normal', 'negative')
(56.0, 'male', 'atyp_angina', 120.0, 236.0, 'f', 'normal', 178.0, 'no', 0.8, 'up', 0.0, 'normal', 'negative')
(57.0, 'female', 'asympt', 120.0, 354.0, 'f', 'normal', 163.0, 'yes', 0.6, 'up', 0.0, 'normal', 'negative')
(57.0, 'male', 'asympt', 140.0, 192.0, 'f', 'normal', 148.0, 'no', 0.4, 'flat', 0.0, 'fixed_defect', 'negative')
(56.0, 'female', 'atyp_angina', 140.0, 294.0, 'f', 'left_vent_hyper', 153.0, 'no', 1.3, 'flat', 0.0, 'normal', 'negative')
(44.0, 'male', 'atyp_angina', 120.0, 263.0, 'f', 'normal', 173.0, 'no', 0.0, 'up', 0.0, 'reversable_defect', 'negative')
(52.0, 'male', 'non_anginal', 172.0, 199.0, 't', 'normal', 162.0, 'no', 0.5, 'up', 0.0, 'reversable_defect', 'negative')
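
The floats in this output (63.0 rather than 63) come from each `real` column being paired with `float`. As a rough standalone sketch of that pairing, cut down to a few types and one short row (the names `CONVERTERS` and `convert_row` are invented for this illustration and are not part of the class above):

```python
# Hypothetical, simplified converter table; not the ArffReader code itself.
CONVERTERS = {
    "real": float,      # ARFF 'real'    -> Python float
    "integer": int,     # ARFF 'integer' -> Python int
    "string": str,      # ARFF 'string'  -> left as text
}

def convert_row(type_names, raw_line):
    """Split one comma-separated data line and convert each field by position."""
    fields = [f.strip() for f in raw_line.split(",")]
    return tuple(CONVERTERS[t](f) for t, f in zip(type_names, fields))

# Column types for a cut-down row: age (real), sex (kept as text), trestbps (real)
types = ["real", "string", "real"]
row = convert_row(types, "63,male,145")
print(row)  # (63.0, 'male', 145.0)
```

The `zip` call is what ties the n-th type to the n-th value, so the header order and the column order have to agree.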

The fields can also be addressed by name, i.e.

for patient in ArffReader('mydata.arff'):
    print("{} year old {}".format(patient.age, patient.sex))

which gives

63.0 year old male
37.0 year old male
41.0 year old female
56.0 year old male
57.0 year old female
57.0 year old male
56.0 year old female
44.0 year old male
52.0 year old male
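
Access by name works because the n-th `@attribute` line names the n-th column of every data row. A minimal sketch of that association using only `collections.namedtuple`, with the header names typed in by hand rather than parsed (the `Patient` type and the sample string are assumptions made for illustration):

```python
from collections import namedtuple

# Attribute names in header order; position i here names column i of each data row.
names = ["age", "sex", "cp"]
Patient = namedtuple("Patient", names)

raw = "63,male,typ_angina"
patient = Patient(*raw.split(","))   # values are bound to names by position

print(patient.age, patient.sex)      # 63 male
print(patient._asdict()["cp"])       # typ_angina
```

No type conversion is done in this sketch, so `patient.age` is still the string `'63'`.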

You can see the field names with:

>>> print(repr(patient))
ArffRow(age=63.0, sex='male', cp='typ_angina', trestbps=145.0, chol=233.0, fbs='t', restecg='left_vent_hyper', thalach=150.0, exang='no', oldpeak=2.3, slope='down', ca=0.0, thal='fixed_defect', f_class='negative')

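The field names match the ARFF header, forced to lowercase (with `f_` prepended to `class`, because `class` is a Python keyword and therefore cannot be used as a field name).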
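
To see why the prefix is needed, here is a small sketch showing that `namedtuple` refuses a keyword as a field name and that a prefix avoids the problem. The helper `safe_name` is a hypothetical, stripped-down stand-in for `make_field_name` and only handles the keyword case:

```python
from collections import namedtuple
from keyword import iskeyword

def safe_name(name):
    """Prefix names that namedtuple would reject as keywords (hypothetical helper)."""
    return "f_" + name if iskeyword(name) else name

print(iskeyword("class"))   # True
print(iskeyword("age"))     # False

try:
    namedtuple("Row", ["age", "class"])   # 'class' is rejected as a field name
except ValueError as err:
    print("rejected:", err)

Row = namedtuple("Row", [safe_name(n) for n in ["age", "class"]])
print(Row._fields)          # ('age', 'f_class')
```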

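This looks easy to parse. What have you tried? Stack Overflow works better when you post some code.

Please share what you have tried so far; we will help you do it in a better way.

@shaktimaan Also, this line does not work either: `if ((line.startswith != "@") and (line.startswith != "%")):`

MonaJalal: That is because the syntax is `line.startswith("@")`. If you want the negation, use `if not (line.startswith("@"))`.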
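
As a small illustration of that point, assuming a handful of made-up input lines: `startswith` has to be called, and it also accepts a tuple of prefixes, so both checks fit in one call.

```python
lines = [
    "% a comment",
    "@relation cleveland-14-heart-disease",
    "@attribute 'age' real",
    "@data",
    "63,male,typ_angina",
    "",
]

# startswith is a method and must be called; a tuple tests several prefixes at once.
data_lines = [ln for ln in lines if ln and not ln.startswith(("@", "%"))]
print(data_lines)   # ['63,male,typ_angina']

# Comparing the method object itself to a string is always "not equal",
# so a condition written that way lets every line through.
print("@data".startswith != "@")   # True
```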

Right now you are creating a list of lists. Try creating a list of tuples: change your last statement to `arr.append(tuple(line))`.
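
A short sketch of that change, using two made-up lines in place of the file so it runs on its own:

```python
raw_lines = ["63,male,typ_angina\n", "37,male,non_anginal\n"]

arr = []
for line in raw_lines:
    fields = line.strip().split(",")   # split() returns a list
    arr.append(tuple(fields))          # store an immutable tuple instead

print(arr)
# [('63', 'male', 'typ_angina'), ('37', 'male', 'non_anginal')]
```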

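I know I can do that, but I am planning to brush up on my knowledge. I think that is a good thing. Thanks for the suggestion too.

@HughBothwell That spread-out, commented regex is just great.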