Importing a file as a contiguous-memory array in Python (numpy)


I want to import data from a text file and read it in as a contiguous-memory array. This is the data, with a line break between each respondent:

['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 34, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 43, '3', 'sport', '2', '2', 'onbeantwoord', '']

['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 43, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']

I tried to import the data from the text file with the following code:

vragenlijst_data= np.genfromtxt('antwoorden.txt', delimiter=',', dtype=None, names=('geslacht', 'leeftijd', 'stelling1', 'doorvraag1', 'stelling2', 'stelling3', 'doorvraag3', 'opmerking'))

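However, this way I cannot use np.mean (from the numpy library) in a vectorized manner, because I do not have a contiguous-memory array. Does anyone know a way to read the data so that I end up with a contiguous-memory array (preferably a numpy one)?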

With a copy-paste of your lines:

In [362]: txt
Out[362]: "['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 34, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 43, '3', 'sport', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 43, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']"

In [364]: data = np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None, encoding=None)
In [365]: data
Out[365]: 
array([("['vrouw'", 43, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 34, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 43, " '3'", " 'sport'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
       ("['vrouw'", 43, " '2'", " 'onbeantwoord'", " '3'", " '3'", " 'collega'", " 'nee']")],
      dtype=[('f0', '<U8'), ('f1', '<i8'), ('f2', '<U4'), ('f3', '<U15'), ('f4', '<U4'), ('f5', '<U4'), ('f6', '<U15'), ('f7', '<U7')])
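genfromtxt does not strip the brackets, so the 'f0' strings still contain the opening bracket (and the 'f7' strings the closing one).

The extra layer of quotes also makes it harder to convert the other fields to integers.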
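As a rough sketch of one way around both problems, the brackets and quotes can be stripped from each line before it reaches genfromtxt. The variable names `raw`, `cleaned` and `leeftijd` are illustrative, the two sample lines stand in for the real file, and reading only the age column with `usecols` is an assumption about what is needed:

```python
import re
import numpy as np

# Two sample lines in the same shape as the question's file (stand-in data)
raw = """['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 34, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']"""

# Drop brackets and quotes, and skip the blank separator lines
cleaned = [re.sub(r"[\[\]']", "", line)
           for line in raw.splitlines() if line.strip()]

# Read only column 1 (the age) as plain integers
leeftijd = np.genfromtxt(cleaned, delimiter=',', usecols=(1,), dtype=int)

print(np.mean(leeftijd))
```

Because only one numeric column is read, the result is an ordinary integer array rather than a structured one, so np.mean applies to it directly.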

If the file contained cleaner csv values, it would be easier to read and use:

In [372]: txt1 = """vrouw, 43, 2, onbeantwoord, 2, 2, onbeantwoord, ''
     ...: vrouw, 34, 2, onbeantwoord, 2, 2, onbeantwoord, '' """
In [373]: data1 = np.genfromtxt(txt1.splitlines(), delimiter=',', dtype=None, encoding=None)
In [374]: data1
Out[374]: 
array([('vrouw', 43, 2, ' onbeantwoord', 2, 2, ' onbeantwoord', " ''"),
       ('vrouw', 34, 2, ' onbeantwoord', 2, 2, ' onbeantwoord', " ''")],
      dtype=[('f0', '<U5'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<U13'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U13'), ('f7', '<U3')])
In [375]: data1['f0']
Out[375]: array(['vrouw', 'vrouw'], dtype='<U5')
In [376]: data1['f1']
Out[376]: array([43, 34])
In [377]: data1['f5']
Out[377]: array([2, 2])
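On the contiguity question itself: a single field taken from a structured array is a strided view into the records, and np.mean works on it as it is. If a contiguous block is really required, np.ascontiguousarray makes a copy. A small sketch, using a hand-built three-field array as a stand-in for data1 (the names `records`, `ages_view` and `ages` are illustrative):

```python
import numpy as np

# Hand-built stand-in for the structured array returned by genfromtxt
records = np.array(
    [('vrouw', 43, 2), ('vrouw', 34, 2)],
    dtype=[('f0', 'U5'), ('f1', 'i8'), ('f2', 'i8')],
)

# Selecting one field gives a view that steps over the other fields
ages_view = records['f1']
print(ages_view.flags['C_CONTIGUOUS'])

# np.mean does not need contiguous memory, so this already works
print(np.mean(ages_view))

# Copy into a contiguous array only if something downstream requires it
ages = np.ascontiguousarray(ages_view)
print(ages.flags['C_CONTIGUOUS'])
```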
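Your data is badly formatted; it looks like it is just the output of print. I don't think you will find a library function that makes this data usable as it stands (genfromtxt builds its array from the badly formatted data). So: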

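import re
import numpy as np

with open('antwoorden.txt') as f:
    lines = f.readlines()

vragenlijst = []
for line in lines:
    # Remove quotes, commas and brackets, leaving whitespace-separated values
    line = re.sub(r"[',\[\]]", '', line.strip())
    if not line:
        continue  # skip the blank lines between respondents
    fields = line.split()
    # An empty last field ('') disappears after cleaning, so pad it back
    if len(fields) == 7:
        fields.append('')
    vragenlijst.append(tuple(fields))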
vragenlijst is now a Python list of 8-element tuples in which every member is a string. Tuples are required for numpy structured arrays. So now you can build the dtype like this:

vragenlijst_dtype = np.dtype([('geslacht', 'U10'), ('leeftijd', 'i4'), 
        ('stelling1', 'U10'), ('doorvraag1', 'U10'), ('stelling2', 'U10'), 
        ('stelling3', 'U10'), ('doorvraag3', 'U10'), ('opmerking', 'U10')])
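where 'U10' means a unicode string of 10 characters and 'i4' means a 4-byte integer. Change the types if they do not fit the actual data; for example, 'onbeantwoord' has 12 characters, so it is cut short in a 'U10' field.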
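To illustrate why the string width matters, here is a minimal sketch showing that numpy silently cuts strings down to the declared width, and one way to pick a width from the data. The list `values` is sample data and the names `short`, `width` and `fitted` are illustrative:

```python
import numpy as np

values = ['vrouw', 'onbeantwoord', 'collega']  # sample answers

# With a 10-character field, longer strings are cut off without warning
short = np.array(values, dtype='U10')
print(short[1])

# Deriving the width from the longest value avoids the truncation
width = max(len(v) for v in values)
fitted = np.array(values, dtype='U%d' % width)
print(fitted[1])
```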

Then:

vragenlijst = np.array(vragenlijst, dtype=vragenlijst_dtype)
list_mean = np.mean(vragenlijst['leeftijd'])

which outputs 37.0.
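Since each line looks like the printed form of a Python list, another option (not the approach used above, just an alternative sketch) is to parse each line with ast.literal_eval from the standard library instead of a regular expression. The inline string `raw` stands in for the contents of antwoorden.txt, and the 'U12' widths are an assumption chosen to fit 'onbeantwoord':

```python
import ast
import numpy as np

# Stand-in for the contents of antwoorden.txt (assumed format)
raw = """['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 34, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']"""

# Each non-blank line is a Python list literal; parse it safely into a tuple
rows = [tuple(ast.literal_eval(line))
        for line in raw.splitlines() if line.strip()]

vragenlijst_dtype = np.dtype([
    ('geslacht', 'U12'), ('leeftijd', 'i4'),
    ('stelling1', 'U12'), ('doorvraag1', 'U12'), ('stelling2', 'U12'),
    ('stelling3', 'U12'), ('doorvraag3', 'U12'), ('opmerking', 'U12'),
])

vragenlijst = np.array(rows, dtype=vragenlijst_dtype)
print(np.mean(vragenlijst['leeftijd']))
```

This keeps empty fields such as '' in place without padding, but it only works when every line is a valid Python literal.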

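You can't use np.mean because you have a dtype=object array containing strings. Or what output do you expect from np.mean on a str?

Do you want to use np.mean on all the numeric columns or only on the leeftijd column?

I only want to use np.mean on the 'leeftijd' column.

Describe the result of genfromtxt: shape, dtype, etc. The brackets and quotes make reading this file messy. What kind of mean do you want?

The result of genfromtxt is: [...]. I want to compute the mean of 'leeftijd' in numpy. The code mean = np.mean([int(i[1]) for i in vragenlijst_data]) works, but I want to use np.mean in a vectorized way, so that I can write something like np.mean(vragenlijst_data[1]).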