How to read a CSV file in Python

Tags: python, numpy, pandas, anaconda

I have data stored in a CSV file in the following format:

892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
Data type of each column:

1. int        6. int
2. int        7. int
3. String     8. float
4. String     9. float
5. float      10.String
              11.String
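The first column, with the values 892, 893, ..., 897, should be stored in the array as int. The third column, for example "Wilkes, Mrs. James (Ellen Needs)", should be stored as a string. However, the strings in the third column do not have a fixed length, i.e. I do not know the maximum number of characters stored in this column.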

This is what I have done:

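 import csv
 import numpy as np

 with open('trainData.csv', newline='') as f:
     csv_file_object = csv.reader(f)
     header = next(csv_file_object)  # the first row holds the column names

     data = []
     for row in csv_file_object:
         data.append(row)

 data = np.array(data)  # convert once, after all rows have been collected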
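However, the code above reads every column as a string and stores the values as strings, even though many of the columns are not strings. On the other hand, if I use genfromtxt, the third column is a problem because it contains a comma inside double quotes.

I want each column stored with its own data type, i.e. the first column should be stored as int.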

My expected array:

892 3 "Kelly, Mr. James" male 34.5 0 0 330911 7.8292 NaN Q
893 3 "Wilkes, Mrs. James (Ellen Needs)" female 47 1 0 363272 7 NaN S
894 2 "Myles, Mr. Thomas Francis" male 62 0 0 240276 9.6875 NaN Q
895 3 "Wirz, Mr. Albert" male 27 0 0 315154 8.6625 NaN S
896 3 "Hirvonen, Mrs. Alexander (Helga E Lindqvist)" female 22 1 1 3101298 12.2875 NaN S
897 3 "Svensson, Mr. Johan Cervin" male 14 0 0 7538 9.225 NaN S
As you can see, NaN or an equivalent should be placed wherever data is not available.

How should I read the CSV file?

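I'm not sure I understood you completely, but I think this will work for you.

I implemented two additional functions that decide whether a string is a float or an integer.

If the string is empty, I store None, but you can change that to whatever you like.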

import csv

def isfloat(x):
    """Return True if the string x can be parsed as a float."""
    try:
        float(x)
    except ValueError:
        return False
    return True

def isint(x):
    """Return True if the string x represents a whole number."""
    try:
        a = float(x)
        b = int(a)
    except (ValueError, OverflowError):  # int() rejects nan and inf
        return False
    return a == b


with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skip the header row, as in the question

    data = []
    for row in csv_file_object:
        for index, cell in enumerate(row):
            if not cell:  # cell == ''
                row[index] = None  # you can change the value to whatever you like
            elif isint(cell):
                row[index] = int(float(cell))  # also handles strings such as '7.0'
            elif isfloat(cell):
                row[index] = float(cell)
        data.append(row)

print(data)
Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911, 7.8292, None, 'Q'],
 [893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47, 1, 0, 363272, 7, None, 'S'],
 [894, 2, 'Myles, Mr. Thomas Francis', 'male', 62, 0, 0, 240276, 9.6875, None, 'Q'],
 [895, 3, 'Wirz, Mr. Albert', 'male', 27, 0, 0, 315154, 8.6625, None, 'S'],
 [896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22, 1, 1, 3101298, 12.2875, None, 'S'],
 [897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14, 0, 0, 7538, 9.225, None, 'S']]
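
Because the column types are known in advance, the per-cell checks could also be replaced by one converter per column. The following is only a minimal sketch under that assumption: `CONVERTERS` and `convert_row` are hypothetical names, the type list is copied from the question, and two sample rows are read from an in-memory string rather than from `trainData.csv`.

```python
import csv
import io

# One converter per column, in the order listed in the question
# (assumption: every row has exactly these eleven columns).
CONVERTERS = [int, int, str, str, float, int, int, float, float, str, str]

def convert_row(row, missing=None):
    """Apply the matching converter to each cell; empty cells become `missing`."""
    return [conv(cell) if cell != '' else missing
            for conv, cell in zip(CONVERTERS, row)]

# Two rows from the question, held in memory so the sketch is self-contained.
sample = io.StringIO(
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# csv.reader handles the quoted commas in the third column.
data = [convert_row(row) for row in csv.reader(sample)]
print(data)
```

Passing `missing=float('nan')` to `convert_row` would store NaN instead of None for empty cells.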

You can do this more easily with the pandas library, as follows:

import pandas as pd

df = pd.read_csv("trainData.csv", dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str, 'col5': float, 'col6': int,
                                         'col7': int, 'col8': float, 'col9': float, 'col10': str, 'col11': str})
data = df.values.tolist()  # list of rows; empty cells come back as nan
print(data)
Output:

[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911.0, 7.8292, nan, 'Q'],
 [893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47.0, 1, 0, 363272.0, 7.0, nan, 'S'],
 [894, 2, 'Myles, Mr. Thomas Francis', 'male', 62.0, 0, 0, 240276.0, 9.6875, nan, 'Q'],
 [895, 3, 'Wirz, Mr. Albert', 'male', 27.0, 0, 0, 315154.0, 8.6625, nan, 'S'],
 [896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22.0, 1, 1, 3101298.0, 12.2875, nan, 'S'],
 [897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14.0, 0, 0, 7538.0, 9.225, nan, 'S']]
The CSV file should look like this, because the first row holds the column names:

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
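
If the file has no header row, as in the sample shown in the question, the column names can be supplied at read time instead of being added to the file. This is a rough sketch, assuming the same hypothetical names `col1` to `col11` and reading two sample rows from an in-memory string rather than from disk.

```python
import io
import pandas as pd

# Two rows from the question, without a header line.
raw = (
    '892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S\n'
)

# Hypothetical column names col1..col11, matching the dtype mapping above.
names = ['col%d' % i for i in range(1, 12)]
dtypes = {'col1': int, 'col2': int, 'col3': str, 'col4': str,
          'col5': float, 'col6': int, 'col7': int, 'col8': float,
          'col9': float, 'col10': str, 'col11': str}

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), header=None, names=names, dtype=dtypes)

rows = df.values.tolist()  # list of rows; empty cells come back as nan
print(df.dtypes)
print(rows)
```

One caveat with this approach: a column declared as `int` must not contain empty cells, because NaN cannot be stored in a plain integer column.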

You can read more about pandas in its documentation.

I assume you are using pandas, since the question is tagged pandas. Read the file like this:

import pandas as pd

df = pd.read_csv('test.txt', skiprows=0, index_col=0, 
            names='city_type name sex weight has_cat has_dog bank_balance body_fat_index car_mileage car_type'.split())
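This gives you a DataFrame with one row per record, indexed by the first column.

I took the liberty of making up names for the columns.

Once the data is in a DataFrame you can do all kinds of magic with it; have a look at the pandas tutorials (they are great). Here is an example: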

df.bank_balance.describe()

count          6.000000
mean      726408.166667
std      1170522.652019
min         7538.000000
25%       258995.500000
50%       323032.500000
75%       355181.750000
max      3101298.000000
Name: bank_balance, dtype: float64

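@Carenvandlee, did my answer not address your question?

You said, "I'm not sure I understood you completely, but I think this will work for you." It seems to work, but it takes too much code; I am looking for a lightweight solution. Besides, all the column types are known in advance, so your code does far more checking than it needs to.

pandas.read_csv('data.csv',dtypes=[int,int,str])
?

@mbatchkarov I don't know pandas. Can I use it to get the expected result as an array or matrix? Could you write up an answer with your approach?

@mbatchkarov Hey, how should I use it? And given that the first row is the header, how do I get at the first element, i.e. 892, in the pandas DataFrame? I tried df[0:0] and df[0][0], but both gave errors.

print(df.loc[0, 'col1'])
where 0 is the index label and col1 is the column name, or
print(df['col1'].values[0])

@carenvanderelee Thank you very much. I'm a bit confused, though. If the 0th element is data, i.e. 892, how do I get the header from df?

print(pd.DataFrame(df).columns)
will give you the columns, @carenvandlee.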
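
The lookups discussed in these comments can be sketched with the label-based `.loc` and position-based `.iloc` accessors, plus `to_numpy()` for the array form asked about. This is an illustrative sketch only: it assumes a file whose first row is a header, trimmed here to three hypothetical columns `col1` to `col3` and read from an in-memory string.

```python
import io
import pandas as pd

# A trimmed sample with a header row (column names are placeholders).
raw = (
    'col1,col2,col3\n'
    '892,3,"Kelly, Mr. James"\n'
    '893,3,"Wilkes, Mrs. James (Ellen Needs)"\n'
)
df = pd.read_csv(io.StringIO(raw))

first = df.loc[0, 'col1']     # label-based: row label 0, column 'col1'
same = df.iloc[0, 0]          # position-based: first row, first column
header = df.columns.tolist()  # the column names taken from the header row
matrix = df.to_numpy()        # 2-D object array, one row per record

print(first, same)
print(header)
print(matrix)
```

The header row never becomes row 0 of the data; it is kept separately in `df.columns`, which is why the first data value is 892.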