Python: manipulating a long-format CSV file with numpy or pandas

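I am trying to write a simple script that turns a CSV output file produced by a Fortran code into a pandas DataFrame object for further analysis. The CSV has two columns but consists of several appended blocks of data, each of shape [n, 2] (each sample name has the form RN_x). I came up with the code below, but the resulting DataFrame object does not allow any analysis. I have attached a sample file below (much shortened from the original). As an aside, the first column of the data file is a date, but in the output it is a number corresponding to a day in the simulation. Any suggestions would be much appreciated.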

import numpy as np
import pandas as pd
import csv as csv
readdata = csv.reader(open('C:/data/Test.csv', 'r'))
data = []
for row in readdata:
    data.append(row)
a = np.array(data).reshape(11,-1, order = 'F')
col = a[0,:4].reshape(4)
row = pd.Index(a[4:,0:1].reshape(7))
b = a[4:,5:]
df = pd.DataFrame(b, index = row, columns = col)
Sample:

RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
274,0.000013
366,0.000013
457,0.000013
548,0.000013
RN_9288,
1,Observed
1,0
259,Computed
1,0.000011
91,0.000011
182,0.000011
274,0.000011
366,0.000011
457,0.000011
548,0.000011
RN_10955,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
Sample output:

Index,RN_48865,RN_7445,RN_9288,RN_10955
1,0.000014,0.000013,0.000011,0.000014
91,0.000014,0.000013,0.000011,0.000014
182,0.000014,0.000013,0.000011,0.000014
274,0.000014,0.000013,0.000011,0.000014
366,0.000014,0.000013,0.000011,0.000014
457,0.000014,0.000013,0.000011,0.000014
548,0.000014,0.000013,0.000011,0.000014

You are actually asking several questions. Here is what I can gather from the desired output:

source="""RN_48865,
    1,Observed
    1,0
    259,Computed
    1,0.000014
    91,0.000014
    182,0.000014
    274,0.000014
    366,0.000014
    457,0.000014
    548,0.000014
    RN_7445,
    1,Observed
    1,0
    259,Computed
    1,0.000013
    91,0.000013
    182,0.000013
    274,0.000013
    366,0.000013
    457,0.000013
    548,0.000013
    RN_9288,
    1,Observed
    1,0
    259,Computed
    1,0.000011
    91,0.000011
    182,0.000011
    274,0.000011
    366,0.000011
    457,0.000011
    548,0.000011
    RN_10955,
    1,Observed
    1,0
    259,Computed
    1,0.000014
    91,0.000014
    182,0.000014
    274,0.000014
    366,0.000014
    457,0.000014
    548,0.000014
"""
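import io
import numpy as np
import pandas as pd

# Read everything as text: column 0 mixes block labels and day numbers.
df = pd.read_csv(io.StringIO(source), header=None, dtype=str)
df[0] = df[0].str.strip()
# Row positions where each RN_x block starts; blocks are assumed equal in length.
rns = np.where(df[0].str.startswith('RN_'))[0]
length = rns[1] - rns[0]
# Each block has 4 header rows (label, Observed, "1,0", Computed) before the data.
index = df[0].iloc[4:length].astype(int).values
cols = df[0].iloc[::length].values
result_df = pd.DataFrame(index=index)
for col_num, col_start in enumerate(range(0, len(df), length)):
    block = df[1].iloc[col_start + 4:col_start + length]
    result_df[cols[col_num]] = block.astype(float).values
print(result_df)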
Output:

     RN_48865   RN_7445   RN_9288  RN_10955
1    0.000014  0.000013  0.000011  0.000014
91   0.000014  0.000013  0.000011  0.000014
182  0.000014  0.000013  0.000011  0.000014
274  0.000014  0.000013  0.000011  0.000014
366  0.000014  0.000013  0.000011  0.000014
457  0.000014  0.000013  0.000011  0.000014
548  0.000014  0.000013  0.000011  0.000014
For the dates, use:

import pandas as pd

df = pd.read_csv('file', header=None)
# Column 0 is assumed to hold numeric day offsets from the base date.
df[0] = pd.Timestamp('1995-01-01') + pd.to_timedelta(df[0], unit='D')
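
The same idea can be applied to the index of the wide table shown in the output above. This is only a minimal sketch: the helper name `days_to_dates` is made up for illustration, the base date 1995-01-01 is taken from the comments, and it assumes that day 0 falls on the base date (shift by one if day 1 should be the base date).

```python
import pandas as pd

def days_to_dates(days, base="1995-01-01"):
    """Convert simulation day numbers to dates by offsetting from a base date.

    Hypothetical helper; assumes day 0 falls on `base`.
    """
    return pd.Timestamp(base) + pd.to_timedelta(list(days), unit="D")

# A small stand-in for the wide table (two locations, two days).
wide = pd.DataFrame(
    {"RN_48865": [0.000014, 0.000014], "RN_7445": [0.000013, 0.000013]},
    index=[1, 91],
)
wide.index = days_to_dates(wide.index)
print(wide)
```

With a DatetimeIndex in place, date-based selection and resampling become available on the table.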

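So what is the question? Sorry, it is not clear.

How do I convert the long file into a DataFrame object with an index (parsed dates obtained by adding the number to a base date, e.g. 1995-1-1; the first data column) and multiple columns, where the data from the second column fills columns labelled with the "RN_x" labels? The original long file has repeating blocks of data that represent the output at different "locations" within one computation. I would like to be able to analyse statistics for each location.

I don't understand "multiple columns populated with data from the second column, with the 'RN_x' labels as column labels." Why not simply show the data (using \n's)?

Can I email the file to you?

It would probably be clearer if you showed us the desired output, as well as the exact input including whitespace characters.

Thanks, that helps with one part. User cyborg pointed out that I was not clear, which I agree with.

Thank you very much. It looks like I was heading in the wrong direction. Still a lot to learn.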
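
For the per-location statistics mentioned in the comments, one possible variation is to parse the file line by line, which avoids assuming that every block has the same length. This is a sketch under stated assumptions, not the answerer's method: the function name `read_blocks` is hypothetical, each block is assumed to start with an `RN_` label line, and only the rows after the `Computed` marker are kept.

```python
import io
import pandas as pd

def read_blocks(handle):
    """Parse 'RN_x' blocks into a wide DataFrame (one column per location).

    Hypothetical helper. Assumes each block starts with an 'RN_' label line
    and that the wanted values are the rows after the 'Computed' marker.
    """
    columns = {}
    label, in_computed = None, False
    for raw in handle:
        first, _, second = raw.strip().partition(",")
        if not first:
            continue  # skip blank lines
        if first.startswith("RN_"):
            # A new block begins: remember its label, reset the state.
            label, in_computed = first, False
            columns[label] = {}
        elif second == "Computed":
            in_computed = True
        elif in_computed and label is not None:
            # Data row: day number -> computed value for this location.
            columns[label][int(first)] = float(second)
    return pd.DataFrame(columns)

sample = """RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
"""
wide = read_blocks(io.StringIO(sample))
print(wide)
print(wide.describe())  # summary statistics, one column per location
```

Because the values are stored as floats rather than strings, `describe()` and other numeric methods work on each `RN_x` column directly.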