How to convert a CSV file with an odd delimiter into a DataFrame in Python using the pandas library

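I am trying to convert a CSV file into a DataFrame with Python, but the delimiter is causing problems.

The CSV file is the output of a piece of software that lays the data out along a row, with fields mostly separated by '","'.

The CSV file contains only two lines. The first is: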

Date,"  2015-01-30","   2015-01-31","   2015-02-01","   2015-02-02","   2015-02-03","   2015-02-04","   2015-02-05","   2015-02-06","
The second is:

Amount,"14000.030000000002","13500.650000000001","26200.15000000001","33000.38000000002","38000.31000000003","29000.670000000013","29000.920000000016","31000.360000000015"
This is the code I have written so far:

data = pd.read_csv("csv_file_one_line.csv", sep = '","' , engine = 'python')
data.stack(level=0)
This is the result I get:

0  "Date,""\t2019-01-30"    "Amount,""14000.030000000002"
   "\t2019-01-31"                    "13500.650000000001"
   "\t2019-02-01"                     "26200.15000000001"
   "\t2019-02-02"                     "33000.38000000002"
   "\t2019-02-03"                     "38000.31000000003"
   "\t2019-02-04"                    "29000.670000000013"
   "\t2019-02-05"                    "29000.920000000016"
   "\t2019-02-06"                  "31000.360000000015"""
dtype: object

The expected result is a clean DataFrame with two columns, headed Date and Amount. From there, I will build a forecasting model using ARIMA.

In this line:
data = pd.read_csv("csv_file_one_line.csv", sep = '","' , engine = 'python')
you are splitting on '","' rather than on a plain ','.


Just use a comma as the separator, rather than quote-comma-quote.

This data would be a normal CSV file, but the problem is the ," at the end of the first line.
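To see why that dangling quote matters, here is a minimal sketch using only the standard-library csv module (not pandas itself, so treat it as an illustration of quoting rules rather than of pandas internals). The two lines are shortened copies of the question's data, and strip_dangling_quote is a hypothetical helper name.

```python
import csv

# Shortened copies of the two lines from the question (assumed sample data).
broken_lines = [
    'Date,"  2015-01-30","   2015-01-31","\n',
    'Amount,"14000.030000000002","13500.650000000001"\n',
]


def strip_dangling_quote(line):
    """Hypothetical helper: drop a trailing ',"' left at the end of a line."""
    line = line.rstrip("\n")
    return line[:-2] if line.endswith(',"') else line


# The quote at the end of line 1 opens a quoted field that is not closed on
# that line, so the parser keeps reading into line 2 and merges the records.
broken_rows = list(csv.reader(broken_lines))

# With the trailing ',"' removed, each line is parsed as its own record.
fixed_rows = list(csv.reader(strip_dangling_quote(line) for line in broken_lines))

print(len(broken_rows))  # 1
print(len(fixed_rows))   # 2
```

With the stray quote in place the two lines collapse into a single record, which is why the cleanup step comes before any call to read_csv.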

So I read it as plain text and remove the ," from the first line, which gives me a correct CSV file:

with open('input.csv') as f1, open('output.txt', 'w') as f2:
    for row in f1:
        row = row.rstrip('\n')
        if row.endswith(',"'):
            row = row[:-2]
        f2.write(row+'\n')
After that, I can read it using the standard settings:

df = pd.read_csv("output.txt", header=None)
But that gives:

        0                   1  ...                   7                   8
0    Date          2015-01-30  ...          2015-02-05          2015-02-06
1  Amount  14000.030000000002  ...  29000.920000000016  31000.360000000015

[2 rows x 9 columns]
After transposing:

df = df.T
I get:

                0                   1
0           Date              Amount
1     2015-01-30  14000.030000000002
2     2015-01-31  13500.650000000001
3     2015-02-01   26200.15000000001
4     2015-02-02   33000.38000000002
5     2015-02-03   38000.31000000003
6     2015-02-04  29000.670000000013
7     2015-02-05  29000.920000000016
8     2015-02-06  31000.360000000015
After that, I can use the first row as the column names and drop that row:

df = df.rename(columns=df.iloc[0])
df = df.drop(0)
Result:

            Date              Amount
1     2015-01-30  14000.030000000002
2     2015-01-31  13500.650000000001
3     2015-02-01   26200.15000000001
4     2015-02-02   33000.38000000002
5     2015-02-03   38000.31000000003
6     2015-02-04  29000.670000000013
7     2015-02-05  29000.920000000016
8     2015-02-06  31000.360000000015
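The steps in this answer can also be combined into a single in-memory pass that avoids the intermediate output.txt and converts the columns to proper types. This is only a sketch under assumptions: the data is a shortened copy of the question's file, load_row_oriented_csv is a hypothetical helper name, and the type conversion at the end is an addition that the answer itself does not include.

```python
import io

import pandas as pd

# Shortened copy of the question's data (assumed sample; the real file is longer).
raw = (
    'Date,"  2015-01-30","   2015-01-31","   2015-02-01","\n'
    'Amount,"14000.030000000002","13500.650000000001","26200.15000000001"\n'
)


def load_row_oriented_csv(text):
    """Hypothetical helper: clean, read, transpose, and type the data."""
    cleaned = []
    for row in text.splitlines():
        if row.endswith(',"'):  # same cleanup rule as in the answer
            row = row[:-2]
        cleaned.append(row)

    # dtype=str keeps every cell as text until after the transpose.
    df = pd.read_csv(io.StringIO("\n".join(cleaned)), header=None, dtype=str)
    df = df.T

    # Promote the first row to column names, then drop it.
    df.columns = df.iloc[0]
    df = df.drop(0).reset_index(drop=True)
    df.columns.name = None

    # Assumed extra step: dates carry leading spaces, amounts are text.
    df["Date"] = pd.to_datetime(df["Date"].str.strip(), format="%Y-%m-%d")
    df["Amount"] = pd.to_numeric(df["Amount"])
    return df


df = load_row_oriented_csv(raw)
print(df.dtypes)
```

Having Date as a datetime column and Amount as a float column is usually what a time-series model such as ARIMA expects as input, which is why the conversion is included here.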

Using a more complex regular expression:

df=pd.read_csv(ff,sep=r',"\s*|","\s*',engine="python")


     Date  2015-01-30  2015-01-31  2015-02-01  ...  2015-02-04  2015-02-05           2015-02-06  Unnamed: 9
0  Amount    14000.03    13500.65    26200.15  ...    29000.67    29000.92  31000.360000000015"         NaN

[1 rows x 10 columns]
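The extra "Unnamed: 9" column and the stray quote on the last amount both follow from how this pattern splits each line. The sketch below applies the same pattern with re.split to shortened copies of the two lines; it approximates the splitting outside pandas and is not a trace of what read_csv does internally.

```python
import re

# Same pattern as the sep argument in the read_csv call above.
pattern = re.compile(r',"\s*|","\s*')

# Shortened copies of the question's two lines (assumed sample data).
header_line = 'Date,"  2015-01-30","   2015-01-31","'
amount_line = 'Amount,"14000.030000000002","13500.650000000001"'

# The trailing ',"' on the header line is itself a separator match, which
# leaves one empty field at the end (hence an extra, unnamed column).
header_fields = pattern.split(header_line)

# The closing quote of the last amount is never followed by ',"', so it
# stays attached to the value.
amount_fields = pattern.split(amount_line)

print(header_fields)  # ['Date', '2015-01-30', '2015-01-31', '']
print(amount_fields)  # ['Amount', '14000.030000000002', '13500.650000000001"']
```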


df= df.drop(columns="Date").stack().droplevel(0).reset_index()


        index                    0
0  2015-01-30                14000
1  2015-01-31              13500.6
2  2015-02-01              26200.2
3  2015-02-02              33000.4
4  2015-02-03              38000.3
5  2015-02-04              29000.7
6  2015-02-05              29000.9
7  2015-02-06  31000.360000000015"

df.columns=["Date","Amount"]
df.iloc[-1,-1]= np.float64(df.iloc[-1,-1].rstrip('"'))

         Date              Amount
0  2015-01-30               14000
1  2015-01-31             13500.6
2  2015-02-01             26200.2
3  2015-02-02             33000.4
4  2015-02-03             38000.3
5  2015-02-04             29000.7
6  2015-02-05             29000.9
7  2015-02-06             31000.4
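Edit: Alternatively, read the file, replace every comma and double quote with a space, and then read the data through StringIO. The separator is now just whitespace.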

import io, re
import pandas as pd

with open("data.csv") as ff:
   text=re.sub(r'[,"]',' ', ff.read())

f2= io.StringIO(text)
df=pd.read_csv(f2, sep=r'\s+',engine="python")

Date  2015-01-30  2015-01-31  2015-02-01  2015-02-02  2015-02-03  2015-02-04  2015-02-05  2015-02-06
0  Amount    14000.03    13500.65    26200.15    33000.38    38000.31    29000.67    29000.92    31000.36

etc.
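For completeness, here is one way the remaining steps of the whitespace variant might look, reshaping the single wide row into Date and Amount columns. It is a sketch under assumptions: the data is a shortened copy of the question's file, each line is stripped explicitly as a precaution, and the set_index/transpose reshaping is one possible choice rather than the answer's own continuation.

```python
import io
import re

import pandas as pd

# Shortened copy of the question's data (assumed sample; the real file is longer).
raw = (
    'Date,"  2015-01-30","   2015-01-31","   2015-02-01","\n'
    'Amount,"14000.030000000002","13500.650000000001","26200.15000000001"\n'
)

# Replace commas and quotes with spaces, then trim each line so that no
# leading or trailing whitespace is left around the fields.
text = "\n".join(re.sub(r'[,"]', " ", line).strip() for line in raw.splitlines())

# The first line becomes the header, so the dates end up as column names.
wide = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python")

# Turn the single wide row into one row per date.
long = wide.set_index("Date").T.reset_index()
long.columns = ["Date", "Amount"]
long["Date"] = pd.to_datetime(long["Date"], format="%Y-%m-%d")

print(long)
```

Because no quote characters are left in the text, the amounts are parsed as floats directly and no cleanup of the last cell is needed.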

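This is a normal CSV file; the delimiter is ',' rather than '","'. CSV sometimes wraps values in double quotes, but that is normal and pandas should read it with the standard settings.

Actually, the delimiter looks like it is just a comma, with most (but not all) of the values enclosed in quotes. How was this CSV created? The rows and columns appear to have been transposed.

The only problem is the ," at the end of the first line. So you can first read the file as plain text, remove the ," from the first line, save it as a plain text file, and then read it with read_csv using the standard delimiter.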