How to convert a CSV file with odd delimiters into a DataFrame in Python using pandas
I'm trying to convert a CSV file into a DataFrame using Python, but the delimiters are causing problems. The CSV file is the output of a piece of software that arranges the data in a single row, separated mostly by ",". There are only two rows in the CSV file. The first is:
Date," 2015-01-30"," 2015-01-31"," 2015-02-01"," 2015-02-02"," 2015-02-03"," 2015-02-04"," 2015-02-05"," 2015-02-06","
The second is:
Amount,"14000.030000000002","13500.650000000001","26200.15000000001","33000.38000000002","38000.31000000003","29000.670000000013","29000.920000000016","31000.360000000015"
This is the code I have written so far:
import pandas as pd

data = pd.read_csv("csv_file_one_line.csv", sep = '","' , engine = 'python')
data.stack(level=0)
This is the result I get:
0 "Date,""\t2019-01-30" "Amount,""14000.030000000002"
"\t2019-01-31" "13500.650000000001"
"\t2019-02-01" "26200.15000000001"
"\t2019-02-02" "33000.38000000002"
"\t2019-02-03" "38000.31000000003"
"\t2019-02-04" "29000.670000000013"
"\t2019-02-05" "29000.920000000016"
"\t2019-02-06" "31000.360000000015"""
dtype: object
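The expected result is a clean DataFrame with two columns headed Date and Amount. From there I will build a forecasting model using ARIMA.

In this line:
data = pd.read_csv("csv_file_one_line.csv", sep = '","' , engine = 'python')
you are splitting on the three-character sequence "," (quote, comma, quote) rather than on a plain comma.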
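To see why that separator leaves stray quotes behind, here is a small sketch that splits a shortened, made-up stand-in for the second row with plain string tools. It only illustrates the effect; the pandas python engine may differ in detail.

```python
import re

# Shortened stand-in for the second row of the file (values are illustrative).
line = 'Amount,"14000.03","13500.65","26200.15"'

# Splitting on the three-character sequence "," leaves the label glued to the
# first value and a stray quote on the last one.
by_quote_comma_quote = re.split('","', line)

# Splitting on a bare comma separates every field, though the quotes remain
# because nothing is interpreting them as CSV quoting here.
by_comma = line.split(",")

print(by_quote_comma_quote)
print(by_comma)
```

The first split matches the shape of the result shown in the question, where the row label and the first value end up in the same cell.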
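Just use a comma, not the quote marks plus a comma.

This data would be a normal CSV file; the problem is the ," at the end of the first line. So I read the file as plain text, remove the ," from the first line, and write out a correct CSV file:
with open('input.csv') as f1, open('output.txt', 'w') as f2:
    for row in f1:
        row = row.rstrip('\n')
        # Drop the dangling ," that ends the first line
        if row.endswith(',"'):
            row = row[:-2]
        f2.write(row+'\n')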
After that I can read it with the standard settings:
df = pd.read_csv("output.txt", header=None)
but that gives
0 1 ... 7 8
0 Date 2015-01-30 ... 2015-02-05 2015-02-06
1 Amount 14000.030000000002 ... 29000.920000000016 31000.360000000015
[2 rows x 9 columns]
After transposing
df = df.T
I get
0 1
0 Date Amount
1 2015-01-30 14000.030000000002
2 2015-01-31 13500.650000000001
3 2015-02-01 26200.15000000001
4 2015-02-02 33000.38000000002
5 2015-02-03 38000.31000000003
6 2015-02-04 29000.670000000013
7 2015-02-05 29000.920000000016
8 2015-02-06 31000.360000000015
After that I can use the first row as the column names and drop that row:
df = df.rename(columns=df.iloc[0])
df = df.drop(0)
Result:
Date Amount
1 2015-01-30 14000.030000000002
2 2015-01-31 13500.650000000001
3 2015-02-01 26200.15000000001
4 2015-02-02 33000.38000000002
5 2015-02-03 38000.31000000003
6 2015-02-04 29000.670000000013
7 2015-02-05 29000.920000000016
8 2015-02-06 31000.360000000015
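The steps above can also be combined in memory, without an intermediate output file. The following is a minimal sketch under a few assumptions: the sample text is a shortened copy of the two rows from the question, load_transposed is a hypothetical helper name, and the final type conversion is an extra step not in the answer above, added because a forecasting model will probably need real dates and floats.

```python
import io
import pandas as pd

# Shortened copy of the two rows from the question (three values per row).
raw = (
    'Date," 2015-01-30"," 2015-01-31"," 2015-02-01","\n'
    'Amount,"14000.030000000002","13500.650000000001","26200.15000000001"\n'
)

def load_transposed(text):
    """Hypothetical helper: clean the text, parse it, and return a tidy frame."""
    # Drop the dangling ," from any line that ends with it.
    lines = [line[:-2] if line.endswith(',"') else line
             for line in text.splitlines()]
    # Use the first field of each row as the row label, then transpose so
    # that Date and Amount become the column names.
    df = pd.read_csv(io.StringIO("\n".join(lines)), header=None, index_col=0).T
    df.columns.name = None
    df = df.reset_index(drop=True)
    # The values arrive as strings; the dates carry a leading space.
    df["Date"] = pd.to_datetime(df["Date"].str.strip())
    df["Amount"] = df["Amount"].astype(float)
    return df

df = load_transposed(raw)
print(df)
```

Passing index_col=0 makes the row labels the index, so the transpose yields the column names directly and the separate rename and drop steps are not needed.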
Using a more complex regular expression:
df=pd.read_csv(ff,sep=r',"\s*|","\s*',engine="python")
Date 2015-01-30 2015-01-31 2015-02-01 ... 2015-02-04 2015-02-05 2015-02-06 Unnamed: 9
0 Amount 14000.03 13500.65 26200.15 ... 29000.67 29000.92 31000.360000000015" NaN
[1 rows x 10 columns]
df= df.drop(columns="Date").stack().droplevel(0).reset_index()
index 0
0 2015-01-30 14000
1 2015-01-31 13500.6
2 2015-02-01 26200.2
3 2015-02-02 33000.4
4 2015-02-03 38000.3
5 2015-02-04 29000.7
6 2015-02-05 29000.9
7 2015-02-06 31000.360000000015"
import numpy as np

df.columns=["Date","Amount"]
# Strip the trailing quote from the last value and convert it to a float
df.iloc[-1,-1]= np.float64(df.iloc[-1,-1].rstrip('"'))
Date Amount
0 2015-01-30 14000
1 2015-01-31 13500.6
2 2015-02-01 26200.2
3 2015-02-02 33000.4
4 2015-02-03 38000.3
5 2015-02-04 29000.7
6 2015-02-05 29000.9
7 2015-02-06 31000.4
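Edit:
Alternatively, read the file, replace every comma and double quote with a space, and then read the data through StringIO.
The only separator is now whitespace.
import io,re
with open("data.csv") as ff:
    text=re.sub(r'[,"]',' ', ff.read())
f2= io.StringIO(text)
df=pd.read_csv(f2, sep=r'\s+',engine="python")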
Date 2015-01-30 2015-01-31 2015-02-01 2015-02-02 2015-02-03 2015-02-04 2015-02-05 2015-02-06
0 Amount 14000.03 13500.65 26200.15 33000.38 38000.31 29000.67 29000.92 31000.36
etc.
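One way the remaining steps might look is sketched below. It is a variation rather than the exact continuation of the code above: it assumes a shortened in-memory copy of the file, and it reads with header=None and index_col=0 so that a single transpose produces the two columns.

```python
import io
import re
import pandas as pd

# Shortened copy of the two rows from the question (three values per row).
raw = (
    'Date," 2015-01-30"," 2015-01-31"," 2015-02-01","\n'
    'Amount,"14000.030000000002","13500.650000000001","26200.15000000001"\n'
)

# Replace commas and double quotes with spaces so that whitespace is the
# only separator left in the text.
text = re.sub(r'[,"]', " ", raw)

# The first field of each row becomes its label; no row is used as a header.
wide = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None,
                   index_col=0, engine="python")

# Transpose so that Date and Amount become columns.
tidy = wide.T.reset_index(drop=True)
tidy.columns.name = None
tidy["Amount"] = tidy["Amount"].astype(float)

print(tidy)
```

This relies on none of the values containing spaces of their own, which holds for the data shown in the question.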
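This is a normal CSV file; the separator is , and not ,". CSV sometimes puts values inside quotes, but that is normal and pandas should read it with the standard settings.

Actually, the separator looks like just a comma, with most (but not all) of the values enclosed in quotes. How was this CSV created? The rows and columns seem to have been transposed.

The only problem is the ," at the end of the first line. So you can first read the file as plain text, remove the ," from the first line, save it as a plain text file, and then use read_csv with the standard separator.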
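The point that this is ordinary comma-separated data with quoted values can be checked with the standard library alone. The sketch below assumes a shortened in-memory copy of the file; it uses the csv module instead of pandas and zip to undo the transposition.

```python
import csv
import io

# Shortened copy of the two rows from the question (three values per row).
raw = (
    'Date," 2015-01-30"," 2015-01-31"," 2015-02-01","\n'
    'Amount,"14000.030000000002","13500.650000000001","26200.15000000001"\n'
)

# Remove the dangling ," from any line that ends with it.
cleaned = "\n".join(
    line[:-2] if line.endswith(',"') else line for line in raw.splitlines()
)

# csv.reader splits on a plain comma and removes the surrounding quotes.
rows = list(csv.reader(io.StringIO(cleaned)))

# zip(*rows) turns the two long rows into (Date, Amount) pairs.
header, *records = zip(*rows)
records = [(date.strip(), float(amount)) for date, amount in records]

print(header)
print(records)
```

The resulting header and records could then be handed to the pandas DataFrame constructor if a DataFrame is needed.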