Python: how to replace commas with tabs in a TSV file
In the DataFrame below, I am trying to replace the commas in the column
curv_typ,maturity,bonds,geo\time
with tabs, so that I can create new columns from it.
curv_typ,maturity,bonds,geo\time 2015M06D16 2015M06D15 2015M06D11 \
0 PYC_RT,Y1,GBAAA,EA -0.24 -0.24 -0.24
1 PYC_RT,Y1,GBA_AAA,EA -0.02 -0.03 -0.10
2 PYC_RT,Y10,GBAAA,EA 0.94 0.92 0.99
3 PYC_RT,Y10,GBA_AAA,EA 1.67 1.70 1.60
4 PYC_RT,Y11,GBAAA,EA 1.03 1.01 1.09
My code is shown below, but it does not get rid of the commas, and that is where I am stuck.
import os
import urllib2
import gzip
import StringIO
import pandas as pd
baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())
#Now have to deal with tsv file
import csv
outFilePath = filename.split('/')[1][:-3] #As in the code above, just put here for reference
csvout = 'C:\Users\Sidney\ECB.tsv'
outfile = open(csvout, "w")
with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close()
df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)
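Two details in the loop above would explain why the commas survive: `for line in f.read()` walks over single characters rather than lines, and `str.replace` returns a new string instead of modifying `line`, so its result is thrown away. A minimal Python 3 sketch of the same idea with both points addressed follows. The helper name `commas_to_tabs` and the in-memory sample are assumptions made for illustration, not part of the original code.

```python
import io

def commas_to_tabs(src, dst):
    """Copy text from src to dst, replacing every comma with a tab."""
    for line in src:                        # iterating a file object yields lines
        dst.write(line.replace(",", "\t"))  # write the string that replace() returns

# Small in-memory sample standing in for the downloaded file (assumed layout).
sample = io.StringIO(
    "curv_typ,maturity,bonds,geo\\time\t2015M06D16\t2015M06D15\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.24\t-0.24\n"
)
converted = io.StringIO()
commas_to_tabs(sample, converted)
result = converted.getvalue()
print(result)
```

With real files, the same function can be given two handles opened in text mode. Note also that `pd.DataFrame.from_csv` is no longer available in current pandas releases; `pd.read_csv(path, sep="\t")` is the usual way to read the result.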
Thanks.
Split the column name to generate the new column names, then call the vectorized str.split method with the parameter expand=True:
In [26]:
cols = 'curv_typ,maturity,bonds,geo\\time'.split(',')
df[cols] = df['curv_typ,maturity,bonds,geo\\time'].str.split(',', expand=True)
df
Out[26]:
curv_typ,maturity,bonds,geo\time 2015M06D16 2015M06D15 2015M06D11 \
0 PYC_RT,Y1,GBAAA,EA -0.24 -0.24 -0.24
1 PYC_RT,Y1,GBA_AAA,EA -0.02 -0.03 -0.10
2 PYC_RT,Y10,GBAAA,EA 0.94 0.92 0.99
3 PYC_RT,Y10,GBA_AAA,EA 1.67 1.70 1.60
4 PYC_RT,Y11,GBAAA,EA 1.03 1.01 1.09
curv_typ maturity bonds geo\time
0 PYC_RT Y1 GBAAA EA
1 PYC_RT Y1 GBA_AAA EA
2 PYC_RT Y10 GBAAA EA
3 PYC_RT Y10 GBA_AAA EA
4 PYC_RT Y11 GBAAA EA
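For reference, the split can be tried end to end on a small sample without downloading anything. The sketch below assumes a recent pandas version and a two-row sample that mimics the Eurostat layout. Reading the packed column's name from `df.columns[0]` and dropping that column afterwards are illustrative choices, not part of the answer above.

```python
import io
import pandas as pd

# Assumed sample mimicking the Eurostat layout: one comma-packed key column,
# followed by tab-separated value columns.
raw = (
    "curv_typ,maturity,bonds,geo\\time\t2015M06D16\t2015M06D15\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.24\t-0.24\n"
    "PYC_RT,Y10,GBA_AAA,EA\t1.67\t1.70\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

packed = df.columns[0]        # name of the comma-packed column
cols = packed.split(",")      # one new column name per packed field
df[cols] = df[packed].str.split(",", expand=True)
df = df.drop(columns=packed)  # the packed column is no longer needed
print(df)
```

Taking the name from `df.columns[0]` avoids retyping the header, including its backslash, by hand.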
Edit
For pandas version 0.16.0 and older, use the following line instead:
df[cols] = df['curv_typ,maturity,bonds,geo\\time'].str.split(',').apply(pd.Series)
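I had the same problem, with data downloaded from Eurostat that has the same structure. I tried @EdChum's solution but could not get it to work in one go, so I needed a few further steps: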
vc.head() # The original DataFrame
Out[150]:
expend,unit,geo\time 2015 2014 2013 2012 2011 2010 2009 \
0 INV,MIO_EUR,AT 109 106.0 86.0 155.0 124.0 130.0 140.0
1 INV,MIO_EUR,BE 722 664.0 925.0 522.0 590.0 476.0 1018.0
2 INV,MIO_EUR,BG 16 1.0 2.0 65.0 11.0 5.0 6.0
3 INV,MIO_EUR,CH 640 1237.0 609.0 662.0 640.0 1555.0 718.0
4 INV,MIO_EUR,CZ 13 14.0 24.0 17.0 193.0 37.0 61.0
cols = 'expend,unit,geo\\time'.split(',') # Getting the column names (backslash escaped so that \t is not read as a tab)
clean = vc.iloc[:,0].str.split(',').apply(pd.Series) # Creating a clean version
clean = clean.rename(columns = lambda x: cols[x]) # Adding the column names to the clean version
vc = pd.concat([clean, vc.iloc[:,1:]], axis = 1) # Concatenating the two tables
vc.head()
Out[155]:
expend unit geo\time 2015 2014 2013 2012 2011 2010 2009 \
0 INV MIO_EUR AT 109 106.0 86.0 155.0 124.0 130.0 140.0
1 INV MIO_EUR BE 722 664.0 925.0 522.0 590.0 476.0 1018.0
2 INV MIO_EUR BG 16 1.0 2.0 65.0 11.0 5.0 6.0
3 INV MIO_EUR CH 640 1237.0 609.0 662.0 640.0 1555.0 718.0
4 INV MIO_EUR CZ 13 14.0 24.0 17.0 193.0 37.0 61.0
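The steps above can also be wrapped in a small reusable function. This is only a sketch under stated assumptions: the function name `split_key_column` is hypothetical, the packed column is assumed to be the first one, and `expand=True` is used in place of `.apply(pd.Series)`, so it needs a pandas version that supports that parameter.

```python
import pandas as pd

def split_key_column(frame):
    """Return a new frame with the first, comma-packed column split out."""
    names = frame.columns[0].split(",")                  # new names from the header
    keys = frame.iloc[:, 0].str.split(",", expand=True)  # one column per field
    keys.columns = names
    return pd.concat([keys, frame.iloc[:, 1:]], axis=1)  # key columns come first

# Two rows copied from the table above, used as an assumed sample.
vc = pd.DataFrame({
    "expend,unit,geo\\time": ["INV,MIO_EUR,AT", "INV,MIO_EUR,BE"],
    "2015": [109, 722],
    "2014": [106.0, 664.0],
})
clean = split_key_column(vc)
print(clean)
```

Because the names are taken from the existing header, the backslash in `geo\time` never has to be typed in a string literal.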
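Checking this solution now, thanks. However, when I add
cols = 'curv_typ,maturity,bonds,geo\\time'.split(',') df[cols] = df[cols].str.split(',')
after the code above, I seem to get an error. Do you know why that might be?
Sorry, you have to split and add the new columns after loading the df. At which point does the error occur?
The error occurs on the following line: df[cols] = df['curv_typ,maturity,bonds,geo\\time'].str.split(',')
Try df[cols] = df[df.columns[0]].str.split(',')
Could you try upgrading your pandas version?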