Ignoring backslashes when reading a tsv file in Python
I have a large sep="|" tsv with an address field that contains values like the following:
...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up being read as:
line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...
I tried quote=2 in read_table with Pandas to turn non-numeric values into strings, but it still treats the backslash as a new line. What is an efficient way to ignore lines that have a value in a field containing a backslash escape to a new line? Is there a way to ignore the new line after the \?
Ideally it would prepare the data file so that it can be read into a DataFrame in pandas.
Update: 5 sample lines, with the breakage on the third line:
1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
I think you can first try read_csv with a sep value that does NOT occur in the values; it then seems to read correctly:
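import io

import pandas as pd

# Note: in a non-raw Python string literal, a backslash directly before a
# line break is a line continuation, so this literal already joins the two lines.
temp = u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)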
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|".
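A minimal sketch of that file round trip, assuming the two-column sample from above with the broken row already joined. The file name joined.csv and the temporary directory are hypothetical choices for illustration, not part of the original answer.

```python
import io
import os
import tempfile

import pandas as pd

# Sample with the broken row already joined, as in the output above.
temp = u"""49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia"""

# Step 1: read every line as a single column by using a separator
# that never occurs in the data.
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)

# Step 2: write that single column to a new file.
# "joined.csv" is a hypothetical name used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "joined.csv")
df.to_csv(path, header=False, index=False)

# Step 3: read the new file with the real separator.
df2 = pd.read_csv(path, sep="|", header=None)
print(df2)
```

This sketch assumes the data contains no commas. By default to_csv quotes any value that contains a comma, and a fully quoted line would then be read back as a single field even with sep="|".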
The next solution does not create a new file; instead it writes to the variable output and then reads that with read_csv and io.StringIO:
import pandas as pd
import io
temp=u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 49 XXX Ave|Australia
1 u7 38-46 South Street|Australia
2 XXX Margaret StreetNew South Wales|Australia
3 Po box ZZZ|Australia
output = df.to_csv(header=False, index=False)
print(output)
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia
print(pd.read_csv(io.StringIO(output), sep="|", header=None))
0 1
0 49 XXX Ave Australia
1 u7 38-46 South Street Australia
2 XXX Margaret StreetNew South Wales Australia
3 Po box ZZZ Australia
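If I test it with your data, it seems that rows 1 and 2 have 14 fields and the next two have 15 fields.
So I removed the last item from both rows (3 and 4); maybe it is only a typo (I hope so).
But if the data is correct, add the parameter names=range(15) to read_csv.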
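As a rough illustration of the names=range(15) suggestion, the sketch below reads the question's sample rows with the broken row already joined by hand. The variable names are chosen for this example only.

```python
import io

import pandas as pd

# Rows 1 and 2 have 14 fields, rows 3 and 4 have 15 fields.
# The broken row from the question is already joined into one line here.
data = u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
"""

# Naming 15 columns up front lets the parser accept rows with fewer fields;
# the missing trailing values are filled with NaN.
df = pd.read_csv(io.StringIO(data), sep="|", names=range(15))

print(df.shape)
print(df[14].tolist())
```

Without names, the parser infers the column count from the first row, so the later 15-field rows would not fit a 14-column layout.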
Here is another solution using regex:
import re

import pandas as pd

# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace a backslash followed by a newline with a single backslash
fl = re.sub(r'\\\n', r'\\', fl)

# Write the repaired text to a new file
with open('input_fix.tsv', 'w') as o:
    o.write(fl)

cols = range(1, 17)
# Prime the number of columns by specifying names for each column
# This takes care of the issue of variable number of columns
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This will produce the following result:
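The effect of the regex substitution can be checked on a small in-memory sample. This is only a sketch: it uses two rows from the question in a raw string (so the backslash before the line break survives, as it would in a file) and io.StringIO in place of the input.tsv and input_fix.tsv files.

```python
import io
import re

import pandas as pd

# Raw string: the backslash and the line break after it are both kept,
# which mimics the content of the dumped file.
raw = r"""1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
"""

# Replace "backslash + newline" with a single backslash, which joins the
# two halves of the broken record into one line.
fixed = re.sub(r'\\\n', r'\\', raw)

# Name 16 columns (1..16) so rows with fewer fields are padded with NaN.
df = pd.read_csv(io.StringIO(fixed), sep='|', names=range(1, 17))

print(df.shape)
print(df.loc[1, 10])
```

With this substitution the backslash stays inside the address field. Using an empty string as the replacement instead would remove it along with the line break.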
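Can you provide 3-4 sample lines from the tsv, along with the code you are currently running?
Sure, I added 4 sample lines showing what the tsv looks like, plus one line where the backslash breaks the line and the rest of that line continues on a new line.
Why would you put those newline characters in there? It makes no sense to me; just keep one newline per actual tsv row and you avoid the whole mess.
I didn't put it there, it's in the file I'm reading and I'm trying to ignore it :) It's a DB dump, and tsv madness is what I'm dealing with.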
# Listing for the first answer: the last item has been removed from rows 3 and 4
import io

import pandas as pd
temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
0
0 1788768|1831171|208434489|2014-08-14 13:40:02|...
1 1788772|1831177|202234489|2014-08-14 13:41:37|...
2 1788776|1831182|205234489|2014-08-14 13:42:41|...
3 1788780|1831186|202634489|2014-08-14 13:43:46|...
output = df.to_csv(header=False, index=False)
print(pd.read_csv(io.StringIO(output), sep="|", header=None))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13
0 3025 Melbourne
1 2116 Sydney
2 2000 Sydney
3 2444 NSW Other
# With the original, untrimmed rows 3 and 4 in temp, names=range(15) gives:
print(pd.read_csv(io.StringIO(output), sep="|", names=range(15)))
0 1 2 3 4 5 6 7 \
0 1788768 1831171 208434489 2014-08-14 13:40:02 108 c NaN Desktop
1 1788772 1831177 202234489 2014-08-14 13:41:37 108 c NaN iOS
2 1788776 1831182 205234489 2014-08-14 13:42:41 108 c NaN Desktop
3 1788780 1831186 202634489 2014-08-14 13:43:46 108 c NaN Desktop
8 9 10 11 \
0 coupon 49 XXX Ave Australia Victoria
1 NaN u7 38-46 South Street Australia New South Wales
2 NaN Level XXX Margaret Street(My Company) Australia New South Wales
3 NaN Po box ZZZ Australia New South Wales
12 13 14
0 3025 Melbourne NaN
1 2116 Sydney NaN
2 2000 Sydney Sydney
3 2444 NSW Other Port Macquarie