Ignore backslash when reading a tsv file in Python


I have a large sep="|" tsv with an address field that contains a number of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...
This ends up as:

line1)  ...xxx|yyy|Level 1 2 xxx Street\
line2)  (MYCompany)|...
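I tried running quote=2 in read_table with Pandas to turn non-numeric values into strings, but it still treats the backslash as a new line. What is an efficient way to ignore rows whose field values contain a backslash escaping to a new line? Is there a way to ignore the new line after \?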

Ideally this would prepare the data file so that it can be read into a DataFrame in pandas.

Update: 5 lines showing the breakage on the third row

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
I think you can first try read_csv with a sep that does NOT occur in the values; it seems to read correctly:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|":
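A minimal sketch of that file round trip, under stated assumptions: the sample rows are already one record per line, the output path is a hypothetical temporary file, and the data contains no commas or double quotes (to_csv would otherwise quote such values, which changes how the second read splits them).

```python
import io
import os
import tempfile

import pandas as pd

# Sample rows that are already intact (one record per line).
temp = """49  XXX  Ave|Australia
u7  38-46 South Street|Australia
Po box ZZZ|Australia"""

# Step 1: read every line as a single column by using a separator
# that does not occur anywhere in the data.
one_col = pd.read_csv(io.StringIO(temp), sep="^", header=None)

# Step 2: write that single column to a new file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "fixed.psv")
one_col.to_csv(path, header=False, index=False)

# Step 3: parse the new file with the real separator.
df = pd.read_csv(path, sep="|", header=None)
print(df)
```

The round trip itself does not change the rows; it only separates the step that decides what counts as a line from the step that splits fields.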

The next solution does not create a new file; it writes to the variable output and then reads it with read_csv via io.StringIO:

import pandas as pd
import io

temp=u"""
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                              0
0                        49  XXX  Ave|Australia
1              u7  38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                          Po box ZZZ|Australia

output = df.to_csv(header=False, index=False)
print(output)
49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia

print(pd.read_csv(io.StringIO(output), sep="|", header=None))
                                    0          1
0                        49  XXX  Ave  Australia
1              u7  38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales  Australia
3                          Po box ZZZ  Australia
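One caveat about the inline sample: inside a non-raw triple-quoted Python string, a backslash directly followed by a newline is a line continuation, so temp never contains the break that a file on disk has. The sketch below is an assumption-laden variant, not the answer's own code: it uses a raw string to mimic the file contents and removes the backslash-newline pair explicitly before parsing. The names raw, one_col and fixed are chosen for illustration.

```python
import io

import pandas as pd

# Raw string: the backslash and the newline both survive, as in a real file.
raw = r"""49  XXX  Ave|Australia
u7  38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""

# Without a fix, the broken record is still read as two separate rows.
one_col = pd.read_csv(io.StringIO(raw), sep="^", header=None)
print(len(one_col))

# Drop every backslash that is immediately followed by a newline,
# which joins the two halves of the broken record.
fixed = raw.replace("\\\n", "")
df = pd.read_csv(io.StringIO(fixed), sep="|", header=None)
print(df)
```

For a file on disk, the same replace can be applied to the text returned by reading the file, provided it fits in memory.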
If I test it with your data, it seems that rows 1 and 2 have 14 fields and the next two have 15 fields.

So I removed the last item from both rows (3 and 4); maybe it is only a typo (I hope so):

But if the data is correct, add the parameter names=range(15) to read_csv:
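A short sketch of what names does here, using two rows from the question (one with 14 fields, one with 15). Passing 15 column names tells the parser the full width up front, so the shorter row is padded with NaN instead of the longer row raising an error. The variable name data is an assumption for illustration.

```python
import io

import pandas as pd

# First row has 14 fields, second row has 15.
data = """1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie"""

# 15 explicit column names fix the width of the table in advance.
df = pd.read_csv(io.StringIO(data), sep="|", names=list(range(15)))

print(df.shape)
print(df[14].tolist())  # last column: missing for the short row
```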


Here is another solution, using regex:

import pandas as pd
import re

# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace a backslash followed by a newline with a single backslash
fl = re.sub('\\\\\n', '\\\\', fl)

# Write the repaired text to a new file
with open('input_fix.tsv', 'w') as o:
    o.write(fl)

cols = list(range(1, 17))
# Prime the number of columns by specifying names for each column
# This takes care of the issue of variable number of columns
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This produces a DataFrame in which the broken row is joined back into a single line.
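Since the question mentions a large file, a line-by-line variant may be preferable to reading everything into memory. The sketch below is one possible approach, not part of the answers above: join_continued_lines is a hypothetical helper that merges any line ending in a backslash with the line that follows it. Unlike the regex version, it drops the backslash rather than keeping it.

```python
import io

import pandas as pd


def join_continued_lines(lines):
    """Yield logical rows, merging a line that ends with a backslash
    with the following line (the backslash itself is dropped)."""
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith("\\"):
            pending += line[:-1]  # keep accumulating the broken record
        else:
            yield pending + line
            pending = ""
    if pending:  # input ended in the middle of a broken record
        yield pending


# Stand-in for an open file handle; the first record is split in two.
sample = io.StringIO(
    "a|Level XXX Margaret Street\\\n"
    "(My Company)|Australia\n"
    "b|Po box ZZZ|Australia\n"
)

rows = list(join_continued_lines(sample))
df = pd.read_csv(io.StringIO("\n".join(rows)), sep="|", header=None)
print(df)
```

With a real file, the generator can be fed an open file object and each yielded row written to a second file, so only one record is held in memory at a time.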


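Can you provide 3-4 sample lines from the tsv, along with the code you are currently running?
Sure, I added 4 sample lines showing what the tsv looks like, including one where the backslash breaks the row and the rest of the row continues on a new line.
Why would you add these newline characters in there? It makes no sense to me; just keep one newline per actual tsv row and you avoid the whole mess.
I didn't add it there, it's in the file I'm reading and I'm trying to ignore it :) It's a DB dump, and tsv madness is what I'm dealing with.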
# Sample data with the last item removed from rows 3 and 4
import pandas as pd

import io

temp=u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
                                                   0
0  1788768|1831171|208434489|2014-08-14 13:40:02|...
1  1788772|1831177|202234489|2014-08-14 13:41:37|...
2  1788776|1831182|205234489|2014-08-14 13:42:41|...
3  1788780|1831186|202634489|2014-08-14 13:43:46|...

output = df.to_csv(header=False, index=False)
print(pd.read_csv(io.StringIO(output), sep="|", header=None))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13  
0  3025  Melbourne  
1  2116     Sydney  
2  2000     Sydney  
3  2444  NSW Other  
# With the original 15-field rows 3 and 4 kept in temp, pass names=range(15)
print(pd.read_csv(io.StringIO(output), sep="|", names=range(15)))
        0        1          2                    3    4  5   6        7   \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c NaN  Desktop   
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c NaN      iOS   
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c NaN  Desktop   
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c NaN  Desktop   

       8                                      9          10               11  \
0  coupon                           49  XXX  Ave  Australia         Victoria   
1     NaN                 u7  38-46 South Street  Australia  New South Wales   
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales   
3     NaN                             Po box ZZZ  Australia  New South Wales   

     12         13              14  
0  3025  Melbourne             NaN  
1  2116     Sydney             NaN  
2  2000     Sydney          Sydney  
3  2444  NSW Other  Port Macquarie  