Ignoring backslashes when reading a TSV file in Python

python csv python-3.x pandas dataframe

I have a large TSV file with sep="|" that has an address field containing a bunch of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...

This ends up as:

line1)  ...xxx|yyy|Level 1 2 xxx Street\
line2)  (MYCompany)|...

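I tried running read_table in Pandas with quote=2 to turn non-numeric values into strings, but it still treats the backslash as a new line. What is an efficient way to ignore rows whose field values contain a backslash escape to a new line? Is there a way to ignore the new line after the backslash?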

Ideally, it would prepare the data file so that it can be read into a DataFrame in pandas.

Update: 5 lines showing the breakage on the third line:

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49  XXX  Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7  38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie

I think you can first try read_csv with a sep that does not occur in the values; it seems to read the data correctly:

import pandas as pd
import io

# Note: in a non-raw string literal, a backslash at the end of a line is a
# line continuation, so this sample contains neither the backslash nor the newline.
temp = u"""49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)
# Output:
#                                               0
# 0                          49 XXX Ave|Australia
# 1               u7 38-46 South Street|Australia
# 2  XXX Margaret StreetNew South Wales|Australia
# 3                          Po box ZZZ|Australia
Then you can create a new file with to_csv and read it back with read_csv and sep="|".
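The write-then-reread step can be sketched roughly as below. This is only an illustration under assumptions: the helper name `reread_with_pipe`, the file names and the sample rows are made up, and writing the intermediate file with the same unused separator (so that lines containing commas are not quoted) is a choice of this sketch rather than something stated in the answer. It also assumes the first pass already yields one record per row.

```python
import os
import tempfile

import pandas as pd


def reread_with_pipe(src_path, dst_path, first_sep="^"):
    """Read whole lines as one column, save them, then re-read split on '|'."""
    # first_sep must not occur anywhere in the data, so each line stays one field
    one_col = pd.read_csv(src_path, sep=first_sep, header=None)
    # Writing with the same unused separator avoids quoting lines that contain commas
    one_col.to_csv(dst_path, sep=first_sep, header=False, index=False)
    return pd.read_csv(dst_path, sep="|", header=None)


# Demo with hypothetical file names and data
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.tsv")
dst = os.path.join(workdir, "input_fixed.tsv")
with open(src, "w") as fh:
    fh.write("49 XXX Ave|Australia\nPo box ZZZ|Australia\n")

df_two_pass = reread_with_pipe(src, dst)
print(df_two_pass)
```

With default settings read_csv does not treat a backslash as an escape character, so a raw file that still contains a backslash followed by a newline would most likely need that pair removed before this step.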
The next solution does not create a new file; instead it writes to the variable output and then uses read_csv with io.StringIO:

import pandas as pd
import io

temp = u"""49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
# Output:
#                                               0
# 0                          49 XXX Ave|Australia
# 1               u7 38-46 South Street|Australia
# 2  XXX Margaret StreetNew South Wales|Australia
# 3                          Po box ZZZ|Australia

output = df.to_csv(header=False, index=False)
print(output)
# Output:
# 49 XXX Ave|Australia
# u7 38-46 South Street|Australia
# XXX Margaret StreetNew South Wales|Australia
# Po box ZZZ|Australia

print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
# Output:
#                                     0          1
# 0                          49 XXX Ave  Australia
# 1               u7 38-46 South Street  Australia
# 2  XXX Margaret StreetNew South Wales  Australia
# 3                          Po box ZZZ  Australia
If I test it on your data, it seems that rows 1 and 2 have 14 fields, while the next two have 15 fields.
So I removed the last item from both rows (3 and 4); maybe it is only a typo (I hope so).
But if the data are correct, add the parameter names=range(15) to read_csv.
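As a small illustration of why supplying column names helps with rows of unequal length, here is a sketch on made-up data (the sample string and variable names are hypothetical, not taken from the question). It counts the fields in each row and passes that many names, so shorter rows are padded with NaN instead of raising a parser error.

```python
import io

import pandas as pd

# Made-up sample: the first row has 3 fields, the second has 4
raw = "1|a|Sydney\n2|b|Sydney|Port Macquarie\n"

# Count fields per row to find the widest one
field_counts = [line.count("|") + 1 for line in raw.splitlines()]
width = max(field_counts)
print(field_counts)  # [3, 4]

# Supplying `width` column names lets shorter rows be padded with NaN
df_ragged = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=range(width))
print(df_ragged)
```

Counting separators this way assumes that "|" never appears inside a field value.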

Here is another solution using regex:

import pandas as pd
import re

# Read the whole file into memory
with open('input.tsv') as f:
    fl = f.read()

# Replace a backslash followed by a newline with a single backslash
fl = re.sub('\\\\\n', '\\\\', fl)

with open('input_fix.tsv', 'w') as o:
    o.write(fl)

# Prime the number of columns by specifying names for each column.
# This takes care of the issue of a variable number of columns.
cols = range(1, 17)
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)
This will produce the following result:

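Can you provide 3-4 sample lines from the tsv, along with the code you are currently running?
Sure, I added 4 sample lines showing what the tsv looks like, including one where the backslash breaks the line and the rest of the row continues on a new line.
Why are you adding these newline characters in there? It makes no sense to me; just keep one newline per actual tsv row and you avoid the whole mess.
I didn't add it there, it's in the file I'm reading and I'm trying to ignore it :) It's a DB dump, and tsv madness is what I'm dealing with.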
import pandas as pd
import io

temp = u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)
# Output:
#                                                    0
# 0  1788768|1831171|208434489|2014-08-14 13:40:02|...
# 1  1788772|1831177|202234489|2014-08-14 13:41:37|...
# 2  1788776|1831182|205234489|2014-08-14 13:42:41|...
# 3  1788780|1831186|202634489|2014-08-14 13:43:46|...

output = df.to_csv(header=False, index=False)

print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))
# Output:
#          0        1          2                    3    4  5    6        7  \
# 0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c  NaN  Desktop
# 1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c  NaN      iOS
# 2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c  NaN  Desktop
# 3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c  NaN  Desktop
#
#         8                                      9         10               11  \
# 0  coupon                             49 XXX Ave  Australia         Victoria
# 1     NaN                  u7 38-46 South Street  Australia  New South Wales
# 2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales
# 3     NaN                             Po box ZZZ  Australia  New South Wales
#
#      12         13
# 0  3025  Melbourne
# 1  2116     Sydney
# 2  2000     Sydney
# 3  2444  NSW Other

# Here output comes from the original sample data, with no items removed
print(pd.read_csv(io.StringIO(u"" + output), sep="|", names=range(15)))
# Output:
#          0        1          2                    3    4  5    6        7  \
# 0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c  NaN  Desktop
# 1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c  NaN      iOS
# 2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c  NaN  Desktop
# 3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c  NaN  Desktop
#
#         8                                      9         10               11  \
# 0  coupon                             49 XXX Ave  Australia         Victoria
# 1     NaN                  u7 38-46 South Street  Australia  New South Wales
# 2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales
# 3     NaN                             Po box ZZZ  Australia  New South Wales
#
#      12         13              14
# 0  3025  Melbourne             NaN
# 1  2116     Sydney             NaN
# 2  2000     Sydney          Sydney
# 3  2444  NSW Other  Port Macquarie
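For a file on disk, the backslash and the newline after it are literally present in the data, so they have to be removed before parsing. The sketch below is one possible way to do that in memory, without writing an intermediate file. The helper name `read_joined`, its `n_cols` parameter, the file name and the sample rows are all assumptions made for illustration. Unlike the regex variant, it drops the backslash as well as the newline, which matches the joined values shown in the outputs above.

```python
import io
import os
import tempfile

import pandas as pd


def read_joined(path, sep="|", n_cols=None):
    """Drop every backslash + newline pair, then parse the text from memory."""
    # newline="" keeps the original line endings so both variants can be handled
    with open(path, newline="") as fh:
        text = fh.read()
    # Handle both Windows and Unix line endings after the backslash
    text = text.replace("\\\r\n", "").replace("\\\n", "")
    # Optional fixed number of column names, for rows of unequal length
    names = range(n_cols) if n_cols else None
    return pd.read_csv(io.StringIO(text), sep=sep, header=None, names=names)


# Demo with a hypothetical file whose first record is broken by a backslash
demo_path = os.path.join(tempfile.mkdtemp(), "dump.tsv")
with open(demo_path, "w") as fh:
    fh.write("1|Level XXX Margaret Street\\\n(My Company)|Australia\n")
    fh.write("2|Po box ZZZ|Australia\n")

df_joined = read_joined(demo_path)
print(df_joined)
```

This reads the whole file into memory, which may not suit a very large dump; in that case the same replacement could be applied while streaming the file line by line.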
