Python: How to parse CSV data that contains newlines in a field using pyspark


The source data looks like this. One field in the fourth record contains a multi-line string.

i1|j1|k1|l1|m1
i2|j2|k2|l2|m2
i3|j3|k3|l3|m3
i4|j4|k4|"l4 is
multiline data
multiline data"|m4
i5|j5|k5|l5|m5
I am reading this file with sc.wholeTextFiles:

rdd = sc.wholeTextFiles("file.csv").flatMap(lambda x: x[1].split("\n"))
print(rdd.take(100))
print(rdd.count())
Output of rdd.take(100):

Output of rdd.count():


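The problem is that the lines of the multiline field are treated as new records, so the count increases as well. How can I have this multiline data treated as a single string value of one column (the one starting with l4)?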
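The effect can be reproduced without Spark. The following is a minimal sketch that applies the same split("\n") to the sample data held in a plain Python string; the names data and pieces are illustrative and not part of the original post.

```python
# Sample data from the question, kept in a string instead of file.csv.
data = """i1|j1|k1|l1|m1
i2|j2|k2|l2|m2
i3|j3|k3|l3|m3
i4|j4|k4|"l4 is
multiline data
multiline data"|m4
i5|j5|k5|l5|m5"""

# Splitting on every newline ignores the quoting, so the fourth
# record is cut into three separate pieces.
pieces = data.split("\n")

print(len(pieces))
for piece in pieces:
    print(repr(piece))
```

The split produces seven pieces for five logical records, which is why the count comes out too high.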

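One approach is to use a more advanced regular expression that ignores newlines inside double quotes. It relies on the (*SKIP)(*FAIL) verbs, which only the newer regex module supports, not the standard re module:

"[^"]*"(*SKIP)(*FAIL)|\n

This reads as: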

"[^"]*"(*SKIP)(*FAIL) # match anything between double quotes and "forget" the match
|                     # or
\n                    # match a newline

In Python, this would be:

import regex as re

data = """i1|j1|k1|l1|m1
i2|j2|k2|l2|m2
i3|j3|k3|l3|m3
i4|j4|k4|"l4 is
multiline data
multiline data"|m4
i5|j5|k5|l5|m5"""

rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\n')

lines = rx.split(data)
print(lines)
This yields

['i1|j1|k1|l1|m1', 'i2|j2|k2|l2|m2', 'i3|j3|k3|l3|m3', 'i4|j4|k4|"l4 is\nmultiline data\nmultiline data"|m4', 'i5|j5|k5|l5|m5']

Note that escaped quotes (\") will break this mechanism.
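If backslash-escaped quotes can occur, or the third-party regex module is not available, one possible workaround is to match whole records with the standard re module instead of splitting on newlines. The sketch below is an assumption of how that could look, not part of the original answer; RECORD and split_records are hypothetical names.

```python
import re

# A record is a run of either
#   - a double-quoted section, in which \" (or any backslash escape) is allowed, or
#   - any single character that is neither a double quote nor a newline.
RECORD = re.compile(r'(?:"(?:\\.|[^"\\])*"|[^"\n])+', re.DOTALL)


def split_records(text):
    """Return the logical records of text, keeping quoted newlines intact."""
    return RECORD.findall(text)


data = """i1|j1|k1|l1|m1
i2|j2|k2|l2|m2
i3|j3|k3|l3|m3
i4|j4|k4|"l4 is
multiline data
multiline data"|m4
i5|j5|k5|l5|m5"""

records = split_records(data)
print(records)

# A field with backslash-escaped quotes and an embedded newline.
escaped = 'a|"say \\"hi\\"\nagain"|b\nc|d|e'
print(split_records(escaped))
```

Unlike rx.split, this variant drops empty lines, and an unbalanced quote character is skipped silently, so it only suits input that is known to be well formed.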

Thanks for the reply, but I want to do this in pyspark.
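For the pyspark case, one option is to keep sc.wholeTextFiles from the question and replace the plain split("\n") with a parser that understands quoting. The sketch below uses the standard-library csv module for that. It is an assumption about how the pieces could fit together rather than a confirmed solution from the thread: parse_records and parse_with_spark are hypothetical helper names, and sc is expected to be an already created SparkContext.

```python
import csv
import io


def parse_records(text, delimiter="|"):
    """Parse delimited text into rows; quoted fields may span several lines."""
    # newline="" leaves line endings untouched so csv can handle them itself.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar='"')
    # Skip the empty rows that csv.reader yields for blank lines.
    return [row for row in reader if row]


def parse_with_spark(sc, path):
    """Return an RDD with one list of fields per logical record.

    wholeTextFiles yields (file path, file content) pairs, so each file
    is parsed as a whole and quoted newlines never split a record.
    """
    return sc.wholeTextFiles(path).flatMap(lambda kv: parse_records(kv[1]))


# Local check without Spark, using the sample data from the question.
SAMPLE = """i1|j1|k1|l1|m1
i2|j2|k2|l2|m2
i3|j3|k3|l3|m3
i4|j4|k4|"l4 is
multiline data
multiline data"|m4
i5|j5|k5|l5|m5"""

rows = parse_records(SAMPLE)
print(len(rows))
print(rows[3])
```

With a SparkContext at hand, parse_with_spark(sc, "file.csv").count() should then count logical records instead of physical lines. Note that csv strips the surrounding quotes from the field, and that wholeTextFiles loads each file into memory as a single string, so this only suits files of moderate size. Where DataFrames are acceptable, the multiLine option of Spark's CSV reader may be a simpler route.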