Apache Spark: reading a CSV with the ASCII character á and joining lines in PySpark
Tags: apache-spark, pyspark, pyspark-sql

I have a CSV file in the following format:
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
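As a local sanity check (not part of the original question), Python's standard csv module already treats a quoted field that spans several physical lines as one logical record, which is exactly the behaviour we want Spark to reproduce:

```python
import csv
import io

# Sample data in the question's format: the quoted Sentence field
# spans three physical lines.
raw = (
    'id1,"When I think about the short time that we live and relate it to á\n'
    'the periods of my life when I think that I did not use this á\n'
    'short time."\n'
)

# csv.reader keeps consuming lines until it sees the closing quote,
# so the three physical lines collapse into one record of two fields.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))   # one logical record
print(rows[0][0])  # id1
```

The Sentence field of that single record still contains the embedded newlines and á markers, which is why a cleanup step is needed after parsing.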
I want to read it in PySpark. My code is:
schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .load("mycsv.csv")
But the result I get is:
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
I want to read it into two columns, one containing the Id and the other containing the Sentence.

The lines of a sentence should be joined at the ASCII character á, since I can see the parser keeps reading on the next line without finding a delimiter. My output should look like this:
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
In the example I only considered a single id.

What changes do I need to make to my code?

Just update Spark to 2.2 or later, if you haven't already, and use the multiLine option:
df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True)
Once you've done that, you can use regexp_replace to remove the á:

df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
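The effect of that replacement can be checked locally with Python's re module, a plain-Python stand-in for regexp_replace. Here I assume the marker to strip is á together with the line break it precedes, so that the physical lines join back into one sentence (a slightly wider pattern than the answer's, stated as an assumption):

```python
import re

# Sentence field as multiLine parsing would deliver it: embedded
# newlines, each preceded by the continuation marker "á".
sentence = (
    "When I think about the short time that we live and relate it to á\n"
    "the periods of my life when I think that I did not use this á\n"
    "short time."
)

# Drop "á" plus the newline after it, mirroring
# regexp_replace("Sentence", "á\n", "") on the DataFrame column.
joined = re.sub("á\n", "", sentence)
print(joined)
```

This yields the single-line sentence shown in the desired output above.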
Read the csv as a text file. Split the RDD on the delimiter `,"` so that id1 becomes element rdd[0] and the text becomes element rdd[1]. Then put them into a DataFrame:

data = sc.textFile("mycsv.csv")
df = data.map(lambda line: line.split(",\"")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .toDF(["Id", "Sentence"])
I tried this too, but it only read one line for id1 and skipped all the other lines of id1.

I understand the problem, but I'm not quite sure how to solve it. I'll let you know if I can come up with something. I think that before splitting the RDD we should first try to merge the lines at the ASCII character into a single line; only then can we use the code above.

Yes, we need to join the lines while reading and then split each record into the 2 elements.

Thanks, it works. I didn't know about the multiLine feature. I had to add # -*- coding: utf-8 -*- at the top of the file to read the non-ASCII character á.
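The merge-before-split idea discussed in the comments can be sketched in plain Python. The helper names below are hypothetical, not from the original answers; the sketch assumes every continuation line ends with á and that each record's text starts after the first `,"`:

```python
def merge_records(lines):
    """Join physical lines ending in the continuation marker 'á'
    into single logical records."""
    buffer = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("á"):
            # Continuation: drop the marker, keep accumulating.
            buffer += line[:-1]
        else:
            yield buffer + line
            buffer = ""

def split_record(record):
    """Split a logical record into (Id, Sentence) on the first ',"'."""
    ident, _, text = record.partition(',"')
    return ident, text.rstrip('"')

lines = [
    'id1,"When I think about the short time that we live and relate it to á\n',
    'the periods of my life when I think that I did not use this á\n',
    'short time."\n',
]
rows = [split_record(r) for r in merge_records(lines)]
print(rows[0][0])  # id1
```

In Spark this merging would have to happen before the RDD split (for example on a whole-file read), which is why the built-in multiLine option is the simpler route.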