Apache Spark: reading a CSV with the ASCII character á and joining lines in PySpark
Tags: apache-spark, pyspark, pyspark-sql

I have a CSV file in the following format:
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
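As a local sanity check (not part of the original question), Python's standard csv module already treats a quoted field that spans several physical lines as one logical record, which is exactly the behaviour we want Spark to reproduce:

```python
import csv
import io

# Sample data in the question's format: the quoted Sentence field
# spans three physical lines.
raw = (
    'id1,"When I think about the short time that we live and relate it to á\n'
    'the periods of my life when I think that I did not use this á\n'
    'short time."\n'
)

# csv.reader keeps consuming lines until it sees the closing quote,
# so the three physical lines collapse into one record of two fields.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))   # one logical record
print(rows[0][0])  # id1
```

The Sentence field of that single record still contains the embedded newlines and á markers, which is why a cleanup step is needed after parsing.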
I want to read it in PySpark. My code is:
schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .load("mycsv.csv")
But the result I get is:
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
I want to read it into two columns, one containing the Id and the other containing the Sentence.

The lines of a sentence should be joined at the ASCII character á, since I can see the parser keeps reading on the next line without finding a delimiter. My output should look like this:
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
In the example I only considered a single id.

What changes do I need to make to my code?

Just update Spark to 2.2 or later, if you haven't already, and use the multiLine option:
df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True)
Once you've done that, you can use regexp_replace to remove the á:

df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
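The effect of that replacement can be checked locally with Python's re module, a plain-Python stand-in for regexp_replace. Here I assume the marker to strip is á together with the line break it precedes, so that the physical lines join back into one sentence (a slightly wider pattern than the answer's, stated as an assumption):

```python
import re

# Sentence field as multiLine parsing would deliver it: embedded
# newlines, each preceded by the continuation marker "á".
sentence = (
    "When I think about the short time that we live and relate it to á\n"
    "the periods of my life when I think that I did not use this á\n"
    "short time."
)

# Drop "á" plus the newline after it, mirroring
# regexp_replace("Sentence", "á\n", "") on the DataFrame column.
joined = re.sub("á\n", "", sentence)
print(joined)
```

This yields the single-line sentence shown in the desired output above.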
Read the csv as a text file. Split the RDD on the delimiter `,"` so that id1 becomes element rdd[0] and the text becomes element rdd[1]. Then put them into a DataFrame:

data = sc.textFile("mycsv.csv")
df = data.map(lambda line: line.split(",\"")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .toDF(["Id", "Sentence"])
I tried this too, but it only read one line for id1 and skipped all the other lines of id1.

I understand the problem, but I'm not quite sure how to solve it. I'll let you know if I can come up with something. I think that before splitting the RDD we should first try to merge the lines at the ASCII character into a single line; only then can we use the code above.

Yes, we need to join the lines while reading and then split each record into the 2 elements.

Thanks, it works. I didn't know about the multiLine feature. I had to add # -*- coding: utf-8 -*- at the top of the file to read the non-ASCII character á.
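The merge-before-split idea discussed in the comments can be sketched in plain Python. The helper names below are hypothetical, not from the original answers; the sketch assumes every continuation line ends with á and that each record's text starts after the first `,"`:

```python
def merge_records(lines):
    """Join physical lines ending in the continuation marker 'á'
    into single logical records."""
    buffer = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("á"):
            # Continuation: drop the marker, keep accumulating.
            buffer += line[:-1]
        else:
            yield buffer + line
            buffer = ""

def split_record(record):
    """Split a logical record into (Id, Sentence) on the first ',"'."""
    ident, _, text = record.partition(',"')
    return ident, text.rstrip('"')

lines = [
    'id1,"When I think about the short time that we live and relate it to á\n',
    'the periods of my life when I think that I did not use this á\n',
    'short time."\n',
]
rows = [split_record(r) for r in merge_records(lines)]
print(rows[0][0])  # id1
```

In Spark this merging would have to happen before the RDD split (for example on a whole-file read), which is why the built-in multiLine option is the simpler route.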