
Apache Spark: reading a CSV with an ASCII character in PySpark and concatenating rows


I have a CSV file in the following format -

id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends.  á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in PySpark. My code is -

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

df = sqlContext.read.format("com.databricks.spark.csv") \
        .option("header", "false") \
        .option("inferSchema", "false") \
        .option("delimiter", "\"") \
        .schema(schema) \
        .load("mycsv.csv")
But the result I get is -

+---------------------------------------------------------------+-------------------------------------------------------------------+
| Id                                                            | Sentence                                                          |
+---------------------------------------------------------------+-------------------------------------------------------------------+
| id1,                                                          | When I think about the short time that we live and relate it to á |
| the periods of my life when I think that I did not use this á | null                                                              |
| short time.                                                   | "                                                                 |
+---------------------------------------------------------------+-------------------------------------------------------------------+

I want to read it into two columns, one containing Id and the other containing Sentence. The sentences should be concatenated on the ASCII character á, since I can see that the read continues on the next line without finding the delimiter there.

My output should look like this -

+-----+-----------------------------------------------------------------------------------------------------------------------------------------+
| Id  | Sentence                                                                                                                                |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------+
| id1 | When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------+
In the example I have only considered one id. What modifications does my code need?

Just update Spark to 2.2 or later, if you haven't already done so, and use the multiLine option:

df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True)
If you do that, you can use regexp_replace to remove the á:

from pyspark.sql.functions import regexp_replace

df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
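For reference, a minimal end-to-end sketch of this answer, assuming Spark 2.2+ and the sample file above saved as mycsv.csv. Unlike the snippet above, it relies on the default comma delimiter and standard double-quote handling (which match the sample file), and the extra whitespace cleanup with trim is my own addition:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, trim
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

# multiLine lets a quoted field span physical line breaks
df = (spark.read
      .option("header", "false")
      .option("multiLine", "true")
      .schema(schema)
      .csv("mycsv.csv"))

# drop the continuation marker and collapse the surrounding whitespace
df = df.withColumn("Sentence", trim(regexp_replace("Sentence", "\\s*á\\s*", " ")))
df.show(truncate=False)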

Read the CSV as a text file. Split the RDD on the delimiter ,\" so that id1 becomes element rdd[0] and the text becomes element rdd[1]. Then put them into a DataFrame.
data = sc.textFile("mycsv.csv")
df = sqlContext.createDataFrame(
    data.map(lambda line: line.split(",\""))
        .filter(lambda line: len(line) > 1)
        .map(lambda line: (line[0], line[1]))
).toDF("Id", "Sentence")
I tried this as well, but it only reads one line for id1 and skips all the other lines of id1. I understand the problem but am not quite sure how to solve it; I'll let you know if I can figure something out. I think that before splitting the RDD we should try to merge the lines on the ASCII character into a single line. Only then can we use the code above.

Yes, we need to join the lines when reading them and then split them in two.

Thanks, it worked. I didn't know about the multiLine feature. I had to add # -*- coding: utf-8 -*- at the top of the file to read the non-ASCII character á.
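A rough sketch of the line-merging idea discussed in these comments, assuming the file is small enough to pull onto the driver; the use of wholeTextFiles, the regex, and the column names are my choices, not part of the original answer:

import re

# wholeTextFiles yields (path, content) pairs; take the file's full text
raw = sc.wholeTextFiles("mycsv.csv").values().first()

# glue continuation lines back together: á followed by a newline marks
# a record that keeps going on the next physical line
merged = re.sub(r"\s*á\s*\n\s*", " ", raw)

rows = []
for line in merged.strip().split("\n"):
    parts = line.split(",\"", 1)          # split only on the first ,"
    if len(parts) == 2:
        rows.append((parts[0], parts[1].rstrip("\"")))

df = sqlContext.createDataFrame(rows, ["Id", "Sentence"])
df.show(truncate=False)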