
PySpark: how to load a text file so that it is split by full stops


When I load a text file into an RDD, it is split by line by default. For example, consider the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum 
has been the industry's standard dummy text ever since the 1500s. When an 
unknown printer took a galley of type and scrambled it to make a type specimen book
and publish it.
If I load it into an RDD as shown below, the data is split line by line:

>>> RDD = sc.textFile("Dummy.txt")
>>> RDD.count()
    4
>>> RDD.collect()
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum ',
    "has been the industry's standard dummy text ever since the 1500s. When an ",
    'unknown printer took a galley of type and scrambled it to make a type specimen book',
    'and publish it.']
Since the text file has 4 lines, RDD.count() returns 4 as output and, likewise, the list returned by RDD.collect() contains 4 strings. But is there a way to load the file so that it is parallelized by sentence rather than by line? In that case the output should look like this:

>>> RDD.count()
    3
>>> RDD.collect()
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry.', "Lorem Ipsum 
    has been the industry's standard dummy text ever since the 1500s.", 'When an unknown
    printer took a galley of type and scrambled it to make a type specimen book and publish it.']

Is there some argument I can pass to sc.textFile so that the data is split whenever a full stop occurs, rather than at the end of each line of the text file?

I found the answer in one of the answers written below. The answer is as follows:

rdd = sc.newAPIHadoopFile(YOUR_FILE, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": YOUR_DELIMITER}).map(lambda l:l[1])

The textFile method uses Hadoop's TextInputFormat internally to read the text file. The default key/value pair is the record offset and the whole record, and the default record delimiter is "\n". A simple way to accomplish this is to read the file with the DataFrame csv reader and specify the delimiter as ".", like this:

spark.read.option("delimiter", ".").csv("path to your file")
The issue here is that it splits the sentences into columns rather than rows, which may not be workable for hundreds of sentences.
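
To make that caveat concrete, here is a rough sketch with the sample file (hypothetical output shape):

df = spark.read.option("delimiter", ".").csv("Dummy.txt")
df.show(truncate=False)
# roughly: one row per input line (4 rows for the sample) with columns _c0, _c1, ...
# a sentence that continues on the next line ends up split across two rows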

Another approach is to change the default delimiter of Hadoop's TextInputFormat from "\n" to ".".

This can be done as follows:

 val conf = new org.apache.hadoop.conf.Configuration
 conf.set("textinputformat.record.delimiter", "\u002E")
 sc.newAPIHadoopFile(file-path, 
     classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
     classOf[org.apache.hadoop.io.LongWritable],
     classOf[org.apache.hadoop.io.Text],
     conf).count()

Alternatively, I suppose you could also write a custom input format and use the newAPIHadoopFile or hadoopFile methods above to read the file.

In Scala, we can do collect() + .mkString to build a single string and then split it on ".".

Example:

spark.sparkContext.parallelize(spark.sparkContext.textFile("<file_path>").collect().mkString.split("\\.")).count()

//3

spark.sparkContext.parallelize(spark.sparkContext.textFile("<file_path>").collect().mkString.split("\\.")).toDF().show(false)

//+----------------------------------------------------------------------------------------------------------+
//|_1                                                                                                        |
//+----------------------------------------------------------------------------------------------------------+
//|Lorem Ipsum is simply dummy text of the printing and typesetting industry                                 |
//| Lorem Ipsum has been the industry's standard dummy text ever since the 1500s                             |
//| When an unknown printer took a galley of type and scrambled it to make a type specimen bookand publish it|
//+----------------------------------------------------------------------------------------------------------+
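
For completeness, a PySpark equivalent of the same collect-and-split idea might look like the sketch below (assuming the sample Dummy.txt; like the Scala version, it pulls the whole file to the driver, which the comment below notes is costly for large files):

lines = spark.sparkContext.textFile("Dummy.txt").collect()   # all lines on the driver
text = "".join(lines)                                        # mkString with no separator
sentences = [s.strip() for s in text.split(".") if s.strip()]
rdd = spark.sparkContext.parallelize(sentences)
rdd.count()   # 3; "book" and "and" still get glued together, just as in the mkString output above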

Try using option("multiLine", "true"), which works for DataFrames. @dassum You can't specify options for the RDD textFile method. That's a nice trick, but how feasible is it for a huge file? collect() is a big burden.