Apache Spark: how to generate a large word-count file in Spark?
I want to generate a word-count file with 10 million lines for a performance test (every line contains the same sentence), but I don't know how to write the code. Could you give me some sample code that saves the file directly to HDFS?

You can try something like this: generate one column whose values range from 1 to 100k and another whose values range from 1 to 100, then flatten both with explode(column). You cannot generate a single column holding 10 million values, because the Kryo buffer will throw an error. I don't know whether this is the best approach performance-wise, but it is the fastest one I can think of right now.
import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._ // already in scope in spark-shell

// UDF that builds an array [1, 2, ..., s]
val generateList = udf((s: Int) => {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- 1 to s) {
    buf += i
  }
  buf
})

val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")

// Attach one array of 100,000 values and one array of 100 values
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
  .withColumn("genColumn2", generateList(lit(100)))

// First explode: 1 row -> 100,000 rows
val someDfWithMilColumn100k = someDfWithMilColumn
  .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")

// Second explode: 100,000 rows -> 10,000,000 rows
val someDfWithMilColumn10mil = someDfWithMilColumn100k
  .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")

someDfWithMilColumn10mil.write.parquet(path) // path: target directory, e.g. on HDFS
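If you need the result as a plain-text word-count file on HDFS rather than Parquet, a minimal follow-up sketch could look like the code below (hdfs:///tmp/wordcount_10m is a placeholder path; the DataFrame is the one built above):

// Sanity-check the row count: 100,000 * 100 = 10,000,000
println(someDfWithMilColumn10mil.count())

// The DataFrame now has a single string column, so it can be written as text directly to HDFS
someDfWithMilColumn10mil.write.text("hdfs:///tmp/wordcount_10m")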
You can take this approach: use tail recursion to generate the list of sentences and the DataFrames, and union them into one big DataFrame.
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.annotation.tailrec

val spark = SparkSession
  .builder()
  .appName("TenMillionsRows")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "TenMillionsRows")   // To silence Metrics warning
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
/**
* Returns a List containing `num` copies of the given sentence
* @param sentence the sentence to repeat
* @param num the number of copies
* @return
*/
def getList(sentence: String, num: Int) : List[String] = {
@tailrec
def loop(st: String,n: Int, acc: List[String]): List[String] = {
n match {
case num if num == 0 => acc
case _ => loop(st, n - 1, st :: acc)
}
}
loop(sentence,num,List())
}
/**
* Returns a DataFrame that is the union of `num` DataFrames built from the list
* @param lst the list of sentences
* @param num the number of unions
* @return
*/
def getDataFrame(lst: List[String], num: Int): DataFrame = {
@tailrec
def loop (ls: List[String],n: Int, acc: DataFrame): DataFrame = {
n match {
case n if n == 0 => acc
case _ => loop(lst,n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
}
}
// Seed the accumulator with one sentence so it has the right schema
loop(lst, num, sc.parallelize(List(lst.head)).toDF("sentence"))
}
val sentence = "hope for the best but prepare for the worst"
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence,100)
println(dfs.count())
// output: 10000001
dfs.write.orc("path_to_hdfs") // write the DataFrame to an ORC file
// you can save the file as parquet, txt, json .......
// with dataframe.write
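As the comments above note, the same DataFrame can also be written in other formats through dataframe.write; a small sketch (the HDFS paths are placeholders):

dfs.write.mode("overwrite").parquet("hdfs:///tmp/ten_millions_parquet")
dfs.write.mode("overwrite").json("hdfs:///tmp/ten_millions_json")
// text() works here because dfs has a single string column
dfs.write.mode("overwrite").text("hdfs:///tmp/ten_millions_text")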
Hope this helps.

You can also achieve this by joining two DataFrames as shown below; the code is explained in the inline comments.
import org.apache.spark.sql.SaveMode
object GenerateTenMils {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess // Constant is the author's helper object that returns a SparkSession
spark.conf.set("spark.sql.crossJoin.enabled","true") // Enable cross join
import spark.implicits._
//Create a DF with your sentence
val df = List("each line has the same sentence").toDF
//Create another Dataset with 10000000 records
spark.range(10000000)
.join(df) // Cross Join the dataframes
.coalesce(1) // Output to a single file
.drop("id") // Drop the extra column
.write
.mode(SaveMode.Overwrite)
.text("src/main/resources/tenMils") // Write as text file
}
}
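To check the result, a quick sketch (assuming the output path from the example above) is to read the text file back and count its lines:

val lines = spark.read.text("src/main/resources/tenMils")
println(lines.count()) // expected: 10000000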
Just to clarify: you want to create a single file with 10 million lines? Yes, that's right.