Spark map with many string concatenations

Tags: string, scala, apache-spark, optimization


I'm looking for a way to optimize code like the following:

// for each line do many string concatenations
myRdd.map{x => "some_text" + x._1 + "some_other_text" + x._4 + ...}
I just read that using

s"some_text${x._1}..."

is translated into basic string concatenation, like the one in my map.

So my first idea was to use a StringBuilder, like

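myRdd.map { x =>
  // a new StringBuilder is allocated for every row
  val sb = new StringBuilder()
  sb.append("some_text")
  sb.append(x._1)
  ...
  // return a String, as the concatenation version does
  sb.toString
}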

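But this creates a StringBuilder object for every row. Is there a best practice for this kind of optimization, such as declaring the StringBuilder somewhere else (as an object or class attribute) and always using the same instance in my map?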

If you disassemble the code

myRdd.map{x => "some_text" + x._1 + "some_other_text" + x._4 + ...}

it shows something like this:

NEW java/lang/StringBuilder
DUP
LDC 24
INVOKESPECIAL java/lang/StringBuilder.<init> (I)V
LDC "some_text"
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/String;)Ljava/lang/StringBuilder;
ALOAD 0
INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/Object;)Ljava/lang/StringBuilder;
LDC "some_other_text"
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/String;)Ljava/lang/StringBuilder;
ALOAD 0
INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
INVOKEVIRTUAL java/lang/StringBuilder.append (Ljava/lang/Object;)Ljava/lang/StringBuilder;
INVOKEVIRTUAL java/lang/StringBuilder.toString ()Ljava/lang/String;
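As the bytecode shows, the compiler already turns the concatenation into StringBuilder calls. You could try to reuse a single StringBuilder yourself, but I don't know whether that is worth it. The StringBuilders created in this loop are short-lived: they never leave the region they were allocated in and are cleaned up by the GC right away.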
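
If you do want to try reusing a builder, one option is to keep one instance per partition rather than a global one. The following is only a sketch under assumptions: concatPartition is a hypothetical helper name, the rows are assumed to be 4-tuples of strings, and only the two fields shown in the question are appended. It works on a plain Iterator so it can be tried without a cluster.

```scala
// Hypothetical helper: builds one string per row while reusing a single
// java.lang.StringBuilder for the whole iterator (one builder per partition).
// Assumption: each row is a 4-tuple of strings.
def concatPartition(rows: Iterator[(String, String, String, String)]): Iterator[String] = {
  val sb = new java.lang.StringBuilder()
  rows.map { x =>
    sb.setLength(0) // reset the buffer instead of allocating a new builder
    sb.append("some_text").append(x._1)
      .append("some_other_text").append(x._4)
    sb.toString // copies the contents, so the buffer can be reused for the next row
  }
}

// Local check on a plain iterator
val sample = concatPartition(Iterator(("a", "b", "c", "d"), ("e", "f", "g", "h"))).toList
```

With Spark this would presumably be passed as myRdd.mapPartitions(concatPartition), so each task gets its own builder and nothing mutable is shared between tasks. Whether it actually beats plain concatenation should be measured.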


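The downside of the StringBuilder approach is that it is much less readable, less functional, and not pretty. Unless this snippet is causing a serious performance problem, I would stick with string interpolation. Remember,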

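For reference, here is a small sketch of the interpolated form next to the concatenation from the question. The tuple named row is a made-up sample value, and only the two fields shown in the question are used.

```scala
// Hypothetical sample row standing in for one element of myRdd
val row = ("a", "b", "c", "d")

// Concatenation as written in the question
val concatenated = "some_text" + row._1 + "some_other_text" + row._4

// The same text written with the s interpolator
val interpolated = s"some_text${row._1}some_other_text${row._4}"
```

Both expressions produce the same string, so the choice between them is mostly about readability.
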
Instead of using a global StringBuilder, which is mutable, consider using a List to store the indexed texts and foldLeft to concatenate them, like this:

val rdd = sc.parallelize(Seq(
  ("a", "b", "c", "d", "e"),
  ("f", "g", "h", "i", "j")
))

val textList = List((1, "x1"), (3, "x3"), (4, "x4"))

rdd.map( r => textList.foldLeft("")( (acc, kv) =>
  acc + kv._2 + r.productElement(kv._1 - 1)
) ).
collect
// res1: Array[String] = Array(x1ax3cx4d, x1fx3hx4i)
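
If the repeated acc + ... copies ever become a concern, the same fold can be written with a builder that stays local to each row. This is a sketch, not part of the answer above: sampleRow and labels are illustrative names, and the fold is run on a single tuple rather than on an RDD.

```scala
// Illustrative data: one row and the (1-based index, text) pairs to prepend
val sampleRow = ("a", "b", "c", "d", "e")
val labels = List((1, "x1"), (3, "x3"), (4, "x4"))

// Fold over the labels, appending into a builder that never escapes this expression
val line = labels.foldLeft(new StringBuilder) { (sb, kv) =>
  sb.append(kv._2).append(sampleRow.productElement(kv._1 - 1))
}.toString
```

The builder is created and consumed inside the fold, so no mutable state is shared across rows or tasks, and the lambda could be dropped into rdd.map in place of the string-accumulating version.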