
Arrays: Getting data from an Array[String] in an RDD using Scala


I have an Array[String] of the following form:

res3: Array[String] =
Array("{{Infobox officeholder
|name=Abraham Lincoln
|image=Abraham Lincoln November 1863.jpg{{!}}border
|term_start=March 4, 1861
|term_end=April 15, 1865
|term_start2=March 4, 1847,
"{{Infobox officeholder
|name=Mickael Jackson
|term_start=April 9, 1991
|term_end=April 15, 1865
|term_start2=March 4, 1847")
Now, I need to create an array of the form:

("Abraham Lincoln: March 4, 1861",
"Michael Jackson: April 9, 1991",
...
However, term_start is not always at the same index in the array, so I need some way to apply a regex or a contains check to each line. Is there any way to do this in Scala? The data is loaded from a bz2 file and then transformed in this way.
Thanks a lot.
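
For reference, here is a minimal sketch of the regex-based extraction hinted at above. It assumes the records live in an RDD[String] with one infobox string per element; the field names name and term_start are taken from the sample data, and the code is a sketch rather than a tested solution:

import org.apache.spark.rdd.RDD

// Pull "name" and "term_start" out of each infobox string, wherever the fields appear.
def extractNameAndTermStart(rdd: RDD[String]): RDD[String] = {
  val namePattern  = """\|name=([^|\n]+)""".r
  val startPattern = """\|term_start=([^|\n]+)""".r
  rdd.flatMap { infobox =>
    val pair = for {
      name  <- namePattern.findFirstMatchIn(infobox).map(_.group(1).trim)
      start <- startPattern.findFirstMatchIn(infobox).map(_.group(1).trim)
    } yield s"$name: $start"
    pair.toSeq // drop records where either field is missing
  }
}

// Usage, assuming `rdd` is the RDD[String] loaded from the bz2 file:
// extractNameAndTermStart(rdd).collect()
// => Array("Abraham Lincoln: March 4, 1861", "Mickael Jackson: April 9, 1991", ...)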

I don't fully understand your output format, but this example using DataFrames might help you solve the problem:

import org.apache.spark.sql.functions.{col, explode, udf}

case class Message(text: String)

// Split each message into its "key=value" fragments.
val iterations: (String => Array[String]) = (input: String) => {
  input.split('|')
}
val udf_iterations = udf(iterations)

// Turn "key=value" into "value: key".
val transformation: (String => String) = (input: String) => {
  input.split("=")(1).trim + ": " + input.split("=")(0).trim
}
val udf_transformation = udf(transformation)

val p1 = Message("AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1")
val p2 = Message("ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2")

// `spark` is the SparkSession (available by default in spark-shell).
val records = Seq(p1, p2)
val df = spark.createDataFrame(records)

// One row per fragment, then reformat each fragment.
df.withColumn("text-explode", explode(udf_iterations(col("text"))))
  .withColumn("text-transformed", udf_transformation(col("text-explode")))
  .show(false)

+---------------------------------------+-------------+----------------+
|text                                   |text-explode |text-transformed|
+---------------------------------------+-------------+----------------+
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1|AAA=valAAA1  |valAAA1: AAA    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| BBB=valBBB1 |valBBB1: BBB    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| CCC=valCCC1 |valCCC1: CCC    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2|ZZZ=valZZZ2  |valZZZ2: ZZZ    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| AAA=valAAA2 |valAAA2: AAA    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| BBB=valBBB2 |valBBB2: BBB    |
+---------------------------------------+-------------+----------------+
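
If the real records are the newline-separated infobox strings from the question, the same explode/UDF idea could be adapted by splitting on newlines and keeping only the fields of interest. A rough sketch under that assumption (not part of the original answer; the column name text is reused from the example above):

import org.apache.spark.sql.functions.{col, udf}

// Keep only name and term_start, at whatever position they appear,
// and rebuild a "name: date" string per record.
val extractPair: String => String = { infobox =>
  val fields = infobox.split('\n').map(_.trim.stripPrefix("|"))
  val name   = fields.find(_.startsWith("name=")).map(_.drop("name=".length).trim).getOrElse("")
  val start  = fields.find(_.startsWith("term_start=")).map(_.drop("term_start=".length).trim).getOrElse("")
  s"$name: $start"
}
val udf_extractPair = udf(extractPair)

// Assuming df holds one infobox string per row in a column named "text":
// df.withColumn("summary", udf_extractPair(col("text"))).show(false)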

Could you post what rdd.take(5) looks like? The strings look strange.
res3: Array[String] = Array({{Infobox officeholder |name=Abraham Lincoln |predecessor=[[James Buchanan]] |successor=[[Andrew Johnson]] |office2=Member of the [[U.S. House of Representatives]] from [[Illinois]]'s 7th congressional district |district2= |term_start2=March 4, 1847 |term_en…
And the key names are the same in some cases; I'm only interested in concatenating the value of name with the value of the birth date...