Getting data from an Array[String] in an RDD using Scala
I have an Array[String] of the following form:
res3: Array[String] =
Array("{{Infobox officeholder
|name=Abraham Lincoln
|image=Abraham Lincoln November 1863.jpg{{!}}border
|term_start=March 4, 1861
|term_end=April 15, 1865
|term_start2=March 4, 1847,
"{{Infobox officeholder
|name=Mickael Jackson
|term_start=April 9, 1991
|term_end=April 15, 1865
|term_start2=March 4, 1847")
Now I need to create an array of the form:
("Abraham Lincoln: March 4, 1861",
"Michael Jackson: April 9, 1991",
...
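However, term_start is not always at the same index in the array, so I need some way to apply a regex or contains to each line.
Is there a way to do this with Scala? The data is loaded from a bz2 file and then transformed into this form.
Thanks a lot.
I don't fully understand your output format, but this example using DataFrames may help you solve the problem: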
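import org.apache.spark.sql.functions.{col, explode, udf}

case class Message(text: String)

// Split a record into its "key=value" fields on the '|' separator.
val iterations: (String => Array[String]) = (input: String) => {
  input.split('|')
}
val udf_iterations = udf(iterations)

// Turn "key=value" into "value: key" (assumes every field contains '=').
val transformation: (String => String) = (input: String) => {
  val parts = input.split("=")
  parts(1).trim + ": " + parts(0).trim
}
val udf_transformation = udf(transformation)

val p1 = Message("AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1")
val p2 = Message("ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2")
val records = Seq(p1, p2)
// `spark` is the SparkSession (predefined in spark-shell).
val df = spark.createDataFrame(records)

// One row per field, plus a column holding the reformatted field.
df.withColumn("text-explode", explode(udf_iterations(col("text"))))
  .withColumn("text-transformed", udf_transformation(col("text-explode")))
  .show(false)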
+---------------------------------------+-------------+----------------+
|text                                   |text-explode |text-transformed|
+---------------------------------------+-------------+----------------+
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1|AAA=valAAA1  |valAAA1: AAA    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| BBB=valBBB1 |valBBB1: BBB    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| CCC=valCCC1 |valCCC1: CCC    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2|ZZZ=valZZZ2  |valZZZ2: ZZZ    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| AAA=valAAA2 |valAAA2: AAA    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| BBB=valBBB2 |valBBB2: BBB    |
+---------------------------------------+-------------+----------------+
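Since the question asks for a regex applied to each record, here is a minimal plain-Scala sketch of that idea. It assumes each string holds one complete infobox with fields written as `|key=value`, and that a value never contains a `|` itself (a piped wiki link such as `[[a|b]]` would be cut short). The names `InfoboxFields`, `field` and `summarize` are made up for illustration, and the sample records are invented in the layout shown in the question.

```scala
object InfoboxFields {
  // Value of the first "|key=value" field in a record, wherever it appears.
  // The key must match exactly, so "term_start" does not pick up "term_start2".
  def field(record: String, key: String): Option[String] = {
    val pattern =
      ("""\|\s*""" + java.util.regex.Pattern.quote(key) + """[ \t]*=[ \t]*([^|\n]*)""").r
    pattern.findFirstMatchIn(record).map(_.group(1).trim)
  }

  // "name: term_start" for one record, or None when either field is missing.
  def summarize(record: String): Option[String] =
    for {
      name  <- field(record, "name")
      start <- field(record, "term_start")
    } yield s"$name: $start"

  def main(args: Array[String]): Unit = {
    // Invented sample records; note the fields appear in a different order in each.
    val records = Array(
      "{{Infobox officeholder\n|name=Abraham Lincoln\n|term_start=March 4, 1861\n|term_start2=March 4, 1847",
      "{{Infobox officeholder\n|term_start2=March 4, 1847\n|term_start=April 9, 1991\n|name=Michael Jackson"
    )
    // Records without both fields are dropped rather than raising an error.
    records.flatMap(r => summarize(r).toSeq).foreach(println)
  }
}
```

On an RDD[String] the same function could presumably be applied as `rdd.flatMap(r => InfoboxFields.summarize(r).toSeq)`, provided each element really contains one whole infobox.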
Can you post what rdd.take(5) looks like? The strings look odd.
res3: Array[String] = Array({{Infobox officeholder |name=Abraham Lincoln |predecessor=[[James Buchanan]] |successor=[[Andrew Johnson]] |office2=Member of the [[U.S. House of Representatives]] from [[Illinois]]'s [[7th congressional district]] district |district2= |term_start2=March 4, 1847 |term_en…
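Also, the key names are the same in some cases, and I am only interested in concatenating the value of name with the value of the birth date...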
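When a key can occur more than once in a record, one option is to parse every field into an ordered list of key/value pairs and then pick the first non-empty value per key. The sketch below illustrates that under a few assumptions: fields are separated by `|`, values contain no `|` of their own, and the birth date is stored under a key called `birth_date` (that key name is a guess, it does not appear in the thread). `InfoboxPairs` and its methods are hypothetical names.

```scala
object InfoboxPairs {
  // All "key=value" fields of a record, in order, keeping repeated keys.
  // Everything before the first '|' (the template name) is skipped.
  def fields(record: String): Seq[(String, String)] =
    record.split('|').toSeq.drop(1).flatMap { part =>
      part.split("=", 2) match {
        case Array(k, v) => Seq(k.trim -> v.trim)
        case _           => Seq.empty // no '=' in this part, not a field
      }
    }

  // First non-empty value for a key; blank or later duplicate entries are ignored.
  def firstValue(record: String, key: String): Option[String] =
    fields(record).collectFirst { case (k, v) if k == key && v.nonEmpty => v }

  // "name: birth_date" for one record, or None when either value is missing.
  // "birth_date" is an assumed key name.
  def nameAndBirthDate(record: String): Option[String] =
    for {
      name <- firstValue(record, "name")
      born <- firstValue(record, "birth_date")
    } yield s"$name: $born"
}
```

Taking the first non-empty value is only one possible policy; if a later occurrence should win, `fields(record)` can be reversed before the lookup.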