Python: How do I loop over each item in a PySpark RDD and turn it into a key? With the map function?
First, I have some input like this:
A:<phone1,phone2>,<location1>,<email1>
B:<phone1>,<location2>,<email1,email2>
I want to use the PySpark rdd.map() function to loop over every value in each line and convert the lines into key-value pairs, like this:
phone1: A:<phone1,phone2>,<location1>,<email1>
phone1: B:<phone1>,<location2>,<email1,email2>
phone2: A:<phone1,phone2>,<location1>,<email1>
location1: A:<phone1,phone2>,<location1>,<email1>
location2: B:<phone1>,<location2>,<email1,email2>
email1: A:<phone1,phone2>,<location1>,<email1>
email1: B:<phone1>,<location2>,<email1,email2>
email2: B:<phone1>,<location2>,<email1,email2>
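In an earlier attempt I tried putting a loop inside the lambda passed to map, but that is not supported. Is there another way?

One approach, shown here in the Scala spark-shell with the DataFrame API: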
scala> val rdd = sc.parallelize(Seq("A:<phone1,phone2>,<location1>,<email1>", "B:<phone1>,<location2>,<email1,email2>"))
scala> rdd.foreach(println)
A:<phone1,phone2>,<location1>,<email1>
B:<phone1>,<location2>,<email1,email2>
scala> case class dataclass(c0:String, c1:String)
scala> val df = rdd.map(x => x.split(":")).map(y => dataclass(y(0), y(1))).toDF
scala> df.show(false)
+---+------------------------------------+
|c0 |c1                                  |
+---+------------------------------------+
|A  |<phone1,phone2>,<location1>,<email1>|
|B  |<phone1>,<location2>,<email1,email2>|
+---+------------------------------------+
scala> val df1 = df.withColumn("tempCol",regexp_replace(regexp_replace(col("c1"), "<", ""),">", ""))
.withColumn("tempCol", explode(split(col("tempCol"), ",")))
.withColumn("out", concat(col("tempCol"), lit(":"), col("c0"), lit(":"), col("c1")))
.drop("c0", "c1", "tempCol")
scala> df1.show(false)
+------------------------------------------------+
|out                                             |
+------------------------------------------------+
|phone1:A:<phone1,phone2>,<location1>,<email1>   |
|phone2:A:<phone1,phone2>,<location1>,<email1>   |
|location1:A:<phone1,phone2>,<location1>,<email1>|
|email1:A:<phone1,phone2>,<location1>,<email1>   |
|phone1:B:<phone1>,<location2>,<email1,email2>   |
|location2:B:<phone1>,<location2>,<email1,email2>|
|email1:B:<phone1>,<location2>,<email1,email2>   |
|email2:B:<phone1>,<location2>,<email1,email2>   |
+------------------------------------------------+
scala> val rdd2 = df1.rdd.map(_(0))
scala> rdd2.foreach(println)
phone1:A:<phone1,phone2>,<location1>,<email1>
phone2:A:<phone1,phone2>,<location1>,<email1>
location1:A:<phone1,phone2>,<location1>,<email1>
email1:A:<phone1,phone2>,<location1>,<email1>
phone1:B:<phone1>,<location2>,<email1,email2>
location2:B:<phone1>,<location2>,<email1,email2>
email1:B:<phone1>,<location2>,<email1,email2>
email2:B:<phone1>,<location2>,<email1,email2>
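
Since the question asks about PySpark, here is a rough Python sketch of the same idea. map() emits exactly one output per input, so a one-to-many expansion like this is usually done with flatMap() and a named function instead of a loop inside a lambda. The helper name expand_record and the regular expression are assumptions made for illustration, not part of the original answer. The list comprehension below stands in for the RDD so the snippet runs without a cluster.

```python
import re


def expand_record(line):
    """Return one (key, original_line) pair per value found in the line."""
    # Split "A:<...>,<...>" into the record id and the bracketed fields.
    _record_id, _, fields = line.partition(":")
    # Each <...> group may hold several comma-separated values.
    groups = re.findall(r"<([^>]*)>", fields)
    keys = [v.strip() for group in groups for v in group.split(",") if v.strip()]
    return [(key, line) for key in keys]


lines = [
    "A:<phone1,phone2>,<location1>,<email1>",
    "B:<phone1>,<location2>,<email1,email2>",
]

# Plain-Python stand-in for rdd.flatMap(expand_record).collect()
pairs = [pair for line in lines for pair in expand_record(line)]
for key, value in pairs:
    print(f"{key}: {value}")

# With a live SparkContext named sc (assumed), the same function would be used as:
#   pairs_rdd = sc.parallelize(lines).flatMap(expand_record)
```

Because the result is an RDD of (key, value) tuples, it could then be passed to pair operations such as groupByKey() if all records sharing a phone, location, or email need to be collected together.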