Scala: how to convert an RDD[String] into an RDD[(String, String)]?

I have an RDD[String] that I read from a file:

val file = sc.textFile("/path/to/myData.txt")
myData has the following format:

>str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str1
>str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str2
>str3_name
...
How can I convert the data from the file into the structure RDD[(String, String)]? For example:

trancRDD(
(str1_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL), 
(str2_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL),
...
)

If there is a well-defined record delimiter, such as the ">" shown above, this can be done with a custom Hadoop configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", ">")
// genome.txt contains the records provided in the question without the "..."
val dataset = sc.newAPIHadoopFile(
  "./data/genome.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map { case (_, text) => text.toString } // keep only the record text, drop the offset key
Let's take a look at the data:

data.collect
res11: Array[String] = 
Array("", "str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL 
", "str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL
")
From these strings it is easy to build records:

val records = data.map { multiLine =>
  val lines = multiLine.split("\n")
  (lines.head, lines.tail)
}
records.collect
res14: Array[(String, Array[String])] = Array(("",Array()),
       (str1_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)),
       (str2_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)))

(Removing that first, empty record with a filter is left as an exercise for the reader; a sketch follows below.)
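
For completeness, here is a minimal sketch of that last step: filter out the empty record and join the sequence lines, which yields exactly the RDD[(String, String)] the question asks for (trancRDD is the name used in the question):

val trancRDD = records
  .filter { case (name, _) => name.nonEmpty }                 // drop the empty record created by the leading ">"
  .map { case (name, seqLines) => (name, seqLines.mkString) } // concatenate the sequence lines into one string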

We have done something similar with a custom Hadoop input format, but it is not straightforward. If I were you, I would rather write a small program to convert the input into a Spark-friendly format first.

Since the transformation you want depends on elements before the "current" one (whether the previous line starts with ">"), it cannot really be distributed across partitions: the preceding ">" line may live in a different partition. So, as @maasg said, it is better to preprocess the file into the right format.

Thanks everyone, I found a way! See the working example below.
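
For reference, a minimal sketch of the kind of preprocessing suggested above: a one-off local rewrite that turns each ">name" record into a single tab-separated line. The file names are hypothetical, and this assumes the input can be streamed through the driver in one pass:

import java.io.PrintWriter
import scala.io.Source

val in  = Source.fromFile("/path/to/myData.txt")
val out = new PrintWriter("/path/to/myData.tsv")
var name: String = null     // header of the record currently being collected
val seq = new StringBuilder // its sequence lines, concatenated
def flush(): Unit = if (name != null) out.println(s"$name\t$seq")
for (line <- in.getLines()) {
  if (line.startsWith(">")) { flush(); name = line.drop(1); seq.clear() }
  else seq.append(line)
}
flush() // emit the last record
out.close(); in.close()

Afterwards, sc.textFile("/path/to/myData.tsv").map(_.split("\t")).map(a => (a(0), a(1))) reads the pairs back directly, with no cross-partition dependencies.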