Scala: Using part of the first line of a text file as the key in an RDD

scala apache-spark

I have a dataset made up of several folders named "01" to "15", and each folder contains files named "00-00.txt" to "23-59.txt" (each folder describes one day).

The files contain lines like the following (each entry starting with !AIVDM is on its own line, except the first entry, which starts with a number):

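1443650400.010568 !AIVDM,1,1,B,15NOHL0P00J@uq6>h8Jr6?vN2>RG7kDCm1iW0088i,0*23
!AIVDM,1,1,A,23aIhd@P1@PHRwPM

Assuming each file is small enough to fit in a single RDD record (no more than 2GB), you can use SparkContext.wholeTextFiles to read each file into a single record, and then flatMap those records: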
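import org.apache.spark.rdd.RDD

// assuming the data/ folder contains the folders 01, 02, ..., 15
val result: RDD[(String, String)] = sc.wholeTextFiles("data/*").values.flatMap(file => {
  val lines = file.split("\n")
  // the ID is the first space-separated token of the first line
  val id = lines.head.split(" ").head
  // key every following line by that ID (the first line itself is not emitted)
  lines.tail.map((id, _))
})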
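
The snippet above uses the first line only for its ID, so whatever follows the ID on that line is not emitted. If that first record should be kept as well, one possible variant is sketched below on plain strings so it can be tried without a cluster. The names `FirstLineKey` and `keyed` are hypothetical, and the sketch assumes the ID and the first record are separated by a single space, which the sample data suggests but does not confirm.

```scala
object FirstLineKey {
  // Turns the full text of one file into (id, record) pairs.
  // Assumption: the first line looks like "<id> <record>".
  def keyed(content: String): Seq[(String, String)] = {
    // trim also drops a trailing '\r'; blank lines are skipped
    val lines = content.split("\n").map(_.trim).filter(_.nonEmpty)
    lines.headOption match {
      case None => Seq.empty
      case Some(first) =>
        // split into at most two pieces: the ID and the rest of the line
        val parts = first.split(" ", 2)
        val id = parts(0)
        // the rest of the first line, if there is any
        val firstRecord = parts.drop(1).filter(_.nonEmpty)
        (firstRecord ++ lines.tail).map(record => (id, record)).toSeq
    }
  }
}
```

In the Spark job this could presumably replace the body of the flatMap, for example `sc.wholeTextFiles("data/*").values.flatMap(FirstLineKey.keyed)`; that wiring is not tested here.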

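Alternatively, if that assumption does not hold (each individual file may be large, i.e. hundreds of MB or more), you need to work a bit harder: load all the data into a single RDD, add an index to the data, collect a map of the "key" for each index, and then use those indices to find the right key for each data line: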
import org.apache.spark.rdd.RDD

// read files and zip with index to later match each data line to its key
val raw: RDD[(String, Long)] = sc.textFile("data/*").zipWithIndex().cache()

// separate data rows from ID rows
val dataRows: RDD[(String, Long)] = raw.filter(_._1.startsWith("!AIVDM"))
val idRows: RDD[(String, Long)] = raw.filter(!_._1.startsWith("!AIVDM"))

// collect a map of Index -> ID
val idForIndex = idRows.map { case (row, index) => (index, row.split(" ").head) }.collectAsMap()

// optimization: if idForIndex is very large, consider broadcasting it, or not collecting it and using a join

// map each row to its key by looking up the MAXIMUM index which is < than the row index
// in other words: find the LAST id record BEFORE the row
val result = dataRows.map { case (row, index) =>
  val key = idForIndex.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}
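
The lookup above filters the whole map for every data row, and `maxBy` throws if a data row has no ID row before it. As a rough sketch of the broadcast optimization mentioned in the comment, the collected pairs could be placed in a sorted map so that each lookup is logarithmic. `IdLookup`, `build` and `keyFor` are hypothetical names that are not part of the original answer; the logic is shown on plain values so it can be tried without a cluster.

```scala
import java.util.{TreeMap => JTreeMap}

object IdLookup {
  // Build a sorted map from the collected (index -> id) pairs.
  def build(pairs: Iterable[(Long, String)]): JTreeMap[Long, String] = {
    val tree = new JTreeMap[Long, String]()
    pairs.foreach { case (index, id) => tree.put(index, id) }
    tree
  }

  // ID of the last ID row strictly before `index`, or None if there is none.
  // lowerEntry returns the entry with the greatest key less than the given key,
  // or null when no such key exists.
  def keyFor(tree: JTreeMap[Long, String], index: Long): Option[String] =
    Option(tree.lowerEntry(index)).map(_.getValue)
}
```

In the Spark job this might be wired in with `sc.broadcast(IdLookup.build(idForIndex))` and a flatMap over dataRows that calls `IdLookup.keyFor` on the broadcast value, which would also skip rows that have no preceding ID instead of failing. That wiring is an assumption and is not tested here.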

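How do you want to handle the rest of the records starting with !AIVDM?

I want the number to be the key for all of them, so the resulting RDD should look like this: (1443650400.010568, !AIVDM,1,1,,B,15NOHL0P00J@uq6>h8Jr6?vN2>

Glad it helped - please accept/upvote the answer so other readers know it helped :)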