
Spark and Scala: apply a function to each element of an RDD


I have a VertexRDD[(VertexId, Long)] structured as follows:

(533, 1)
(571, 2)
(590, 0)
...
where each element consists of a vertex id (533, 571, 590, ...) and its number of outgoing edges (1, 2, 0, ...).

I want to apply a function to each element of this RDD. The function must compare the number of outgoing edges against 4 thresholds.

If the number of outgoing edges is less than or equal to one of the 4 thresholds, the corresponding vertex id must be inserted into an Array (or some similar data structure), so that at the end I obtain 4 data structures, each containing the vertex ids that satisfy the comparison with the corresponding threshold.

I need to accumulate the ids that satisfy the same threshold comparison in the same data structure. How can I parallelize this approach using Spark and Scala?
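On the small sample above, the grouping described here can be sketched with plain Scala collections (a local stand-in for the RDD; the thresholds below are hypothetical placeholders, not the real computed ones):

```scala
// Local stand-in for the VertexRDD[(VertexId, Long)] from the question
val degrees: Seq[(Long, Long)] = Seq((533L, 1L), (571L, 2L), (590L, 0L))

// Hypothetical thresholds; in the real code these would be the 4 computed values
val thresholds = Seq(1.0, 2.0, 3.0, 4.0)

// Each vertex id goes into the bucket of the first threshold its out-degree does not exceed
val buckets: Map[Int, Seq[Long]] =
  degrees
    .map { case (id, deg) => (thresholds.indexWhere(deg <= _), id) }
    .groupBy(_._1)
    .map { case (bin, pairs) => (bin, pairs.map(_._2)) }
```

The same keying idea carries over to Spark, where `groupBy` would become a key-based reduction on the RDD.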

My code:

val usersGraphQuery = "MATCH (u1:Utente)-[p:PIU_SA_DI]->(u2:Utente) RETURN id(u1), id(u2), type(p)"
val usersGraph = neo.rels(usersGraphQuery).loadGraph[Any, Any]
val numUserGraphNodes = usersGraph.vertices.count
val numUserGraphEdges = usersGraph.edges.count
val maxNumOutDegreeEdgesPerNode = numUserGraphNodes - 1

// get id and number of outgoing edges of each node from the graph
// except those that have 0 outgoing edges (default behavior of the outDegrees API)
var userNodesOutDegreesRdd: VertexRDD[Int] = usersGraph.outDegrees

/* userNodesOutDegreesRdd.foreach(println) 
 * Now you can see 
 *  (533, 1)
 *  (571, 2)
 */

// I also get ids of nodes with zero outgoing edges
var fixedGraph: Graph[Any, Any] = usersGraph.outerJoinVertices(userNodesOutDegreesRdd)( (vid: Any, defaultOutDegrees: Any, outDegOpt: Option[Any]) => outDegOpt.getOrElse(0L) )
var completeUserNodesOutDregreesRdd = fixedGraph.vertices

/* completeUserNodesOutDregreesRdd.foreach(println) 
* Now you can see 
*  (533, 1)
*  (571, 2)
*  (590, 0) <--
*/

// 4 thresholds that identify the 4 clusters of User nodes based on the number of their outgoing edges 
var soglia25: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*25
var soglia50: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*50
var soglia75: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*75
var soglia100: Double = maxNumOutDegreeEdgesPerNode
println("soglie: "+soglia25+", "+soglia50+", "+soglia75+", "+soglia100)

// containers of individual clusters
var lowSAUsers = new ListBuffer[(Long, Any)]()
var mediumLowSAUsers = new ListBuffer[(Long, Any)]()
var mediumHighSAUsers = new ListBuffer[(Long, Any)]()
var highSAUsers = new ListBuffer[(Long, Any)]()
// overall container of the 4 clusters
var clustersContainer = new ListBuffer[ (String, ListBuffer[(Long, Any)]) ]()

// I WANT PARALLEL FROM HERE -----------------------------------------------
// from RDD to Array
var completeUserNodesOutDregreesArray = completeUserNodesOutDregreesRdd.take(numUserGraphNodes.toInt)

// examine each User node and assign it to its cluster
for(i<-0 to numUserGraphNodes.toInt-1) { 
  // compare the number of outgoing edges (converted to a string, then to a Long)
  // against the thresholds to determine which class the User node belongs to
  if( (completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia25 ) {
    println("ok soglia25 ")
    lowSAUsers += completeUserNodesOutDregreesArray(i)
  }else if( (completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia50 ){
    println("ok soglia50 ")
    mediumLowSAUsers += completeUserNodesOutDregreesArray(i)
  }else if( (completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia75 ){
    println("ok soglia75 ")
    mediumHighSAUsers += completeUserNodesOutDregreesArray(i)
  }else if( (completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia100 ){
    println("ok soglia100 ")
    highSAUsers += completeUserNodesOutDregreesArray(i)
  }

} 

// I put each cluster in the final container
clustersContainer += Tuple2("lowSAUsers", lowSAUsers)
clustersContainer += Tuple2("mediumLowSAUsers", mediumLowSAUsers)
clustersContainer += Tuple2("mediumHighSAUsers", mediumHighSAUsers)
clustersContainer += Tuple2("highSAUsers", highSAUsers)

/* clustersContainer.foreach(println) 
 * Now you can see 
 * (lowSAUsers,ListBuffer((590,0)))
 * (mediumLowSAUsers,ListBuffer((533,1)))
 * (mediumHighSAUsers,ListBuffer())
 * (highSAUsers,ListBuffer((571,2)))
 */

// ---------------------------------------------------------------------
How about creating an array of tuples to represent the different bins:

val bins = Seq(0, soglia25, soglia50, soglia75, soglia100).sliding(2)
    .map(seq => (seq(0), seq(1))).toArray
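The `sliding(2)` call pairs each threshold with the next one, turning the 5 boundary values into 4 `(lower, upper)` bounds. On concrete numbers (hypothetical boundaries, not the computed soglia values):

```scala
// Pair consecutive boundary values into (lower, upper) bin bounds, as above
val edges = Seq(0.0, 25.0, 50.0, 75.0, 100.0)
val bins: Array[(Double, Double)] =
  edges.sliding(2).map(seq => (seq(0), seq(1))).toArray
// bins is Array((0.0,25.0), (25.0,50.0), (50.0,75.0), (75.0,100.0))
```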
Then, for each element of the RDD, find its corresponding bin, make the bin index the key, wrap the id in a Seq, and reduce by key:

def getBin(bins: Array[(Double, Double)], value: Int): Int = { 
   // inclusive lower bound, so vertices with 0 outgoing edges fall into the first bin
   bins.indexWhere { case (a: Double, b: Double) => a <= value && value <= b } 
}
userNodesOutDegreesRdd.map { 
    case (id, value) => (getBin(bins, value), Seq(id))
}.reduceByKey(_ ++ _)
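Spark aside, the map/reduceByKey step can be checked on a local collection; this sketch simulates `reduceByKey(_ ++ _)` with `groupBy` plus concatenation, using made-up bin bounds and out-degrees:

```scala
// (lower, upper) bin bounds; lower bound is inclusive so a 0 out-degree lands in bin 0
val bins = Array((0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0))

def getBin(bins: Array[(Double, Double)], value: Int): Int =
  bins.indexWhere { case (a, b) => a <= value && value <= b }

// Local stand-in for userNodesOutDegreesRdd, with hypothetical out-degrees
val pairs = Seq((533L, 10), (571L, 30), (590L, 0))

// map + reduceByKey(_ ++ _) collapses locally to groupBy + flatten
val clustered: Map[Int, Seq[Long]] =
  pairs.map { case (id, value) => (getBin(bins, value), Seq(id)) }
       .groupBy(_._1)
       .map { case (bin, kvs) => (bin, kvs.flatMap(_._2)) }
```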

Can you show the expected output and what you have tried? – mtoto

@mtoto I have updated my question.

Perfect!!! Given my code, is it also possible to get the User nodes' attribute (i.e. the name) in the 4 final lists, instead of just their ids (or both id and attribute)? I mean, right now I get:

(0,List(610, 590)) (1,List(627)) (3,List(571)) (2,List(533))

but I would like to get:

(0,List((610,Bob),(590,Fabian))) (1,List((627,Chris))) (3,List((571,Frank))) (2,List((533,Joe)))

I'm not familiar with GraphX, so I'm not sure how to extract the name, but the idea would be to build an RDD of 3-value tuples

(id, name, value)

and then:

rdd.map { case (id, name, value) => (getBin(bins, value), Seq((id, name))) }.reduceByKey(_ ++ _)
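Carrying the name along is just a wider tuple with the same keying; a local sketch, using the hypothetical names and made-up out-degrees (how to pull the name out of the GraphX vertex attribute is left open, as in the comment):

```scala
// (lower, upper) bin bounds with an inclusive lower bound
val bins = Array((0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0))

def getBin(bins: Array[(Double, Double)], value: Int): Int =
  bins.indexWhere { case (a, b) => a <= value && value <= b }

// Hypothetical (id, name, out-degree) triples
val triples = Seq((533L, "Joe", 10), (571L, "Frank", 30), (590L, "Fabian", 0))

// Same bin key as before, but the payload is now the (id, name) pair
val clustered: Map[Int, Seq[(Long, String)]] =
  triples.map { case (id, name, value) => (getBin(bins, value), Seq((id, name))) }
         .groupBy(_._1)
         .map { case (bin, kvs) => (bin, kvs.flatMap(_._2)) }
```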