Finding frequent contiguous sequences using Apache Spark


How can I find frequent contiguous sequences using Apache Spark?

Try taking the initial string, splitting it into distinct substrings of varying lengths, then broadcasting the initial string and filtering for the substrings whose repetitions occur in it. Something like this in spark-shell:

val s = "AATTGTGTGTGTGATTTTTTAATG" //your string

val s_broadcast = sc.broadcast(s) //broadcast version

val A = 2 // min length of substring
val B = 3 // max length of substring
val C = 3 // min support
val L = s.size //length of the string

sc.parallelize(
    for{
        i <- A to B
        j <- 0 to (L - i)
    } yield (j,i+j)
) // generating pairs of (start, end) indices for substrings
.map{case(j,i)=>s_broadcast.value.substring(j,i)}
.distinct // if optimization is needed, this step is a place to start
.filter(x=>s_broadcast.value.indexOf(x*C)>=0) // keep substrings whose C-fold repetition occurs in s
.collect // bring results to the driver
.map(_*C) // expand each substring to its repeated form

What is the maximum length of s? What are the largest values of L and of the minimum support?

Thanks. The output of this code is Array(TT, GT, TG), and these are not contiguous subsequences of the input string s = AATTGTGTGTGTGATTTTTTAATG.

Add .map(_*C) as the last line.

Thank you very much, it works very well. However, when trying val s = sc.textFile("file:///tmp/input.txt") together with val s_broadcast = sc.broadcast("file:///tmp/input.txt"), I get the error java.lang.StringIndexOutOfBoundsException: String index out of range at the line .map{case(j,i)=>s_broadcast.value.substring(j,i)} in the code. I would appreciate help fixing the error.

I was able to clear the error with val seq = sc.textFile("file:///tmp/input.txt"), val fseq = seq.flatMap(_.toString).collect.mkString("") and val s_broadcast = sc.broadcast(fseq) //broadcast version. If there is a better way, please let me know.

Can a Hamming distance be introduced into the algorithm, so that the frequent contiguous subsequences need not be exact matches against the string? For example, given AATGTTGTAGGGTTCCTA, with a Hamming distance of 1, length 3 and minimum support 2, GTTGTA would be a contiguous frequent sequence. I have been at this for a while and would appreciate your input. Thanks.
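The file-loading fix discussed in the comments can be sketched a little more simply. This is an assumption-laden sketch, not part of the accepted answer: it presumes the spark-shell SparkContext `sc` and the commenter's path `file:///tmp/input.txt`. The root cause of the exception is that `sc.textFile` returns an `RDD[String]` with one element per line, not a single `String`, so the code above, which expects a plain `String` for `substring(j, i)`, fails.

```scala
// Assumes a SparkContext `sc`, as provided in spark-shell.
val lines = sc.textFile("file:///tmp/input.txt") // RDD[String], one element per line
val s = lines.collect().mkString("")             // concatenate the lines on the driver
val s_broadcast = sc.broadcast(s)                // broadcast the full sequence as one String
```

This does the same thing as the commenter's `flatMap(_.toString).collect.mkString("")`, but joins whole lines instead of first exploding each line into characters. Note that `collect` pulls the whole file to the driver, which is fine for a single sequence but not for very large inputs.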
val s = "AATTGTGTGTGTGTGATTTTTTAATG" //your string

val s_broadcast = sc.broadcast(s) //broadcast version

val A = 2 // min length of substring
val B = 3 // max length of substring
val C = 3 // min support
val L = s.size //length of the string

sc.parallelize(
    for{
        i <- A to B
        j <- 0 to (L - i)
    } yield (j,i+j)
) // generating pairs of (start, end) indices for substrings
.map{case(j,i)=>s_broadcast.value.substring(j,i)}
.distinct // if optimization is needed, this step is a place to start
.flatMap(x=>
    for{
        v <- C to L/A
    } yield x->v
) //making "AB"->3 pairs, which will result in search for "ABABAB"
.filter{case(x,v)=>s_broadcast.value.indexOf(x*v)>=0}
.groupByKey //grouping the repeat counts found for each substring
.map{case(k,v)=>k->v.max} //keeping the largest repeat count per substring
.collect //bringing the results to the driver
.map{case(k,v)=>k*v} //expanding each substring to its repeated form
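The Hamming-distance relaxation asked about in the comments was not answered above. One possible approach, sketched here with hypothetical helper names (`hamming`, `approxSupport`) rather than anything from the answer: instead of the exact `indexOf(x*v) >= 0` test, slide a window of length `x.length * v` over the string and accept windows within a given Hamming distance of the exact repetition.

```scala
// Hamming distance between two equal-length strings:
// the number of positions at which they differ.
def hamming(a: String, b: String): Int =
  a.zip(b).count { case (x, y) => x != y }

// Count the windows of s that are within maxDist mismatches of
// `pattern` repeated c times (a sliding-window scan, step 1).
def approxSupport(s: String, pattern: String, c: Int, maxDist: Int): Int = {
  val w = pattern * c                  // the exact c-fold repetition
  if (w.length > s.length) 0
  else s.sliding(w.length).count(win => hamming(win, w) <= maxDist)
}
```

In the pipeline above, the filter would become something like `.filter{case(x,v) => approxSupport(s_broadcast.value, x, v, maxDist) > 0}`. On the commenter's example, `approxSupport("AATGTTGTAGGGTTCCTA", "GTT", 2, 1)` accepts the window GTTGTA, which is within Hamming distance 1 of GTTGTT. One caveat: the final `.map{case(k,v)=>k*v}` would then report the exact repetition (GTTGTT), not the variant actually present (GTTGTA); recovering the matched variant would require keeping the accepted windows instead of just counting them.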