Cassandra Spark Connector: Partition Usage and Performance Issues

Tags: cassandra, apache-spark, datastax-enterprise, datastax

I am trying to run a Spark job (talking to Cassandra) that reads data, performs some aggregation, and then writes the aggregates back to Cassandra.

  • I have two tables: monthly active users (MAU) and daily user metric aggregates (DUMA).
  • For each record in MAU, DUMA will have one or more records.
  • Fetch all records from MAU, take each record's user_id, and look up that user's records in DUMA (applying a server-side filter such as metric_name in ('ms', 'md')).
  • If DUMA has one or more records for the given where clause, I increment the count for that app in the appMauAggregate map (app-wise MAU count).
  • I tested this algorithm and it works as expected, but I would like to find out:
1) Is this an optimized algorithm, or is there a better way to do it? I have a feeling something is not right and I am not seeing any speedup; it looks like a Cassandra client is being created and closed for every Spark action (collect), and processing even a small data set takes a long time.

2) The Spark workers are not co-located with Cassandra, meaning the Spark workers run on different nodes (containers) than the C* nodes (we could move the Spark workers onto the C* nodes to get data locality).

3) I see that a Spark job is created/submitted for every Spark action (collect). I believe this is Spark's expected behavior; is there any way to reduce the reads from C* and use a join so that data is retrieved quickly?

4) What are the downsides of this algorithm? Can you recommend a better design approach, i.e. w.r.t. partitioning strategy, loading C* partitions into Spark partitions, and memory requirements for the executors/driver?

5) As long as the algorithm and design approach are sound, I can take care of the Spark tuning. I am using 5 workers (each with 16 CPUs and 64 GB RAM).

C* schema:

MAU:

Data:

    cqlsh:analytics> select * from monthly_active_users limit 2;

     month  | app_id                               | user_id
    --------+--------------------------------------+--------------------------------------
     2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1
     2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 2c70a31a-031c-4dbf-8dbd-e2ce7bdc2bc7

DUMA:

    CREATE TABLE analytics.daily_user_metric_aggregates (
        metric_date timestamp,
        user_id uuid,
        metric_name text,
        "count" counter,
        PRIMARY KEY (metric_date, user_id, metric_name)
    ) WITH CLUSTERING ORDER BY (user_id ASC, metric_name ASC)

Data:

    cqlsh:analytics> select * from daily_user_metric_aggregates where metric_date='2015-02-08' and user_id=199c0a31-8e74-46d9-9b3c-04f67d58b4d1;

     metric_date | user_id                              | metric_name | count
    -------------+--------------------------------------+-------------+-------
      2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | md          |     1
      2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | ms          |     1

Spark job:

    import java.net.InetAddress
    import java.util.concurrent.atomic.AtomicLong
    import java.util.{Date, UUID}

    import com.datastax.spark.connector.util.Logging
    import org.apache.spark.{SparkConf, SparkContext}
    import org.joda.time.{DateTime, DateTimeZone}

    import scala.collection.mutable.ListBuffer

    object MonthlyActiveUserAggregate extends App with Logging {

        val KeySpace: String = "analytics"
        val MauTable: String = "mau"

        val CassandraHostProperty = "CASSANDRA_HOST"
        val CassandraDefaultHost = "127.0.0.1"
        val CassandraHost = InetAddress.getByName(sys.env.getOrElse(CassandraHostProperty, CassandraDefaultHost))

        val conf = new SparkConf().setAppName(getClass.getSimpleName)
            .set("spark.cassandra.connection.host", CassandraHost.getHostAddress)

        lazy val sc = new SparkContext(conf)
        import com.datastax.spark.connector._

        def now = new DateTime(DateTimeZone.UTC)
        val metricMonth = now.getYear + "-" + now.getMonthOfYear

        private val mauMonthSB: StringBuilder = new StringBuilder
        mauMonthSB.append(now.getYear).append("-")
        if (now.getMonthOfYear < 10) mauMonthSB.append("0")
        mauMonthSB.append(now.getMonthOfYear).append("-")
        if (now.getDayOfMonth < 10) mauMonthSB.append("0")
        mauMonthSB.append(now.getDayOfMonth)

        private val mauMonth: String = mauMonthSB.toString()

        val dates = ListBuffer[String]()
        for (day <- 1 to now.dayOfMonth().getMaximumValue) {
            val metricDate: StringBuilder = new StringBuilder
            metricDate.append(now.getYear).append("-")
            if (now.getMonthOfYear < 10) metricDate.append("0")
            metricDate.append(now.getMonthOfYear).append("-")
            if (day < 10) metricDate.append("0")
            metricDate.append(day)
            dates += metricDate.toString()
        }

        private val metricName: List[String] = List("ms", "md")
        val appMauAggregate = scala.collection.mutable.Map[String, scala.collection.mutable.Map[UUID, AtomicLong]]()

        case class MAURecord(month: String, appId: UUID, userId: UUID) extends Serializable
        case class DUMARecord(metricDate: Date, userId: UUID, metricName: String) extends Serializable
        case class MAUAggregate(month: String, appId: UUID, total: Long) extends Serializable

        private val mau = sc.cassandraTable[MAURecord]("analytics", "monthly_active_users")
            .where("month = ?", metricMonth)
            .collect()

        mau.foreach { monthlyActiveUser =>
            val duma = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
                .where("metric_date in ? and user_id = ? and metric_name in ?", dates, monthlyActiveUser.userId, metricName)
                //.map(_.userId).distinct().collect()
                .collect()

            if (duma.length > 0) { // if user has `ms` for the given month
                if (!appMauAggregate.isDefinedAt(mauMonth)) {
                    appMauAggregate += (mauMonth -> scala.collection.mutable.Map[UUID, AtomicLong]())
                }
                val monthMap: scala.collection.mutable.Map[UUID, AtomicLong] = appMauAggregate(mauMonth)
                if (!monthMap.isDefinedAt(monthlyActiveUser.appId)) {
                    monthMap += (monthlyActiveUser.appId -> new AtomicLong(0))
                }
                monthMap(monthlyActiveUser.appId).incrementAndGet()
            } else {
                println(s"No message_sent in daily_user_metric_aggregates for user: $monthlyActiveUser")
            }
        }

        for ((metricMonth: String, appMauCounts: scala.collection.mutable.Map[UUID, AtomicLong]) <- appMauAggregate) {
            for ((appId: UUID, total: AtomicLong) <- appMauCounts) {
                println(s"month: $metricMonth, app_id: $appId, total: $total")
                val collection = sc.parallelize(Seq(MAUAggregate(metricMonth.substring(0, 7), appId, total.get())))
                collection.saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total"))
            }
        }
        sc.stop()
    }

Your solution is the least efficient possible: it performs the join by looking up each key one at a time, which prevents any possible parallelism.

I have never used the Cassandra connector, but I understand that it returns RDDs, so you could do something like this:

import org.apache.spark.rdd.RDD  // needed for the type annotations below

val mau: RDD[(UUID, MAURecord)] = sc
    .cassandraTable[MAURecord]("analytics", "monthly_active_users")
    .where("month = ?", metricMonth)
    .map(u => u.userId -> u)  // Key by user ID.
val duma: RDD[(UUID, DUMARecord)] = sc
    .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
    .where("metric_date in ? and metric_name in ?", dates, metricName)
    .map(a => a.userId -> a)  // Key by user ID.
// Count "duma" records per user. reduceByKey keeps the result distributed;
// countByKey would collect a Map onto the driver and could not be joined below.
val dumaCounts: RDD[(UUID, Long)] = duma.mapValues(_ => 1L).reduceByKey(_ + _)
// Join to "mau". This drops "mau" entries that have no count
// and "duma" entries that are not present in "mau".
val joined: RDD[(UUID, (MAURecord, Long))] = mau.join(dumaCounts)
// Get per-application counts: each joined user contributes 1 to its app.
val appCounts: RDD[(UUID, Long)] = joined
    .map { case (_, (m, _)) => m.appId -> 1L }
    .reduceByKey(_ + _)
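
As a follow-up, the per-app totals could then be written back to C* in one distributed write, instead of building a one-element RDD and calling saveToCassandra once per app as the original job does. This is only a sketch: it assumes appCounts from the snippet above, plus the MAUAggregate case class, the KeySpace, MauTable and mauMonth values, and import com.datastax.spark.connector._ from the question's job.

// Sketch only: one distributed write for all per-app totals.
appCounts
    .map { case (appId, total) => MAUAggregate(mauMonth.substring(0, 7), appId, total) }
    .saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total"))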
  • There is a parameter spark.cassandra.connection.keep_alive_ms which controls how long the connection is kept open. Take a look at the documentation (a minimal configuration sketch is shown after this list).

  • If you colocate the Spark workers with the Cassandra nodes, the connector will take advantage of this and create partitions appropriately, so that an executor always fetches its data from the local node.

  • There are some design improvements you could make to the DUMA table: metric_date does not seem to be the best choice for a partition key - consider making (user_id, metric_name) the partition key, because then you will not have to generate dates for the query - you will only need to put user_id and metric_name in the where clause (see the query sketch after this list). Additionally, you could add a month identifier to the primary key - each partition would then contain only the information relevant to what you want to fetch with a single query.

    Anyway, the functionality to perform joins in the Spark Cassandra Connector is currently being implemented (see).
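
For illustration, a minimal sketch of applying the keep-alive setting mentioned above when building the SparkConf; the 30000 ms value is an arbitrary example, not a recommendation.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
    .setAppName("MonthlyActiveUserAggregate")
    .set("spark.cassandra.connection.host", "127.0.0.1")
    // Keep the Cassandra connection open between Spark actions instead of
    // re-creating it for each collect(); 30000 ms is an example value only.
    .set("spark.cassandra.connection.keep_alive_ms", "30000")
val sc = new SparkContext(conf)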
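
And a sketch of what the DUMA read could look like under the proposed redesign, assuming a daily_user_metric_aggregates table whose partition key is (user_id, metric_name) rather than the current schema; monthlyActiveUser, metricName and DUMARecord are taken from the question's job. The generated dates list is no longer needed.

// Sketch only: assumes the redesigned table with partition key (user_id, metric_name).
val dumaForUser = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
    .where("user_id = ? and metric_name in ?", monthlyActiveUser.userId, metricName)
    .collect()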
