Apache Spark: aggregating the values of a map of maps with a Spark Dataset
I have a column family in Cassandra in the map-of-maps format below, and I want to process it with a Spark Dataset. I want to split the model values into two categories, premium (City and Duster) and non-premium (Alto K10, Aspire, Nano and i10), and then get the total count for each category, i.e. 2 (City plus Duster) and 12 (Alto K10, Aspire, Nano and i10).
Code:
case class UserProfile(userdata: Map[String, Map[String, Int]])

val userprofileDataSet = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "userprofilesagg", "keyspace" -> "KEYSPACENAME"))
  .load()
  .as[UserProfile]
Data format:

{'bodystyle': {'Compact Sedan': 1, 'Hatchback': 8, 'SUV': 1, 'Sedan': 4},
 'models': {'Alto K10': 3, 'Aspire': 4, 'City': 1, 'Duster': 1, 'Nano': 3, 'i10': 2}}

With the user id included, the case class and the rows look like this:

case class UserProfile(userid: String, userdata: Map[String, Map[String, Int]])

DOICvncGKUH9xBLnW3e9jXcd2 | {'bodystyle': {'Compact Sedan': 1, 'Hatchback': 8, 'SUV': 1, 'Sedan': 4},
                             'models': {'Alto K10': 3, 'Aspire': 4, 'City': 1, 'Duster': 1, 'Nano': 3, 'i10': 2}}
BkkpgeAdCkYJEXsdZjiVz3bSb | {'bodystyle': {'Compact Sedan': 7, 'Hatchback': 5, 'SUV': 3, 'Sedan': 7},
                             'models': {'Alto K10': 1, 'Aspire': 7, 'City': 4, 'Duster': 1, 'Nano': 8, 'i10': 1}}
How should I process userprofileDataSet to get these counts?
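One way to compute the two counts is with a plain Scala helper applied to each row's userdata map; the premiumModels set and the categoryCounts name below are assumptions I have taken from the models listed in the question, not part of the original post:

```scala
// Assumed premium set, taken from the question's examples.
val premiumModels = Set("city", "duster")

// Classify each model in the 'models' sub-map and sum the counts per category.
def categoryCounts(userdata: Map[String, Map[String, Int]]): Map[String, Int] =
  userdata.getOrElse("models", Map.empty[String, Int])
    .toSeq
    .groupBy { case (model, _) =>
      if (premiumModels.contains(model.toLowerCase)) "premium" else "non-premium"
    }
    .map { case (category, pairs) => category -> pairs.map(_._2).sum }

// First sample row from the question:
val row = Map("models" -> Map(
  "Alto K10" -> 3, "Aspire" -> 4, "City" -> 1, "Duster" -> 1, "Nano" -> 3, "i10" -> 2))

// categoryCounts(row)("premium") == 2, categoryCounts(row)("non-premium") == 12
```

In a Dataset pipeline this could run as userprofileDataSet.map(p => categoryCounts(p.userdata)), followed by a reduce that merges the per-row maps.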
Edited question:
Regarding Squid's answer: I now want to aggregate the results per user, like this:
DOICvncGKUH9xBLnW3e9jXcd2 | non-premium | [Nano, Alto K10, Aspire, i10] | 12 | premium | [City, Duster] | 2
BkkpgeAdCkYJEXsdZjiVz3bSb | non-premium | [Nano, Alto K10, Aspire, i10] | 17 | premium | [City, Duster] | 5
The case class will now look like this:

case class UserProfile(userid: String, userdata: Map[String, Map[String, Int]])
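The per-user output shown above can be sketched in plain Scala as follows; the CategoryAgg and aggregatePerUser names, and the premium set, are my assumptions. In Spark it could be applied with a flatMap over the Dataset:

```scala
// Hypothetical result row: one entry per (user, category).
case class CategoryAgg(userid: String, category: String, models: Seq[String], total: Int)

val premiumModels = Set("city", "duster")

def aggregatePerUser(userid: String,
                     userdata: Map[String, Map[String, Int]]): Seq[CategoryAgg] =
  userdata.getOrElse("models", Map.empty[String, Int])
    .toSeq
    .groupBy { case (model, _) =>
      if (premiumModels.contains(model.toLowerCase)) "premium" else "non-premium"
    }
    .toSeq
    .map { case (category, pairs) =>
      CategoryAgg(userid, category, pairs.map(_._1), pairs.map(_._2).sum)
    }

// Second sample user from the question: non-premium total 17, premium total 5.
val rows = aggregatePerUser("BkkpgeAdCkYJEXsdZjiVz3bSb", Map("models" -> Map(
  "Alto K10" -> 1, "Aspire" -> 7, "City" -> 4, "Duster" -> 1, "Nano" -> 8, "i10" -> 1)))
```

With the userid case class this could run as userprofileDataSet.flatMap(p => aggregatePerUser(p.userid, p.userdata)).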
Also, you asked why I mentioned bodystyle: with it I can apply a similar aggregation, treating (SUV, Sedan) as premium and the rest as non-premium.

Answer: I am not sure what exactly the role of bodystyle is. If I understood the question correctly, you need the category and the count; you can try the approach below, and drop the type column if you don't need it:
--userprofile table
CREATE TABLE `userprofile`(
  `properties` map<string, map<string, int>>);

--Aggregate by category
select category,
       collect_set(type) as types,
       sum(value) as count
from (select case when lower(type) in ('city', 'duster') then 'premium'
                  when lower(type) in ('alto k10', 'aspire', 'nano', 'i10') then 'non-premium'
             end as category,
             type, value
      from (select properties['models'] as models from userprofile) t
      lateral view explode(models) m as type, value) l
group by category
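For reference, a rough Spark Dataset equivalent of this Hive query; this is an untested sketch, and the key/value column names are the ones Spark produces when exploding a map column:

```scala
import org.apache.spark.sql.functions.{col, collect_set, explode, lower, sum, when}

// Explode the 'models' sub-map into (key, value) rows, classify, then aggregate.
val result = userprofileDataSet
  .select(explode(col("userdata")("models"))) // columns: key (model), value (count)
  .withColumn("category",
    when(lower(col("key")).isin("city", "duster"), "premium")
      .otherwise("non-premium"))
  .groupBy("category")
  .agg(collect_set(col("key")).as("types"), sum(col("value")).as("count"))
```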
Does this handle only the models? What is the role of bodystyle? @Squid I have edited the question; could you please take a look.