Apache Spark DataFrame and corresponding RDD return different rows (PySpark)
I am seeing strange behavior: a DataFrame and the downstream lists/maps generated from its RDD equivalent appear to return different rows. What could be going wrong? Any help is appreciated. The code snippets and output are below.

samples is a DataFrame with 10 rows and 3 columns (10 rows sampled at random from another, larger DataFrame, subset_df). Later I concatenate the first two columns.

The detailed code is shown below. I dump the DataFrame, the key-value map of the generated counts, and finally the RDD-based processed version of the DataFrame. Ideally they should all contain the same set of URLs, but they differ. I would understand if only the order differed (since calling .collect() on an RDD may return rows in a different order), but some of the returned rows are entirely different. For example, the third output seems to contain several URLs that never existed in the DataFrame the RDD was generated from. This looks really strange.
Full code:
import pyspark.sql.functions as func

samples = subset_df.select("post_visid_low", "post_visid_high", "post_page_url") \
    .where(subset_df["post_page_url"] != "") \
    .sample(False, 0.1, seed=0) \
    .limit(num_samples)

tmp = samples.select(
    func.concat(func.col("post_visid_low"), func.lit("-"),
                func.col("post_visid_high")).alias("user_id"),
    "post_page_url")

print("tmp show:")
tmp.show(10, False)

# term frequency computation
vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
for k, v in vocab.items():
    print(k, v)

# group URLs by user_id
user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
num_users = user_id_urls.count()
print("user_id_urls:")
user_id_urls.collect()
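A likely explanation (an assumption on my part, not confirmed in the question) is that the plan above is never materialized: each action (show, collectAsMap, count, collect) re-executes the whole lineage, and a plan ending in sample()/limit() gives no guarantee of selecting the same rows on each execution. A minimal pure-Python sketch of that re-evaluation effect (lazy_limit_sample is an illustrative stand-in, not a Spark API):

```python
import random

def lazy_limit_sample(rows, n):
    """Toy model of an uncached Spark plan ending in sample()/limit():
    every 'action' re-runs the plan, and the subset it picks can change
    because limit() makes no ordering guarantee across executions."""
    shuffled = rows[:]          # stand-in for nondeterministic partition order
    random.shuffle(shuffled)
    return shuffled[:n]

rows = list(range(100))

# Two separate "actions" on the uncached plan may disagree on which rows exist.
first_action = lazy_limit_sample(rows, 10)
second_action = lazy_limit_sample(rows, 10)

# Materializing ("caching") the sample once makes downstream views consistent.
cached = lazy_limit_sample(rows, 10)
vocab_view = {r: 1 for r in cached}       # analogue of the vocab map
rdd_view = list(cached)                   # analogue of the collected RDD
assert set(vocab_view) == set(rdd_view)   # the two views now agree
```

In PySpark the analogous fix would be to persist the sampled frame before branching off it, e.g. samples = samples.cache() (or write it out and read it back), so that vocab and user_id_urls are computed from the same materialized rows.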
Output:
tmp DataFrame show():
vocab map:
http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1
user_id URLs RDD:
[(u'4611687717086954899-2907911088913069555',
u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
(u'2023386797-562458996', u'http://www.backcountry.com'),
(u'6917530783747871522-2923626095076314968',
u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
(u'6917530818644021208-2821777435347267515',
u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
(u'6917530152391623611-2707424459370863148',
u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
(u'6917530609264617841-2788188800375174579',
u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
(u'1657310128-1262694438',
u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]
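To make the mismatch concrete, the two dumps above can be compared directly (the URL sets copied verbatim from the output): they are in fact completely disjoint, not merely reordered.

```python
# URLs from the vocab map dumped above.
vocab_urls = {
    "http://www.backcountry.com/boys-jackets",
    "http://www.backcountry.com/dakine-titan-mittens",
    "https://www.backcountry.com/Store/account/account.jsp",
    "http://www.backcountry.com/ski-clothing",
    "http://www.backcountry.com/the-north-face-runners-1-etip-glove",
    "http://www.backcountry.com/patagonia",
    "http://www.backcountry.com/burton-boys-clothing",
    "http://www.backcountry.com/mens-shorts",
    "https://www.backcountry.com/Store/account/login.jsp",
}

# Comma-joined URL strings from the user_id_urls RDD dump above.
rdd_values = [
    "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys",
    "http://www.backcountry.com",
    "http://www.backcountry.com/pikolinos-verona-boot-womens",
    "http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,"
    "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,"
    "http://www.backcountry.com/dakine-washburn-jacket-mens",
    "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp",
    "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp",
    "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016",
]
rdd_urls = {u for value in rdd_values for u in value.split(",")}

# Not one URL is shared between the two views.
print(sorted(rdd_urls & vocab_urls))   # -> []
```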
Not sure whether this is a problem similar to the one mentioned in the linked question: what are you assuming is equal to what? You seem to be doing a lot of things with a lot of data. By the way, your reduceByKey looks questionable, although I don't know what you expect it to do.
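Regarding the comment about reduceByKey: it is well-defined on what tmp.rdd yields, since a two-field Row unpacks like a (key, value) pair. A minimal pure-Python model of its semantics, with made-up keys and values, reproduces the comma-joining seen in the dump:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, f):
    """Pure-Python model of RDD.reduceByKey: fold each key's values with f."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return {key: reduce(f, values) for key, values in buckets.items()}

pairs = [("user-1", "url-a"), ("user-2", "url-b"),
         ("user-1", "url-c"), ("user-1", "url-d")]
print(reduce_by_key(pairs, lambda x, y: x + "," + y))
# {'user-1': 'url-a,url-c,url-d', 'user-2': 'url-b'}
```

One caveat: in real Spark the order in which one key's values are combined depends on partitioning, so the comma-joined string is only stable up to partition layout. That is a separate issue from the missing-URL problem above, which points at the uncached sample/limit lineage.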