Apache Spark: DataFrame and the corresponding RDD return different rows (PySpark)

Tags: apache-spark, pyspark, rdd, spark-dataframe, pyspark-sql

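I am seeing strange behavior: a DataFrame and the downstream list and map built from its RDD equivalent appear to return different rows. What could be going wrong? Any help is appreciated.

Here are the code snippet and the output:

  • samples is a DataFrame with 10 rows and 3 columns (10 rows randomly sampled from another, larger DataFrame, subset_df). Later, I concatenate the first two columns.
  • The detailed code is shown below. I dump the DataFrame, then the key-value map of counts generated from it, and finally the RDD-processed version of the DataFrame. Ideally all three should contain the same set of URLs, but they differ. I would understand if only the order differed (since calling .collect() on an RDD may return rows in a different order), but some of the returned rows are entirely different. For example, the third output contains several URLs that never appear in the DataFrame the RDD was created from. This looks really odd.

Full code: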

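    from pyspark.sql import functions as func

    # Assumption: the question states that samples has 10 rows.
    num_samples = 10

    # subset_df is the larger source DataFrame (defined elsewhere).
    # Keep non-empty URLs, sample about 10% without replacement, keep num_samples rows.
    samples = (
        subset_df.select("post_visid_low", "post_visid_high", "post_page_url")
        .where(subset_df["post_page_url"] != "")
        .sample(False, 0.1, seed=0)
        .limit(num_samples)
    )

    # Build user_id as "<post_visid_low>-<post_visid_high>".
    tmp = samples.select(
        func.concat(
            func.col("post_visid_low"), func.lit("-"), func.col("post_visid_high")
        ).alias("user_id"),
        "post_page_url",
    )
    print("tmp show:")
    tmp.show(10, False)

    # Term-frequency computation: URL -> number of occurrences
    vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
    for k, v in vocab.items():
        print(k, v)

    # Group URLs by user_id: each Row is treated as a (user_id, url) pair
    user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
    num_users = user_id_urls.count()
    print("user_id_urls:")
    user_id_urls.collect()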
    
Output:

tmp DataFrame show():

vocab map:

    http://www.backcountry.com/boys-jackets 2 
    http://www.backcountry.com/dakine-titan-mittens 1 
    https://www.backcountry.com/Store/account/account.jsp 1 
    http://www.backcountry.com/ski-clothing 1 
    http://www.backcountry.com/the-north-face-runners-1-etip-glove 1 
    http://www.backcountry.com/patagonia 1 
    http://www.backcountry.com/burton-boys-clothing 1 
    http://www.backcountry.com/mens-shorts 1 
    https://www.backcountry.com/Store/account/login.jsp 1
    
user_id_urls RDD:

    [(u'4611687717086954899-2907911088913069555', 
      u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'), 
     (u'2023386797-562458996', u'http://www.backcountry.com'), 
     (u'6917530783747871522-2923626095076314968', 
      u'http://www.backcountry.com/pikolinos-verona-boot-womens'), 
     (u'6917530818644021208-2821777435347267515', 
      u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'), 
     (u'6917530152391623611-2707424459370863148', 
      u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'), 
     (u'6917530609264617841-2788188800375174579', 
      u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'), 
     (u'1657310128-1262694438', 
      u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')] 
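
The two pasted outputs can be compared directly. The sketch below is plain Python (no Spark), with the values copied from the vocab output and the user_id_urls output above; the variable names are illustrative only. Both outputs account for 10 rows, yet they share no URL. That would be consistent with each action re-evaluating the sample(...).limit(...) lineage instead of reusing one fixed set of rows, although the outputs alone do not prove this.

```python
# URL -> count, copied from the vocab output above.
vocab = {
    "http://www.backcountry.com/boys-jackets": 2,
    "http://www.backcountry.com/dakine-titan-mittens": 1,
    "https://www.backcountry.com/Store/account/account.jsp": 1,
    "http://www.backcountry.com/ski-clothing": 1,
    "http://www.backcountry.com/the-north-face-runners-1-etip-glove": 1,
    "http://www.backcountry.com/patagonia": 1,
    "http://www.backcountry.com/burton-boys-clothing": 1,
    "http://www.backcountry.com/mens-shorts": 1,
    "https://www.backcountry.com/Store/account/login.jsp": 1,
}

# (user_id, comma-joined URLs), copied from the user_id_urls output above.
user_id_urls = [
    ("4611687717086954899-2907911088913069555",
     "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys"),
    ("2023386797-562458996", "http://www.backcountry.com"),
    ("6917530783747871522-2923626095076314968",
     "http://www.backcountry.com/pikolinos-verona-boot-womens"),
    ("6917530818644021208-2821777435347267515",
     "http://www.backcountry.com,"
     "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,"
     "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,"
     "http://www.backcountry.com/dakine-washburn-jacket-mens"),
    ("6917530152391623611-2707424459370863148",
     "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530609264617841-2788188800375174579",
     "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("1657310128-1262694438",
     "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016"),
]

# Split the comma-joined strings back into one URL per original row.
rdd_urls = [url for _, joined in user_id_urls for url in joined.split(",")]

vocab_rows = sum(vocab.values())               # rows behind the vocab map
shared = sorted(set(rdd_urls) & set(vocab))    # URLs present in both outputs
only_in_rdd = sorted(set(rdd_urls) - set(vocab))

print("rows behind vocab:", vocab_rows)
print("rows behind user_id_urls:", len(rdd_urls))
print("shared URLs:", shared)
print("URLs only in user_id_urls:", len(only_in_rdd))
```

If re-evaluation is indeed the cause, materializing samples once (for example by persisting it before running the three actions) would be the natural thing to try; that is a suggestion to test, not something the question confirms.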
    

Not sure whether this is an issue similar to the one mentioned in:

What is supposed to be equal to what? You seem to be doing a lot of things to a lot of data. By the way, your reduceByKey looks problematic, although I don't know what you expect from it.
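
On the reduceByKey step: as a rough model, reduceByKey folds all values that share a key with the given two-argument function, and a key with a single value passes through unchanged. The sketch below is a plain-Python stand-in, not Spark itself; reduce_by_key and the example.com rows are hypothetical names and data made up for illustration. Unlike this sketch, Spark does not guarantee the order in which values are combined, so the order of URLs inside a joined string may vary between runs.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold each key's values left to right; single values are returned as-is.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up (user_id, url) rows, shaped like the rows of tmp in the question.
rows = [
    ("user-a", "http://example.com/1"),
    ("user-b", "http://example.com/2"),
    ("user-a", "http://example.com/3"),
]

# Same combining function as in the question.
result = reduce_by_key(rows, lambda x, y: x + "," + y)
print(result)
```

Under this model the step yields one comma-joined URL string per user_id, which matches the shape of the user_id_urls output in the question.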