PySpark inner join cannot resolve a column that is clearly present
I have two PySpark DataFrames, tsteval and top_rec. I am trying to create a new DataFrame, top_rec_tckts, that keeps only the records in tsteval whose storeid and tz_brand_id match those in top_rec, so that I can get the storeid and ticketid of those records from tsteval. Sample output from both DataFrames is shown below; both have storeid and tz_brand_id fields. I don't understand why I get the error below when I try to filter tsteval with an inner join. Does anyone know what the problem is, or can you suggest another way to do this? Apologies: I had to cut out a large part of the error message to fit the length limit. I kept the beginning and the end, which I hope leave enough clues about what is going on.
tsteval.show(truncate=False)
print('')
top_rec.show(truncate=False)
Sample data:
+-----------+-------+---+----------+-------------------+------------------------------------+------------+-----------+----------+----------+
|tz_brand_id|storeid|qty|dateclosed|grossreceipts      |ticketid                            |current_date|filter_date|min_dt    |max_dt    |
+-----------+-------+---+----------+-------------------+------------------------------------+------------+-----------+----------+----------+
|2847       |87     |1.0|2020-06-15|21.1453375         |02c8ec06-a75a-4dd2-89e2-dbbf1dxxxxxx|2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|2847       |87     |1.0|2020-05-23|21.1453375         |67a34306-6608-4b00-bf72-f1f42xxxxxx |2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|2847       |87     |1.0|2020-05-19|26.129683025000002 |82665853-66ad-4e52-851e-f1cdf8xxxxxx|2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|3285       |127    |1.0|2020-06-02|20.642125          |d0898233-64b3-48d8-9a46-a03eefxxxxxx|2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|3285       |127    |1.0|2020-05-22|20.642125          |941d2889-230f-4a19-9cb9-90f7b2xxxxxx|2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|2747       |77     |1.0|2020-05-30|21.3902            |72c3c7dd-a436-45ae-9adb-f19618xxxxxx|2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|9601       |85     |1.0|2020-05-30|23.0               |74328e66-6371-4323-bdf9-057d2xxxxxx |2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
|9601       |85     |1.0|2020-05-29|20.7               |997ab6b3-b4b5-48e4-884d-00834xxxxxx |2020-07-15  |2020-03-17 |2020-03-17|2020-05-16|
+-----------+-------+---+----------+-------------------+------------------------------------+------------+-----------+----------+----------+
only showing top 20 rows
+-------+----------+-----------+
|storeid|max_dt    |tz_brand_id|
+-------+----------+-----------+
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
|127    |2020-05-16|2799       |
+-------+----------+-----------+
Code:
Error:
An error was encountered:
'Resolved attribute(s) max_dt#6786 missing from storeid#3445,qty#3375,max_dt#3299,min_dt#3289,grossreceipts#3381,filter_date#411,tz_brand_id#3449,ticketid#3387,dateclosed#3390,current_date#403 in operator !Filter ((dateclosed#3390 > min_dt#3289) && (dateclosed#3390 <= max_dt#6786)). Attribute(s) with the same name appear in the operation: max_dt. Please check if the right attribute(s) are used.;;\nJoin Inner, ((storeid#292 = storeid#7081) && (tz_brand_id#296 = tz_brand_id#4560))\n:- SubqueryAlias `a`\n: +- Filter ((dateclosed#237 > max_dt#3299) && (dateclosed#237 <= date_add(max_dt#3299, 30)))\n: +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, grossreceipts#228, ticketid#234, current_date#403, filter_date#411, min_dt#3289, date_add(filter_date#411, 60) AS max_dt#3299]\n: +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, grossreceipts#228, ticketid#234, current_date#403, filter_date#411, date_add(filter_date#411, 0) AS min_dt#3289]\n: +- Filter (dateclosed#237 > filter_date#411)\n: +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, grossreceipts#228, ticketid#234, current_date#403, date_add(current_date#403, -120) AS filter_date#411]\n: +- Filter storeid#292 IN (85,130,77,127,87)\n: +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, grossreceipts#228, ticketid#234, to_date(cast(unix_timestamp(2020-07-15 21:17:18, yyyy-MM-dd, None) as timestamp), None) AS current_date#403]\n: +- Filter isnotnull(tz_brand_id#296)\n: +- Filter NOT (storeid#292 = 230)\n: +- Project [tz_brand_id#296, storeid#292, qty#222, dateclosed#237, grossreceipts#228, ticketid#234]\n: +- Filter (producttype#211 = EDIBLE)\n: +- LogicalRDD [cbd_perc#199, thc_perc#200, register#201, customer_type#202, type#203, customer_state#204, customer_city#205, zip_code#206, age#207, age_group#208, cashier#209, approver#210, producttype#211, productsubtype#212, productattributes#213, productbrand#214, productname#215, classification#216, tier#217, 
weight#218, unitofmeasure#219, size#220, priceunit#221, qty#222, ... 75 more fields], false\n+- SubqueryAlias `b`\n +- Project [storeid#7081, max_dt#6786, tz_brand_id#4560]\n +- Project [storeid#7081, max_dt#6786, tz_brand_id#7085, prediction#4548, tz_brand_id#4560, tz_brand_id#4560]\n +- Window [first(tz_brand_id#7085, true) windowspecdefinition(storeid#7081, max_dt#6786, prediction#4548 DESC NULLS LAST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS tz_brand_id#4560], [storeid#7081, max_dt#6786], [prediction#4548 DESC NULLS LAST]\n +- Project [storeid#7081, max_dt#6786, tz_brand_id#7085, prediction#4548]\n +- Filter AtLeastNNulls(n, prediction#4548)\n +- Project [storeid#7081, tz_brand_id#7085, max_dt#6786, accepted_date#4366, UDF(features#4493, features#4503) AS prediction#4548]\n +- Join LeftOuter, (UDF(tz_brand_id#7085) = id#4502)\n :- Join LeftOuter, (UDF(storeid#7081) = id#4492)\n : :- Deduplicate [storeid#7081, tz_brand_id#7085, max_dt#6786, accepted_date#4366]\n : : +- Filter ((cast(max_dt#6786 as timestamp) < accepted_date#4366) || isnull(accepted_date#4366))\n : : +- Project [storeid#7081, tz_brand_id#7085, max_dt#6786, accepted_date#4366]\n : : +- Join LeftOuter, ((storeid#7081 = storeid#4102) && (tz_brand_id#7085 = tz_brand_id#4103))\n : : :- SubqueryAlias `a`\n : : : +- Join Cross\n : : : :- Project [storeid#7081, max_dt#6786]\n : : : : +- Project [storeid#7081, max_dt#6786]\n : : : : +- Project [tz_brand_id#7085, min_dt#3289, max_dt#6786, coalesce((brand_qty#3346 / total_qty#3326), cast(0 as double)) AS norm_qty#3472, storeid#7081]\n : : : : +- Join LeftOuter, (storeid#7081 = storeid#3445)\n : : : : :- SubqueryAlias `a`\n : : : : : +- Project [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085, sum(qty)#3339 AS brand_qty#3346]\n : : : : : +- Aggregate [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085], [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085, sum(qty#7011) AS sum(qty)#3339]\n : : : : : +- 
Filter ((dateclosed#7026 > min_dt#3289) && (dateclosed#7026 <= max_dt#6786))\n : : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, filter_date#411, min_dt#3289, date_add(filter_date#411, 60) AS max_dt#6786]\n : : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, filter_date#411, date_add(filter_date#411, 0) AS min_dt#3289]\n : : : : : +- Filter (dateclosed#7026 > filter_date#411)\n : : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, date_add(current_date#403, -120) AS filter_date#411]\n : : : : : +- Filter storeid#7081 IN (85,130,77,127,87)\n : : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, to_date(cast(unix_timestamp(2020-07-15 21:17:18, yyyy-MM-dd, None) as timestamp), None) AS current_date#403]\n : : : : : +- Filter isnotnull(tz_brand_id#7085)\n : : : : : +- Filter NOT (storeid#7081 = 230)\n : : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023]\n : : : : : +- Filter (producttype#7000 = EDIBLE)\n : : : : : +- LogicalRDD [cbd_perc#6988, thc_perc#6989, register#6990, customer_type#6991, type#6992, customer_state#6993, customer_city#6994, zip_code#6995, age#6996, age_group#6997, cashier#6998, approver#6999, producttype#7000, productsubtype#7001, productattributes#7002, productbrand#7003, productname#7004, classification#7005, tier#7006, weight#7007, unitofmeasure#7008, size#7009, priceunit#7010, qty#7011, ... 
75 more fields], false\n : : : : +- SubqueryAlias `b`\n : : : : +- Project [storeid#3445, sum(qty)#3320 AS total_qty#3326]\n : : : : +- !Aggregate [storeid#3445, min_dt#3289, max_dt#6786], [storeid#3445, min_dt#3289, max_dt#6786, sum(qty#3375) AS sum(qty)#3320]\n : : : : +- !Filter ((dateclosed#3390 > min_dt#3289) && (dateclosed#3390 <= max_dt#6786))\n : : : : +- Project [tz_brand_id#3449, storeid#3445, qty#3375, dateclosed#3390, grossreceipts#3381, ticketid#3387, current_date#403, filter_date#411, min_dt#3289, date_add(filter_date#411, 60) AS max_dt#3299]\n : : : : +- Project [tz_brand_id#3449, storeid#3445, [tz_brand_id#7085, max_dt#6786]\n : : : +- Project [tz_brand_id#7085, min_dt#3289, max_dt#6786, coalesce((brand_qty#3346 / total_qty#3326), cast(0 as double)) AS norm_qty#3472, storeid#7081]\n : : : +- Join LeftOuter, (storeid#7081 = storeid#3445)\n : : : :- SubqueryAlias `a`\n : : : : +- Project [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085, sum(qty)#3339 AS brand_qty#3346]\n : : : : +- Aggregate [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085], [storeid#7081, min_dt#3289, max_dt#6786, tz_brand_id#7085, sum(qty#7011) AS sum(qty)#3339]\n : : : : +- Filter ((dateclosed#7026 > min_dt#3289) && (dateclosed#7026 <= max_dt#6786))\n : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, filter_date#411, min_dt#3289, date_add(filter_date#411, 60) AS max_dt#6786]\n : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, filter_date#411, date_add(filter_date#411, 0) AS min_dt#3289]\n : : : : +- Filter (dateclosed#7026 > filter_date#411)\n : : : : +- Project [tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, current_date#403, date_add(current_date#403, -120) AS filter_date#411]\n : : : : +- Filter storeid#7081 IN (85,130,77,127,87)\n : : : : +- Project 
[tz_brand_id#7085, storeid#7081, qty#7011, dateclosed#7026, grossreceipts#7017, ticketid#7023, to_date(cast(unix_timestamp(2020-07-15 21:17:18, yyyy-MM-dd, None) as timestamp), None) AS current_date#403]\n : : : : +- Filter isnotnull(tz_brand_id#7085)\n : : : : +- Filter NOT (storeid#7081 = 230)\n : : : :
Qty#3948, ... 10 more fields], false\n : +- Project [_1#4489 AS id#4492, _2#4490 AS features#4493]\n : +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#4489, staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(FloatType,false), fromPrimitiveArray, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#4490]\n : +- ExternalRDD [obj#4488]\n +- Project [_1#4499 AS id#4502, _2#4500 AS features#4503]\n +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#4499, staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(FloatType,false), fromPrimitiveArray, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#4500]\n +- ExternalRDD [obj#4498]\n'
# Minimal reproducible example (assumes a PySpark shell, where `sc` and a SparkSession already exist).
from pyspark.sql import functions as F

# Cut-down version of tsteval with only the columns needed for the join.
tsteval = sc.parallelize([
    (2847, 87, 1.0, "2020-06-15", 21.1453375, "02c8ec06-a75a-4dd2-89e2-dbbf1dxxxxxx", "2020-05-16"),
    (3285, 127, 1.0, "2020-06-02", 20.642125, "941d2889-230f-4a19-9cb9-90f7b2xxxxxx", "2020-05-16"),
    (2799, 127, 1.0, "2020-06-03", 23.642125, "997ab6b3-b4b5-48e4-884d-00834xxxxxx", "2020-05-16")
]).toDF(["tz_brand_id", "storeid", "qty", "dateclosed", "grossreceipts", "ticketid", "max_dt"])
# Rename storeid on the left side so the two join inputs no longer share that column name.
tsteval_rn = tsteval.withColumnRenamed("storeid", "storeid_a")
tsteval_rn.show()
# +-----------+---------+---+----------+-------------+--------------------+----------+
# |tz_brand_id|storeid_a|qty|dateclosed|grossreceipts|            ticketid|    max_dt|
# +-----------+---------+---+----------+-------------+--------------------+----------+
# |       2847|       87|1.0|2020-06-15|   21.1453375|02c8ec06-a75a-4dd...|2020-05-16|
# |       3285|      127|1.0|2020-06-02|    20.642125|941d2889-230f-4a1...|2020-05-16|
# |       2799|      127|1.0|2020-06-03|    23.642125|997ab6b3-b4b5-48e...|2020-05-16|
# +-----------+---------+---+----------+-------------+--------------------+----------+
# Cut-down version of top_rec; the duplicate rows mirror the sample data in the question.
top_rec = sc.parallelize([
    (127, "2020-05-16", 2799), (127, "2020-05-16", 2799)
]).toDF(["storeid", "date", "tz_brand_id"])
top_rec.show()
# +-------+----------+-----------+
# |storeid|      date|tz_brand_id|
# +-------+----------+-----------+
# |    127|2020-05-16|       2799|
# |    127|2020-05-16|       2799|
# +-------+----------+-----------+
# Inner join on both keys; top_rec contains duplicate rows, so the matches are de-duplicated afterwards.
df3 = tsteval_rn.join(
    top_rec,
    (tsteval_rn.storeid_a == top_rec.storeid) & (tsteval_rn.tz_brand_id == top_rec.tz_brand_id),
    how='inner'
)
# Restore the original column name and keep only the store and ticket identifiers.
df3.select(F.col('storeid_a').alias("storeid"), 'ticketid').dropDuplicates().show(truncate=False)
# +-------+-----------------------------------+
# |storeid|ticketid                           |
# +-------+-----------------------------------+
# |127    |997ab6b3-b4b5-48e4-884d-00834xxxxxx|
# +-------+-----------------------------------+