Problem joining two DataFrames in Spark SQL (pandas / apache-spark / apache-spark-sql). Please help me solve my issue.
I have two DataFrames. The first has 10 columns (street, state (rows like CA, US), etc.) and the second has two columns (state and the state's full name). I want to join the two DataFrames on state, but in the result I want the full name from the second DataFrame instead of the state abbreviation. I have used:
tranDF.join(stateDF, tranDF("state") === stateDF("state"), "inner").show(false)
The columns I need are:
street city state_NM beds ...etc
I want the column from stateDF to replace the state column of tranDF. Can anyone answer my question?

Check whether the code below works for you:
from pyspark.sql.functions import col

joinDF = (tranDF.alias("a")
          .join(stateDF.alias("b"), col("a.state") == col("b.state"), how="inner")
          .drop(col("a.state"))
          .drop(col("b.state")))
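Since the question is also tagged pandas, the same idea can be sketched with a pandas merge: join on the abbreviation, then drop it so only the full name remains. The miniature DataFrames below are illustrative stand-ins for the tranDF/stateDF in the question, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames from the question.
tranDF = pd.DataFrame({
    "street": ["5801 SW Regional Airport Blvd", "1712 EAST OHIO"],
    "state": ["AR", "MO"],
})
stateDF = pd.DataFrame({
    "state": ["AR", "MO"],
    "statefullname": ["Arkansas", "Missouri"],
})

# Inner-join on the abbreviation, then drop it so only the full name remains.
joined = tranDF.merge(stateDF, on="state", how="inner").drop(columns=["state"])
print(joined)
```

After the drop, the only state-related column left is statefullname from the second DataFrame, which is exactly what the question asks for.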
The approach below should work:
trandf.join(statedf,trandf("state")===statedf("state"),"inner")
.selectExpr("trans.street", "trans.city", "state.statefullname", "trans.type")
.show(false)
Explanation: create an alias for each DataFrame, "trans" and "state". After the inner join, select only the columns you actually need, using select or selectExpr, as shown below.
A complete example in Spark and Scala, using Walmart store data:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
object JoinDemo extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder().appName("JoinDemo").master("local").getOrCreate()
import spark.implicits._
val mycsvdata = """
|"statefullname","state"
|"Alabama","AL"
|"Alaska","AK"
|"Arizona","AZ"
|"Arkansas","AR"
|"California","CA"
|"Colorado","CO"
|"Connecticut","CT"
|"Delaware","DE"
|"District of Columbia","DC"
|"Florida","FL"
|"Georgia","GA"
|"Hawaii","HI"
|"Idaho","ID"
|"Illinois","IL"
|"Indiana","IN"
|"Iowa","IA"
|"Kansas","KS"
|"Kentucky","KY"
|"Louisiana","LA"
|"Maine","ME"
|"Montana","MT"
|"Nebraska","NE"
|"Nevada","NV"
|"New Hampshire","NH"
|"New Jersey","NJ"
|"New Mexico","NM"
|"New York","NY"
|"North Carolina","NC"
|"North Dakota","ND"
|"Ohio","OH"
|"Oklahoma","OK"
|"Oregon","OR"
|"Maryland","MD"
|"Massachusetts","MA"
|"Michigan","MI"
|"Minnesota","MN"
|"Mississippi","MS"
|"Missouri","MO"
|"Pennsylvania","PA"
|"Rhode Island","RI"
|"South Carolina","SC"
|"South Dakota","SD"
|"Tennessee","TN"
|"Texas","TX"
|"Utah","UT"
|"Vermont","VT"
|"Virginia","VA"
|"Washington","WA"
|"West Virginia","WV"
|"Wisconsin","WI"
|"Wyoming","WY"
""".stripMargin.lines.toList.toDS
val mycsvdata1 =
"""
|"opendate","street","city","state","long","lat","type"
|1962-03-01,"5801 SW Regional Airport Blvd","Bentonville","AR",-94.239816,36.350885,"DistributionCenter"
|1962-07-01,"2110 WEST WALNUT","Rogers","AR",-94.07141,36.342235,"SuperCenter"
|1964-08-01,"1417 HWY 62/65 N","Harrison","AR",-93.09345,36.236984,"SuperCenter"
|1965-08-01,"2901 HWY 412 EAST","Siloam Springs","AR",-94.50208,36.179905,"SuperCenter"
|1967-10-01,"3801 CAMP ROBINSON RD.","North Little Rock","AR",-92.30229,34.813269,"Wal-MartStore"
|1967-10-01,"1621 NORTH BUSINESS 9","Morrilton","AR",-92.75858,35.156491,"SuperCenter"
|1968-03-01,"1303 SOUTH MAIN","Sikeston","MO",-89.58355,36.891163,"SuperCenter"
|1968-03-01,"65 WAL-MART DRIVE","Mountain Home","AR",-92.35781,36.329026,"SuperCenter"
|1968-07-01,"2020 SOUTH MUSKOGEE","Tahlequah","OK",-94.97185,35.923658,"SuperCenter"
|1968-07-01,"1500 LYNN RIGGS BLVD","Claremore","OK",-95.61192,36.327143,"SuperCenter"
|1968-11-01,"2705 GRAND AVE","Carthage","MO",-94.31164,37.168985,"SuperCenter"
|1969-04-01,"1800 S JEFFERSON","Lebanon","MO",-92.64733,37.678528,"SuperCenter"
|1969-04-01,"2214 FAYETTEVILLE RD","Van Buren","AR",-94.34581,35.456536,"SuperCenter"
|1969-05-01,"1310 PREACHER RD/HGWY 160","West Plains","MO",-91.87408,36.719145,"SuperCenter"
|1969-05-01,"3200 LUSK DRIVE","Neosho","MO",-94.39016,36.86429,"SuperCenter"
|1969-11-01,"2500 MALCOLM ST/HWY 67 NORTH","Newport","AR",-91.24695,35.586065,"Wal-MartStore"
|1970-03-01,"185 ST ROBERT BLVD","St. Robert","MO",-92.135741,37.827415,"SuperCenter"
|1970-10-01,"1712 EAST OHIO","Clinton","MO",-93.76042,38.364214,"SuperCenter"
|1970-10-01,"4901 SO. MILL ROAD","Pryor","OK",-95.30295,36.294174,"SuperCenter"
|1970-11-01,"1201 N SERVICE ROAD EAST","Ruston","LA",-92.64696,32.52476,"SuperCenter"
|1970-11-01,"3450 S. 4TH TRAFFICWAY","Leavenworth","KS",-94.93555,39.298776,"Wal-MartStore"
|1971-02-01,"4820 SO. CLARK ST","Mexico","MO",-91.88404,39.179316,"SuperCenter"
|1971-02-01,"1101 HWY 32 WEST","Salem","MO",-91.51423,37.630896,"SuperCenter"
|1971-04-01,"2000 JOHN HARDEN DR","Jacksonville","AR",-92.12244,34.879419,"SuperCenter"
|1971-05-01,"2415 N.W. MAIN ST","Miami","OK",-94.87142,36.880746,"SuperCenter"
|1971-06-01,"3108 N BROADWAY","Poteau","OK",-94.61829,35.052793,"SuperCenter"
|1971-06-01,"2050 WEST HWY 76","Branson","MO",-93.25668,36.64417,"Wal-MartStore"
|1971-06-01,"1710 SO. 4TH ST","Nashville","AR",-93.85214,33.985613,"SuperCenter"
|1971-08-01,"724 STADIUM WEST BLVD","Jefferson City","MO",-92.25329,38.568287,"SuperCenter"
|1971-09-01,"701 WALTON DRIVE","Farmington","MO",-90.41404,37.779206,"SuperCenter"
|1971-10-01,"101 EAST BLUEMONT AVENUE","Manhattan","KS",-96.56932,39.184986,"SuperCenter"
|1971-11-01,"2025 BUS. HWY 60 WEST","Dexter","MO",-89.97428,36.784453,"SuperCenter"
|1971-11-01,"2250 LINCOLN AVENUE","Nevada","MO",-94.35075,37.838563,"SuperCenter"
|1971-11-01,"2802 WEST KINGS HIGHWAY","Paragould","AR",-90.5102,36.065711,"SuperCenter"
|1971-11-01,"1301 HWY 24 EAST","Moberly","MO",-92.4344,39.420353,"SuperCenter"
|1971-12-09,"1907 SE WASHINGTON ST.","Idabel","OK",-94.83154,33.883578,"SuperCenter"
|1972-02-01,"1802 SOUTH BUSINESS HWY 54","Eldon","MO",-92.58395,38.311355,"Wal-MartStore"
|1972-03-01,"2400 SOUTH MAIN","Fort Scott","KS",-94.73389,37.823295,"Wal-MartStore"
|1972-05-01,"1155 HWY 65 NORTH","Conway","AR",-92.43401,35.075467,"SuperCenter"
|1972-05-01,"4000 GREEN COUNTRY RD","Bartlesville","OK",-95.92404,36.733398,"SuperCenter"
""".stripMargin.lines.toList.toDS
val trandf: DataFrame = spark.read.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(mycsvdata1).as("trans")
val statedf: DataFrame = spark.read.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(mycsvdata).as("state")
trandf.join(statedf,trandf("state")===statedf("state"),"inner")
.selectExpr("trans.street", "trans.city", "state.statefullname", "trans.type") // select only the needed columns; statefullname comes from statedf
.show(false)
}
Result:
+--------------------------+--------------+-------------+-------------+
|street |city |statefullname|type |
+--------------------------+--------------+-------------+-------------+
|1201 N SERVICE ROAD EAST |Ruston |Louisiana |SuperCenter |
|1303 SOUTH MAIN |Sikeston |Missouri |SuperCenter |
|2705 GRAND AVE |Carthage |Missouri |SuperCenter |
|1800 S JEFFERSON |Lebanon |Missouri |SuperCenter |
|1310 PREACHER RD/HGWY 160 |West Plains |Missouri |SuperCenter |
|3200 LUSK DRIVE |Neosho |Missouri |SuperCenter |
|185 ST ROBERT BLVD |St. Robert |Missouri |SuperCenter |
|1712 EAST OHIO |Clinton |Missouri |SuperCenter |
|4820 SO. CLARK ST |Mexico |Missouri |SuperCenter |
|1101 HWY 32 WEST |Salem |Missouri |SuperCenter |
|2050 WEST HWY 76 |Branson |Missouri |Wal-MartStore|
|724 STADIUM WEST BLVD |Jefferson City|Missouri |SuperCenter |
|701 WALTON DRIVE |Farmington |Missouri |SuperCenter |
|2025 BUS. HWY 60 WEST |Dexter |Missouri |SuperCenter |
|2250 LINCOLN AVENUE |Nevada |Missouri |SuperCenter |
|1301 HWY 24 EAST |Moberly |Missouri |SuperCenter |
|1802 SOUTH BUSINESS HWY 54|Eldon |Missouri |Wal-MartStore|
|3450 S. 4TH TRAFFICWAY |Leavenworth |Kansas |Wal-MartStore|
|101 EAST BLUEMONT AVENUE |Manhattan |Kansas |SuperCenter |
|2400 SOUTH MAIN |Fort Scott |Kansas |Wal-MartStore|
+--------------------------+--------------+-------------+-------------+
only showing top 20 rows
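For readers coming from the pandas tag, the join-then-select pattern above can be sketched in pandas as well. This is an illustrative analogue of the Scala selectExpr call, using a couple of made-up rows in the shape of the Walmart data, not the Spark code itself:

```python
import pandas as pd

# Tiny stand-ins for trandf and statedf from the example above.
trandf = pd.DataFrame({
    "street": ["1201 N SERVICE ROAD EAST", "1303 SOUTH MAIN"],
    "city": ["Ruston", "Sikeston"],
    "state": ["LA", "MO"],
    "type": ["SuperCenter", "SuperCenter"],
})
statedf = pd.DataFrame({
    "state": ["LA", "MO"],
    "statefullname": ["Louisiana", "Missouri"],
})

# Inner join on the abbreviation, then keep only the needed columns,
# mirroring selectExpr("trans.street", "trans.city", "state.statefullname", "trans.type").
result = trandf.merge(statedf, on="state", how="inner")[
    ["street", "city", "statefullname", "type"]
]
print(result)
```

The column selection after the merge plays the same role as select/selectExpr after the Spark join: it discards the abbreviation and keeps the full name.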
I don't know whether a keyword argument like how= exists in the Spark (Scala) API; it is available in PySpark, though. Thanks.

Also, Spark SQL commands don't work in my shell. For example, spark.sql("select * from table") does not work, but joinDf.select("name").show() does. I don't know the reason behind it. Could you answer that as well?

@uppalaadarsh Mate, please proofread your post before submitting; it is really hard to read. If you want to use SQL in Spark, you must first register the DataFrame as a SQL table; otherwise, how would Spark know which of all your DataFrames the name table refers to?

If this works for you, please mark the answer as accepted. A few suggestions for when you post: 1) The question should include a clear data sample (you can post a sample CSV or DataFrame); without one it is hard to give a reproducible answer. 2) The wording and sentences should be clear and should reveal the intent of your question. 3) Avoid typos. 4) Don't rush to post; review your question the way someone else reading it would. With these, you will get better answers to your questions. Keep them in mind.