
PySpark and broadcast join example


I am using Spark 1.3.

# Read from a text file, parse it and then do some basic filtering to get data1
data1.registerTempTable('data1')

# Read from a text file, parse it and then do some basic filtering to get data2
data2.registerTempTable('data2')

# Perform the join
data_joined = data1.join(data2, data1.id == data2.id)

My data is heavily skewed, and data2 (a few KB) is much smaller than data1, so I would like to broadcast data2 when joining.

Spark 1.3 does not support broadcast joins with DataFrames. In Spark >= 1.5.0 you can use the broadcast function to apply a broadcast join:

from pyspark.sql.functions import broadcast

data1.join(broadcast(data2), data1.id == data2.id)
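
To check that the hint actually took effect, you can inspect the physical plan with explain(); when the broadcast is applied, the plan should contain a BroadcastHashJoin operator instead of a shuffle-based join:

# The physical plan should list BroadcastHashJoin for the join below
data1.join(broadcast(data2), data1.id == data2.id).explain()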
For older versions the only option is to convert to an RDD and apply the same logic as in other languages. Roughly something like this:

from pyspark.sql import Row
from pyspark.sql.types import StructType

# Build a dictionary where keys are join keys and values are
# iterables of matching rows, then broadcast it to every executor
data2_bd = sc.broadcast(
    data2.rdd.map(lambda r: (r.id, r)).groupByKey().collectAsMap())

# Define a new Row class with fields from both DataFrames
output_row = Row(*(data1.columns + data2.columns))

# And an output schema
output_schema = StructType(data1.schema.fields + data2.schema.fields)

# Given a row x, look up the matching rows in the broadcast map and
# emit one merged row per match (no match means no output: an inner join)
def gen_rows(x):
    return [output_row(*(x + y)) for y in data2_bd.value.get(x.id, [])]

# flatMap and create a new DataFrame
joined = data1.rdd.flatMap(gen_rows).toDF(output_schema)
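
A quick way to sanity-check the snippet above is to run it on toy data. This is just a sketch; the id/v1/v2 columns are hypothetical, and sqlContext is assumed to be an existing Spark 1.x SQLContext:

# Toy inputs: data2 is the small side that gets broadcast
data1 = sqlContext.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "v1"])
data2 = sqlContext.createDataFrame([(1, "x"), (3, "y")], ["id", "v2"])

# After running the snippet above, joined.show() keeps both id == 1 rows
# from data1 (each matched against data2) and drops id == 2, i.e. plain
# inner join semantics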
pyspark.sql.functions.broadcast first appeared in 1.6 (the PySpark wrapper was added in 1.6, per @NicholasWhite), but the underlying Scala method has been available since 1.5, so it can also be called from PySpark 1.5.
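
On 1.5 that means going through the py4j gateway yourself. A minimal sketch, relying on PySpark's internal _jdf and sql_ctx attributes; the helper name broadcast_df is hypothetical:

from pyspark.sql import DataFrame

def broadcast_df(df):
    # Hypothetical helper: invoke the Scala-side broadcast function
    # through the JVM gateway and re-wrap the result as a Python DataFrame
    jdf = df._sc._jvm.org.apache.spark.sql.functions.broadcast(df._jdf)
    return DataFrame(jdf, df.sql_ctx)

data1.join(broadcast_df(data2), data1.id == data2.id)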
This code works with spark-2.0.2-bin-hadoop2.7:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Without a header, spark.read.csv assigns default column names
# _c0, _c1, ..., so _c77 below is simply the 78th column
df2 = spark.read.csv("D:\\trans_mar.txt", sep="^")
df1 = spark.read.csv("D:\\trans_feb.txt", sep="^")

# Broadcast the smaller DataFrame (df2) to every executor for the join
print(df1.join(broadcast(df2), df2._c77 == df1._c77).take(10))
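
Note that an explicit broadcast() hint is mainly needed when Spark's size estimate is off: Spark SQL already broadcasts the smaller side of a join automatically whenever its estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default). If your "small" table is slightly larger, you can raise the threshold instead, for example:

# Raise the automatic broadcast threshold to 50 MB (value is in bytes);
# the explicit broadcast() hint above works regardless of this setting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)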