Python PySpark SparkSQL: outer join problem

python apache-spark pyspark

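I'm using PySpark and have a problem with outer joins. If I pass a column name (or a list of column names) as the "ON" condition, the result of the join is an inner join, regardless of whether I specify the 'left_outer' option. If I specify the full equality instead (i.e. df1.id == df2.id), the problem goes away.

In other words: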

testDf = sc.parallelize([['a', 1], ['b', 1]]).toDF(['id', 'val1'])
testDf2 = sc.parallelize([['a', 2]]).toDF(['id', 'val2'])
cond = [testDf.id == testDf2.id]
testDf.join(testDf2, cond, how='left_outer').collect()

returns a left outer join, as expected:

[Row(id=u'a', val1=1, id=u'a', val2=2),Row(id=u'b', val1=1, id=None, val2=None)]

But if I use

testDf.join(testDf2, 'id', how='left_outer').collect()

it returns an inner join:

[Row(id=u'a', val1=1, val2=2)]

Can you help me understand why?

Thanks a lot.

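As the official documentation states:

If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join.

In other words, when on is given as column names the join is always an inner join and the how argument has no effect, which matches what you are seeing. Passing a Column expression such as testDf.id == testDf2.id is what makes how='left_outer' apply.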
