PySpark: DataFrame with nested fields to relational table

dataframe apache-spark pyspark


I have a PySpark DataFrame of students with the following schema:

root
 |-- Id: string
 |-- School: array
 |    |-- element: struct
 |    |    |-- Subject: string
 |    |    |-- Classes: string
 |    |    |-- Score: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- ScoreID: string
 |    |    |    |    |-- Value: string

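I want to extract some fields from the DataFrame and normalize them so that they can be loaded into a database. The relational schema I expect consists of the fields

Id, School, Subject, ScoreId, Value

How can I do this efficiently?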

Explode the arrays to flatten the data, then select all the required columns.

Example:

df.show(10, False)
#+---+--------------------------+
#|Id |School                    |
#+---+--------------------------+
#|1  |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+

df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- Classes: string (nullable = true)
# |    |    |-- Score: array (nullable = true)
# |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |-- ScoreID: string (nullable = true)
# |    |    |    |    |-- Value: string (nullable = true)
# |    |    |-- Subject: string (nullable = true)

# 1) explode School: one row per struct, in a column named "col"
# 2) expand that struct and explode its Score array (again named "col")
# 3) keep the scalar fields and expand the score struct
df.selectExpr("Id", "explode(School)").\
    selectExpr("Id", "col.*", "explode(col.Score)").\
    selectExpr("Id", "Classes", "Subject", "col.*").\
    show()
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#|  1|      b|      a|      A|    3|
#|  1|      b|      a|      B|    4|
#+---+-------+-------+-------+-----+
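To make the row semantics of the two explode steps concrete, here is a minimal plain-Python sketch (no Spark involved) that produces the same flat rows from a nested record. The `students` list and the `flatten` helper are hypothetical names introduced only for this illustration; the sample values mirror the single row shown in the answer.

```python
# Plain-Python model of the two explode steps; this is an illustration,
# not Spark code. The sample record mirrors the row shown in the answer.
students = [
    {
        "Id": "1",
        "School": [
            {
                "Subject": "a",
                "Classes": "b",
                "Score": [
                    {"ScoreID": "A", "Value": "3"},
                    {"ScoreID": "B", "Value": "4"},
                ],
            }
        ],
    }
]


def flatten(records):
    """Yield one flat dict per (student, School element, Score element)."""
    for record in records:
        # First explode: one row per element of the School array.
        for school in record.get("School") or []:
            # Second explode: one row per element of the Score array.
            for score in school.get("Score") or []:
                yield {
                    "Id": record["Id"],
                    "Classes": school["Classes"],
                    "Subject": school["Subject"],
                    "ScoreID": score["ScoreID"],
                    "Value": score["Value"],
                }


rows = list(flatten(students))
for row in rows:
    print(row)
```

Like Spark's `explode`, this sketch emits no row for a student whose `School` or `Score` array is null or empty. If such students must still appear in the output (with nulls in the exploded columns), Spark provides `explode_outer` for that purpose.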
