Apache Spark RDD: preserving total order when repartitioning

apache-spark pyspark

It seems that one of my assumptions about ordering in RDDs is incorrect.

Suppose I want to repartition an RDD after sorting it:

import random

# `spark` is an existing SparkSession, e.g. the one the pyspark shell provides
l = list(range(20))
random.shuffle(l)

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x)\
    .repartition(3)\
    .collect()

This yields:

[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

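As we can see, the order is preserved within each partition, but the total order is not preserved across partitions.
I would like to keep the total order of the RDD, like this: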

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

I have had a hard time finding anything online that helps. Any help would be much appreciated.
It appears we can pass the argument numPartitions=partitions to the sortBy function, which partitions the RDD and preserves the total order in a single step:

import random

l = list(range(20))
random.shuffle(l)

partitions = 3

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .collect()
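To see why this works, it may help to model the two steps in plain Python. sortBy places elements by range partitioning, so each partition holds a contiguous slice of the key space and concatenating the partitions in order gives a totally ordered result. A later redistribution has no such guarantee. The sketch below is a deliberately simplified model and not Spark's implementation: the function names `range_partition`, `round_robin` and `collect` are hypothetical, and the round-robin scheme is only an assumed stand-in for what a shuffle after sorting can do.

```python
from bisect import bisect_right
from itertools import chain


def range_partition(values, num_partitions):
    """Toy range partitioning: every value in partition i is smaller
    than every value in partition i + 1."""
    ordered = sorted(values)
    # Pick num_partitions - 1 boundary values at evenly spaced ranks.
    step = len(ordered) / num_partitions
    bounds = [ordered[int(step * i)] for i in range(1, num_partitions)]
    parts = [[] for _ in range(num_partitions)]
    for v in values:
        parts[bisect_right(bounds, v)].append(v)
    # Sort inside each partition, as the sort step would.
    return [sorted(p) for p in parts]


def round_robin(values, num_partitions, start=0):
    """Toy redistribution of already sorted data (assumed scheme)."""
    parts = [[] for _ in range(num_partitions)]
    for i, v in enumerate(values):
        parts[(start + i) % num_partitions].append(v)
    return parts


def collect(parts):
    """Concatenate partitions in partition order, as collect() does."""
    return list(chain.from_iterable(parts))


data = list(range(20))
ranged = range_partition(data, 3)
scattered = round_robin(sorted(data), 3)

print(collect(ranged))     # globally ordered
print(collect(scattered))  # ordered within partitions only
```

In this model the range-partitioned result stays sorted after concatenation, while the redistributed one is sorted only inside each partition, which is the same symptom as in the question.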

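Use coalesce(1) to make it a single partition.

@Kishore, I work with billions of rows, so unfortunately that would not work.

Is sorting after the repartition not an option?

@shaido, it certainly would be. Does it retain the partitions?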
