Python: Converting a dictionary of columns from different DataFrames into a DataFrame (PySpark)


python apache-spark pyspark


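I am trying to combine columns from different DataFrames into a single DataFrame for analysis. I am collecting all the columns I need into a dictionary.

I now have a dictionary like this: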

newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A, 
    'avgCostProgramYear': costs.COSTT4_P, 
    'averageNetPricePublic': costs.NPT4_PUB, 
}

{
 'schoolName': Column<b'INSTNM'>,
 'type': Column<b'CONTROL'>,
 'avgCostAcademicYear': Column<b'COSTT4_A'>,
 'avgCostProgramYear': Column<b'COSTT4_P'>,
 'averageNetPricePublic': Column<b'NPT4_PUB'>
}

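Is this possible? If so, how?

Is this the right way to do it? If not, how can I achieve this?

Using pandas is not an option because the data is very large (2-3 GB) and pandas is too slow. I am running PySpark on my local machine.

Thanks in advance! :)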

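I would suggest two options.

Option 1 (the union case for building the dictionary): You said you have >= 10 tables from which you want to build the dictionary. If these tables share common columns (for example "schoolName", "type", "avgCostAcademicYear", "avgCostProgramYear" and "averageNetPricePublic"), you can use union or unionByName to form a single consolidated view of the data.

For example: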

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df1

union

select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from df2
 ....
union
select schoolName, type, avgCostAcademicYear, avgCostProgramYear, averageNetPricePublic from dfN


This will give you a consolidated view of the dictionary.

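Option 2 (if there are only common join columns): If the tables share some common join columns, you can use a standard join, no matter how many tables there are.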

A pseudo-SQL example:

select dictionary columns from table1,table2,table3,... tablen where join common columns in all tables (table1... tablen)

Note: omitting any of the join columns will result in a Cartesian product.

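You have probably already joined the 2 DataFrames using a common key, and then... you can select the columns you need for the so-called dictionary, right? Or do the two DataFrames have no common columns?

If the data is large, collecting it will end in a fatal OOM.

@RamGhadiyaram There are 10 tables in total, with more than 1900 columns. I was thinking that instead of joining 10 tables, or selecting 40-50 columns, we could do it through a dictionary.

What is wrong with joining or unioning all the common columns into a single view/DataFrame? A big no to collect... you would be bringing all the data to the driver.

I don't understand what you are trying to say. Also, I should have mentioned that the only common key is the row number.

Please see my example: if you don't have join keys, you can use union. The answer below is a SQL representation of the common dict fields.

The only common column is the row number. There are no common columns to join on.

Do you mean union does not work? Have you tried it?

I created a new DataFrame with the required columns using monotonically_increasing_id(), after selecting the columns from each table, instead of creating a dictionary. So, Option 2.

monotonically_increasing_id is not sequential in the way a SQL row number is. Please look at my answer carefully.

I plan to drop it anyway. It won't cause a problem in the join, right? (I am asking about mismatched joins.)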