Apache Spark: how to find the total size of the data read, and which data resides on which node

Suppose I read the following dataset with Apache Spark:

City | Region | Population
A    | A1     | 150000
A    | A2     | 50000
B    | B1     | 250000
C    | C1     | 350000

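After creating a DataFrame from this data, suppose I repartition it by City. Now, if I want to know which node of my Spark cluster holds the rows for city A, is it possible to find out? If so, please explain how.

Please also answer a second question: how can I find the total size of the data that Spark read into the DataFrame?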

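There are a few questions here.

1. You want to see which data each node is processing

Executor nodes only apply the operations defined in your RDD or DataFrame transformations to the chunk of data held in the partitions assigned to that executor node.

I think the best way to inspect the data inside a node is probably to enable logging for the driver and the executors, and to write log entries from within your RDD/DataFrame operations. These logs are written to each executor's local disk, so you will need to connect to each executor node to verify which data belongs to which node.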

If you want to know the total size of the data read into the DataFrame, see below.