Dataframe Databricks spark.read csv has rows with #torefresh


I am reading a CSV into a DataFrame:
1. I create the schema.
2. I load the CSV:

spark.read.option("header", "false").schema(schema).option("delimiter", ",").option("mode", "PERMISSIVE").csv(path1)
How can I check which files / which rows got #torefresh and null…?
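
For context, a minimal sketch of those two steps, assuming hypothetical column names (col1, col2, col3) and a placeholder path; the real schema and path are whatever your data uses:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# 1. create the schema (column names here are placeholders)
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
])

# 2. load the csv; PERMISSIVE mode keeps malformed rows instead of failing the read
path1 = "/path/to/csv/folder"  # placeholder
df = (spark.read
      .option("header", "false")
      .schema(schema)
      .option("delimiter", ",")
      .option("mode", "PERMISSIVE")
      .csv(path1))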

To know which files contain these rows, you can use the input_file_name function from pyspark.sql.functions. This way you can also easily get an aggregation with one row per file, e.g.:

df.where("col1 == '#torefresh'").withColumn("file", input_file_name()).groupBy("file").count().show()

+--------------------+-----+
|                file|count|
+--------------------+-----+
|file:///C:/Users/...|  119|
|file:///C:/Users/...|  131|
|file:///C:/Users/...|  118|
|file:///C:/Users/...|  127|
|file:///C:/Users/...|  125|
|file:///C:/Users/...|  116|
+--------------------+-----+
I am not aware of a good Spark way to find the row numbers in the original files; that information is essentially lost once the CSV is loaded into a DataFrame. There is a row_number function, but it works over a window, so the numbers will depend on how you define the window's partitioning and ordering.
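
For illustration, a minimal sketch of such a window-based numbering; ordering by a data column (col2 here is an assumption) only matches the original file order if that column actually encodes it, which a plain CSV load does not guarantee:

from pyspark.sql.functions import input_file_name, row_number
from pyspark.sql.window import Window

# number the rows within each source file, ordered by an assumed column col2
w = Window.partitionBy("file").orderBy("col2")
numbered = (df.withColumn("file", input_file_name())
              .withColumn("row_in_window", row_number().over(w)))
numbered.where("col1 == '#torefresh'").select("file", "row_in_window").show()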

If you are working with a local file system, you could try reading the CSV again manually and looking up the row numbers, like this:

import csv
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import ArrayType, IntegerType

@udf(returnType=ArrayType(IntegerType()))
def getMatchingRows(filePath):
    # strip the file:/// scheme and re-read the original csv locally
    with open(filePath.replace("file:///", ""), 'r') as file:
        reader = csv.reader(file)
        # collect the (0-based) indices of the lines whose first value is #torefresh
        matchingRows = [index for index, line in enumerate(reader) if line and line[0] == "#torefresh"]
        return matchingRows

withRowNumbers = df.where("col1 == '#torefresh'")\
    .withColumn("file", input_file_name())\
    .groupBy("file")\
    .count()\
    .withColumn("rows", getMatchingRows("file"))
withRowNumbers.show()

+--------------------+-----+--------------------+
|                file|count|                rows|
+--------------------+-----+--------------------+
|file:///C:/Users/...|  119|[1, 2, 4, 5, 6, 1...|
|file:///C:/Users/...|  131|[1, 2, 3, 6, 7, 1...|
|file:///C:/Users/...|  118|[1, 2, 3, 4, 5, 7...|
|file:///C:/Users/...|  127|[1, 2, 3, 4, 5, 7...|
|file:///C:/Users/...|  125|[1, 2, 3, 5, 6, 7...|
|file:///C:/Users/...|  116|[1, 2, 3, 5, 7, 8...|
+--------------------+-----+--------------------+
But this will be very inefficient, and if you expect such rows in many files there is not much point in going through a DataFrame at all. I would suggest working on the data source and adding some kind of ID when the files are created, unless of course all you need to know is whether a file contains any such rows at all.
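
If a yes/no answer is enough, the per-file aggregation above already gives it; as a minimal sketch, an overall existence check (no assumptions beyond the col1 name already used) could look like this:

# True if at least one #torefresh row exists anywhere in the loaded data
has_torefresh = df.where("col1 == '#torefresh'").limit(1).count() > 0
print(has_torefresh)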

If, besides the first value being '#torefresh', you also need all the other values to be null, you can extend the where filter and the manual check inside the UDF, as sketched below.

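A minimal sketch of that extension, assuming the remaining columns are named col2 and col3 (placeholders; use your real schema), with the null condition added to the filter and an emptiness check added inside the UDF:

import csv
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import ArrayType, IntegerType

@udf(returnType=ArrayType(IntegerType()))
def getMatchingRows(filePath):
    # a line matches only if its first value is #torefresh and every other value is empty
    with open(filePath.replace("file:///", ""), 'r') as file:
        reader = csv.reader(file)
        return [index for index, line in enumerate(reader)
                if line and line[0] == "#torefresh" and all(v == "" for v in line[1:])]

withRowNumbers = df.where("col1 == '#torefresh' AND col2 IS NULL AND col3 IS NULL")\
    .withColumn("file", input_file_name())\
    .groupBy("file")\
    .count()\
    .withColumn("rows", getMatchingRows("file"))
withRowNumbers.show()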