How do I add the file name as a column to a Java RDD using Spark?


Using Apache Spark, we need to process a set of files and keep track of which files contain particular keywords.

I am trying to create a DataFrame with two columns:

  • the line from the file
  • the file that contains that line
Here is what I have so far:

String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .toArray(String[]::new);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();

// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// How to add a field that shows the associated filename for each row?
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true)); 
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema);

df.show();
Can someone help me understand how to add the name of the original file as a second column?

Searching for advice turned up some suggestions, but I am not sure how to translate them to this situation.


Thanks in advance. I am new to Spark and really appreciate any advice.

I am not a Java expert, but in Python with Spark you can supply a whole folder or a file pattern and use something like the code below. If the files are on the local file system, prefix the path with file:. input_file_name() will add the file name to the data:

from pyspark.sql.functions import input_file_name

df = spark.read.text('/datafolder/foldername/*')
df = df.withColumn("filename", input_file_name())

Thanks @Rafa, I think this is the answer:

String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .toArray(String[]::new);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();

// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// input_file_name() below supplies the associated filename for each row
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true)); 
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();

// input_file_name() is statically imported from org.apache.spark.sql.functions;
// it adds the source file of each row as an extra column
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema)
        .withColumn("file_name", input_file_name());

df.show();
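
With the file_name column in place, the original goal of tracking which files contain a keyword is one filter away. Here is a minimal sketch, assuming the df built above; the keyword "ERROR" is a placeholder, and col is a static import from org.apache.spark.sql.functions:

import static org.apache.spark.sql.functions.col;

// Hypothetical keyword; substitute the term you are searching for.
String keyword = "ERROR";

// Keep only the lines containing the keyword,
// then list each matching file once.
Dataset<Row> filesWithKeyword = df
        .filter(col("line").contains(keyword))
        .select("file_name")
        .distinct();

filesWithKeyword.show(false);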


Not a Java Spark expert, but I wanted to give it a try. Could you do something like
spark.read().textFile(sourceLogPaths).withColumn("filename", input_file_name()).javaRDD()?

I appreciate it! Thanks, I think this helps.
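
Here is a minimal sketch of that suggestion, reusing the spark session and sourceLogPaths array from the code above: adding the column while the data is still a Dataset avoids building a schema by hand, and the rows already carry both fields if you later convert to an RDD.

import static org.apache.spark.sql.functions.input_file_name;

// Read the files as a Dataset<String> (one column named "value"),
// then attach the source file of each row before dropping down to an RDD.
Dataset<Row> linesWithFiles = spark.read()
        .textFile(sourceLogPaths)
        .withColumn("filename", input_file_name());

linesWithFiles.show();

// Only convert if the rest of the pipeline needs an RDD:
JavaRDD<Row> rowRDD = linesWithFiles.javaRDD();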
Running the accepted answer's code produces output like:

+--------------------+--------------------+
|                line|           file_name|
+--------------------+--------------------+
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.010000...|file:///Users/acu...|