How to add the file name as a column to a Java RDD using Spark?
Using Apache Spark, we need to process a set of files and keep track of which files contain certain keywords. I am trying to create a DataFrame with two columns:
- the line from the file
- the file that contains that line

Here is what I have so far:
String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .collect(Collectors.toList())
        .toArray(new String[0]);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();
// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// How to add a field that shows the associated filename for each row?
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema);
df.show();
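One side note on the path-collecting line above: Files.walk returns a Stream backed by open directory handles, so it should be closed, typically via try-with-resources. A minimal, self-contained sketch of that pattern, using a throwaway temp directory instead of the original getLogSourceDirectory() (which is assumed to exist elsewhere in the project):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CollectLogPaths {
    public static void main(String[] args) throws IOException {
        // Throwaway directory with two files to walk over (stand-in for the log directory)
        Path dir = Files.createTempDirectory("logs");
        Files.writeString(dir.resolve("a.log"), "alpha");
        Files.writeString(dir.resolve("b.log"), "beta");

        String[] paths;
        // try-with-resources closes the directory handles that Files.walk holds open
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .toArray(String[]::new);
        }

        System.out.println(paths.length); // prints 2: the directory itself is filtered out
    }
}
```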
Can anyone help me understand how to add the name of the originating file as a second column?
Searching around turned up some suggestions, but I'm not sure how to apply them to this situation.
Thank you in advance; I'm new to Spark and appreciate any advice.

I'm not a Java expert, but in Python with Spark you can point at a whole folder or a file pattern and use something like the snippet below. If the files are on the local filesystem, prefix the path with file:. input_file_name() adds the source file name to the data:
from pyspark.sql.functions import input_file_name

df = spark.read.text('/datafolder/foldername/*')
df = df.withColumn("filename", input_file_name())
Thanks @Rafa, I think this is the answer:
String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .collect(Collectors.toList())
        .toArray(new String[0]);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();
// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// How to add a field that shows the associated filename for each row?
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();
// Below line has the additional column added
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema).withColumn("file_name", input_file_name());
df.show();
Not a Java Spark expert, but I'll take a stab: could you do something like spark.read().textFile(sourceLogPaths).withColumn("filename", input_file_name()).javaRDD()?

Appreciated, thanks! I think that helps:
Running the code above prints:
+--------------------+--------------------+
| line| file_name|
+--------------------+--------------------+
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.010000...|file:///Users/acu...|