Splitting and filtering DataFrame columns in Scala Spark

I am working with Apache Spark and have the following txt file:
05:49:56.604899 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 10202: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [.], seq 3641977583:3641987719, ack 129899328, win 58, options [nop,nop,TS val 432623 ecr 432619], length 10136
05:49:56.604908 00:00:00:00:00:03 > 00:00:00:00:00:02, ethertype IPv4 (0x0800), length 66: 10.0.0.3.5001 > 10.0.0.2.54880: Flags [.], ack 10136, win 153, options [nop,nop,TS val 432623 ecr 432623], length 0
05:49:56.604900 00:00:00:00:00:02 > 00:00:00:00:00:03, ethertype IPv4 (0x0800), length 4410: 10.0.0.2.54880 > 10.0.0.3.5001: Flags [P.], seq 10136:14480, ack 1, win 58, options [nop,nop,TS val 432623 ecr 432619], length 4344
Now I want to extract the IPs, ports, and timestamp from the file. For example, the output should look like this:
05:49: 56.604899 10.0.0.2 54880 10.0.0.3 5001
05:49: 56.604908 10.0.0.3 5001 10.0.0.2 54880
05:49: 56.604900 10.0.0.2 54880 10.0.0.3 5001
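For reference, the per-line extraction this desired output implies can be sketched in plain Scala, outside Spark. `parsePacketLine` is a hypothetical helper name, and the sketch assumes every packet line has the two ">" separators shown in the sample above:

```scala
// Split on ">" gives three parts: the timestamp is the first token of part 0,
// the source "ip.port" is the last token of part 1, and the destination
// "ip.port" is the first token of part 2 (with a trailing colon).
def parsePacketLine(line: String): Option[(String, String, String, String, String)] = {
  val parts = line.split(">")
  if (parts.length < 3) None
  else {
    val time = parts(0).trim.split(" ")(0)
    // e.g. "10.0.0.2.54880" -> ip "10.0.0.2", port "54880"
    def ipPort(token: String): (String, String) = {
      val i = token.lastIndexOf('.')
      (token.substring(0, i), token.substring(i + 1))
    }
    val srcTok = parts(1).trim.split(" ").last                // "10.0.0.2.54880"
    val dstTok = parts(2).trim.split(" ")(0).stripSuffix(":") // "10.0.0.3.5001:"
    val (sIp, sPort) = ipPort(srcTok)
    val (dIp, dPort) = ipPort(dstTok)
    Some((time, sIp, sPort, dIp, dPort))
  }
}
```

On a line without the expected separators the helper returns None rather than throwing, which is the failure mode that becomes relevant in Update 2 below.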
Here is the code I am using:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ML_Test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("saeed_test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .schema(customSchema)
      .load("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")

    val selectedData = df.select("column0", "column1", "column2")
    selectedData.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/Users/saeedtkh/Desktop/sharedsaeed/tempoutput.txt")
  }
}
However, this code does not extract the result shown above, and I also cannot apply a split function here. Can you help me modify this code to produce the result above? Please help me.
Update 1:
When I tried to execute the first answer, some symbols could not be resolved in my IDE (screenshot of the unresolved-symbol errors omitted). I added the following import and the problem was solved:
import org.apache.spark.sql.Row
Update 2:
Following answer one, when I run the code in my IDE I get an empty output folder (the process finishes with exit code 1). The error is:
17/05/24 09:45:52 ERROR Utils: Aborting task
java.lang.ArrayIndexOutOfBoundsException: 2
at ML_Test$$anonfun$2.apply(ML_Test.scala:28)
at ML_Test$$anonfun$2.apply(ML_Test.scala:25)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/24 09:45:52 ERROR DefaultWriterContainer: Task attempt attempt_201705240945_0000_m_000001_0 aborted.
17/05/24 09:45:52 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at ML_Test$$anonfun$2.apply(ML_Test.scala:28)
at ML_Test$$anonfun$2.apply(ML_Test.scala:25)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
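The ArrayIndexOutOfBoundsException: 2 at the top of this trace means that some input line, after line.split(">"), produced fewer than three parts, so array(2) does not exist. A minimal Spark-free repro of that failure mode (the sample string is illustrative):

```scala
// A line without ">" separators yields a one-element array from split
val array = "some line without separators".split(">")
assert(array.length == 1)

// Accessing array(2) then throws exactly the exception seen in the log
val threw =
  try { array(2); false }
  catch { case _: ArrayIndexOutOfBoundsException => true }
assert(threw)
```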
Here is a solution:
import scala.util.Try
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val customSchema = StructType(Array(
  StructField("column0", StringType, true),
  StructField("column1", StringType, true),
  StructField("column2", StringType, true)))

val rdd = sc.textFile("/Users/saeedtkh/Desktop/sharedsaeed/train.csv")
// Split each line on ">"; Try ... getOrElse yields "" for lines that
// do not have all the expected parts, instead of throwing
val rowRdd = rdd.map(line => line.split(">")).map(array => {
  val first  = Try(array(0).trim.split(" ")(0)) getOrElse ""
  val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
  val third  = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
  Row.fromSeq(Seq(first, second, third))
})

val dataFrame = sqlContext.createDataFrame(rowRdd, customSchema)
val selectedData = dataFrame.select("column0", "column1", "column2")

import org.apache.spark.sql.SaveMode
selectedData.write
  .mode(SaveMode.Overwrite) // overwrites, so the output directory need not be deleted between runs
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/Users/saeedtkh/Desktop/sharedsaeed/tempoutput.txt")
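The Try ... getOrElse pattern above is what lets the job survive malformed lines: the failing array access produces "" instead of aborting the task with ArrayIndexOutOfBoundsException. A Spark-free illustration of the guard (the sample string is illustrative):

```scala
import scala.util.Try

// A line with no ">" separator: split yields a single part
val array = "raw packet dump".split(">")

val first  = Try(array(0).trim.split(" ")(0)) getOrElse ""  // succeeds -> "raw"
val second = Try(array(1).trim.split(" ")(6)) getOrElse ""  // array(1) fails -> ""

assert(first == "raw")
assert(second == "")
```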
I think this is what you need. It may not be the most efficient solution, but it works.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val customSchema = StructType(
  Array(StructField("column0", StringType, true),
        StructField("column1", StringType, true),
        StructField("column2", StringType, true)))

val data = spark.read
  .schema(schema = customSchema)
  .csv("tempoutput.txt")

// split each column on spaces, then pick the wanted token by index
data
  .withColumn("column0", split($"column0", " "))
  .withColumn("column1", split($"column1", " "))
  .withColumn("column2", split($"column2", " "))
  .select(
    $"column0".getItem(0).as("column0"),
    $"column1".getItem(3).as("column1"),
    $"column2".getItem(5).as("column2")
  )
  .show()
Thanks for your answer. However, when I tried to execute your code, some errors appeared. I just updated the question and put the errors there. I think this is because of libraries and dependencies, right?
You must import the Row library: import org.apache.spark.sql.Row
Could you post all the error lines? Between the read and the write, did you try inspecting the RDD or DataFrame, for example with RDD.foreach(println) or DataFrame.show(false)? The code runs successfully on my machine; all I did was change the input and output paths.
In that case you should use Try ... getOrElse. Your input data is not uniform, which is why you got the ArrayIndexOutOfBoundsException. I have updated my answer, please try it. I updated it again as well: the output is now written in overwrite mode, so you no longer need to delete the output directory before each run. Please check the update, and let me know if there is still a problem.
Thanks, but the IDE flags an error at $. Is there a special library to add?
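Regarding that last comment: the $"..." syntax is not part of the Scala language itself. In Spark it is supplied by the StringToColumn implicit that import spark.implicits._ brings into scope, which is why the IDE flags $ when that import is missing. A Spark-free model of the mechanism (DollarInterpolator is an illustrative stand-in, not Spark's actual class):

```scala
// Stand-in for Spark's implicit StringContext extension; Spark's version
// returns a ColumnName, whereas this sketch just returns the string itself
implicit class DollarInterpolator(val sc: StringContext) {
  def $(args: Any*): String = sc.s(args: _*)
}

// $"column0" desugars to StringContext("column0").$(), which only compiles
// while an implicit like the one above is in scope
val colName = $"column0"
assert(colName == "column0")
```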