
Splitting strings in a DataFrame with Scala on Spark


I have a log file with more than 100 columns. Of those I only need two, "_raw" and "_time", so I loaded the log file as a "csv" DataFrame.

Step 1:

scala> val log = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("soa_prod_diag_10_jan.csv")
log: org.apache.spark.sql.DataFrame = [ARRAffinity: string, CoordinatorNonSecureURL: string ... 126 more fields]
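
As an aside (my addition, not part of the original post): since only two of the 128 columns are needed, they can also be selected directly on the DataFrame, without registering a temp table:

scala> val twoCols = log.select("_raw", "_time")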
Step 2: I registered the DF as a temporary table:
log.createOrReplaceTempView("logs")

Step 3: I extracted the two required columns, "_raw" and "_time":

scala> val sqlDF = spark.sql("select _raw, _time from logs")
sqlDF: org.apache.spark.sql.DataFrame = [_raw: string, _time: string]

scala> sqlDF.show(1, false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|_raw                                                                                                                                                                                                                                                                                                                                                                                                |_time|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed to get remote aggregator[[|null |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
only showing top 1 row
My requirement:

I need to split the string in the "_raw" column to produce [2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b], with the column names a, b, c, d, e, f respectively.

At the same time, all null values should be removed from "_raw" and "_time".
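
For the null-removal part of the requirement, a minimal sketch (my addition, assuming the sqlDF built in Step 3): DataFrame.na.drop keeps only the rows in which all of the listed columns are non-null.

scala> val nonNullDF = sqlDF.na.drop(Seq("_raw", "_time"))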


Your answers will be much appreciated :)

You can use the split function to split _raw on spaces. That returns an array, from which you can then pull the individual values. Alternatively, you can use the regexp_extract function to extract values from the log message. Both approaches are shown below. Hope this helps.

//Creating Test Data
//(assumes spark-shell, where spark.implicits._ and org.apache.spark.sql.functions._ are pre-imported)
val df = Seq("[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed to get remote aggregator[[")
  .toDF("_raw")

//Split _raw on single spaces, then pick the first five fields out of the array by index
val splitDF = df.withColumn("split_raw_arr", split($"_raw", " "))
  .withColumn("A", $"split_raw_arr"(0))
  .withColumn("B", $"split_raw_arr"(1))
  .withColumn("C", $"split_raw_arr"(2))
  .withColumn("D", $"split_raw_arr"(3))
  .withColumn("E", $"split_raw_arr"(4))
  .drop("_raw", "split_raw_arr")

splitDF.show(false)

+-------------------------------+--------------------+-------+------------+----------------------------------+
|A                              |B                   |C      |D           |E                                 |
+-------------------------------+--------------------+-------+------------+----------------------------------+
|[2019-01-10T23:59:59.998-06:00]|[xx_yyy_zz_sss_ra10]|[ERROR]|[OSB-473003]|[oracle.osb.statistics.statistics]|
+-------------------------------+--------------------+-------+------------+----------------------------------+

Splitting on a single space is only reliable for the first five bracketed fields, because later fields such as [tid: ...] contain spaces themselves; regexp_extract sidesteps that:

//Extract the Nth bracketed field as capture group N of a pattern with N non-greedy groups;
//for f, a lookbehind/lookahead pair grabs the ecid value up to the first comma
val extractedDF = df
  .withColumn("a", regexp_extract($"_raw", "\\[(.*?)\\]",1))
  .withColumn("b", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]",2))
  .withColumn("c", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
  .withColumn("d", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",4))
  .withColumn("e", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
  .withColumn("f", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)",1))
  .drop("_raw")

extractedDF.show(false)

+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
|a                            |b                 |c    |d         |e                               |f                                            |
+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
|2019-01-10T23:59:59.998-06:00|xx_yyy_zz_sss_ra10|ERROR|OSB-473003|oracle.osb.statistics.statistics|92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b|
+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
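
A compact variant (my sketch, not part of the original answer): build one pattern with six capture groups and reuse it, instead of growing the bracket prefix for every column; regexp_extract takes the capture-group index as its third argument. The last group captures everything between "ecid: " and the first comma.

val pattern = "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\].*?ecid: ([^,]*),"

val oneShotDF = df
  .withColumn("a", regexp_extract($"_raw", pattern, 1))
  .withColumn("b", regexp_extract($"_raw", pattern, 2))
  .withColumn("c", regexp_extract($"_raw", pattern, 3))
  .withColumn("d", regexp_extract($"_raw", pattern, 4))
  .withColumn("e", regexp_extract($"_raw", pattern, 5))
  .withColumn("f", regexp_extract($"_raw", pattern, 6))
  .drop("_raw")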
Comments:

- Can you add a complete log string to the question, along with the desired output?
- Complete log string: [2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b]. Desired output: column A [2019-01-10T23:59:59.998-06:00], column B [xx_yyy_zz_sss_ra10], column C [ERROR], column D [OSB-473003], column E [oracle.osb.statistics.statistics], column F [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b]. Thank you very much... One more question: if I want to extract ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b from the string above, what index do I need to use to put it into column 'F'?
- I would ask you to use the second approach, since the data contains multiple spaces. For column F, if you only need 92b39a8b-8234-4d19-9ac7…
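
To close the loop on that last comment, a sketch (my addition, using the test df from the answer): with regexp_extract there is no array index to choose; the pattern itself targets the ecid value, exactly as column f above already does.

val ecidDF = df.withColumn("F", regexp_extract($"_raw", "(?<=ecid: )([^,]*)", 1))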