Regex on a Spark RDD[String] with a multi-line regular expression

I am trying to parse a log file in Spark 1.6 using Scala. Here is the sample data:

2017-02-04 04:48:11,123 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:48:20,892 INFO [org.jasig.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - <Audit trail record BEGIN
=============================================================
WHO: audit:unknown
WHAT: TGT-7d937-yRqp6ObM7JOtkUZ7Ff4yEo95-casino1.example.org
ACTION: TICKET_GRANTING_TICKET_DESTROYED
APPLICATION: CASINO
WHEN: Sat Feb 04 04:48:20 AEDT 2017
CLIENT IP ADDRESS: 160.50.201.557
SERVER IP ADDRESS: login.cfu.asg
=============================================================

>
2017-02-04 04:48:32,165 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:48:32,167 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:48:38,889 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 1 triggers>
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.JobRunShell] - <Calling execute on job DEFAULT.serviceRegistryReloaderJobDetail>
2017-02-04 04:48:52,790 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - <Adding registered service ^(https?|imaps?)://.*>
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - <Adding registered service
2017-02-04 04:48:52,792 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:49:14,365 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:49:14,366 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:49:19,699 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:49:43,465 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - <JaasAuthenticationHandler successfully authenticated >
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - <Authenticated 3785973 with credentials.>
2017-02-04 04:50:00,978 INFO [org.jasig.inspektr.nhgij.support.Slf4jLogggbhAuditTrailManaver] - <Audit trail record BEGIN
=============================================================
WHO: z3705z73
WHAT: supplied credentials: [d37c5973]
ACTION: AUTHENTICATION_SUCCESS
APPLICATION: casinoINO
WHEN: Sat Feb 04 04:50:00 AEDT 2017
CLIENT IP ADDRESS: 101.181.28.555
SERVER IP ADDRESS: login.cfu.asg
=============================================================

>

How can we read multi-line input and feed it to a regex?

I have fixed and improved your regex; it should now work on the multi-line audit records at the end of your log.

The regex is the following beast:

(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+<[^>]*\s\=*\s+WHO\:\s+([^>\n]*)\s+WHAT\:\s+([^>\n]*)\s+ACTION\:\s+([^>\n]*)\s+APPLICATION\:\s+([^>\n]*)\s+WHEN\:\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+\=*\s\s>
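Substituting each match with the replacement pattern listed further below yields one delimited line per audit record.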

Last but not least, you may have to change the heap size: --executor-memory 10g

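Have you tried increasing the heap size? Are you OK with my answer? I hope it helps you!!!

Thanks for the improved regex. I tried --executor-memory 10g, but it still throws java.lang.OutOfMemoryError: GC overhead limit exceeded.

For some reason the garbage collector is taking more than 98% of the process's CPU time while recovering less than 2% of the heap on each collection. In effect, your program stops making any progress and is busy running garbage collection all the time. To prevent the application from soaking up CPU time without getting anything done, the JVM throws this error so that you have a chance to diagnose the problem. It typically happens in code that creates large numbers of temporary objects in an environment where memory is already very constrained.

@Allan Setting the heap size to 10G did not resolve the OutOfMemoryError, and I could not rule out other possible causes. I used a workaround to keep processing: I call sc.textFile to read the large input files, filter them, save the result to a temporary location, and then read that back with sc.wholeTextFiles. The temporary files are less than half the size of the originals, so no OutOfMemoryError is thrown.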
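The workaround in the last comment could look roughly like the sketch below. It is not the commenter's actual code: the object name AuditLogJob, the two path arguments, and especially the filter are assumptions made for illustration (here the filter drops every record that opens and closes on a single line, which is only one plausible way to shrink the input). The pattern is the one from the answer, unchanged.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AuditLogJob {
  // Pattern from the answer, copied unchanged.
  val auditPattern = """(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+<[^>]*\s\=*\s+WHO\:\s+([^>\n]*)\s+WHAT\:\s+([^>\n]*)\s+ACTION\:\s+([^>\n]*)\s+APPLICATION\:\s+([^>\n]*)\s+WHEN\:\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+\=*\s\s>""".r

  // Assumption: a line that starts with a timestamp and ends with '>' is a
  // complete single-line record and can be dropped before the regex pass.
  private val timestampPrefix = """^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d{3}\s""".r

  def isSingleLineRecord(line: String): Boolean =
    timestampPrefix.findFirstIn(line).isDefined && line.trim.endsWith(">")

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: the raw log location and a scratch directory.
    val Array(inputPath, tempPath) = args
    val sc = new SparkContext(new SparkConf().setAppName("audit-log-parse"))

    // Step 1: read line by line, keep only lines that may belong to a
    // multi-line record, and write the smaller data set to the temp location.
    sc.textFile(inputPath)
      .filter(line => !isSingleLineRecord(line))
      .saveAsTextFile(tempPath)

    // Step 2: read each temp file as one string so the regex can span lines.
    val records = sc.wholeTextFiles(tempPath + "/part-*").flatMap {
      case (_, content) =>
        auditPattern.findAllMatchIn(content)
          .map(m => (1 to 12).map(i => m.group(i)).mkString(" | "))
    }

    records.take(5).foreach(println)
    sc.stop()
  }
}
```

One caveat with this sketch: sc.textFile splits the input on line boundaries, so a multi-line record can end up divided between two part files, and the regex in step 2 will not see such a record as a whole. Whether that loss matters depends on the data.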
The replacement pattern used with the regex above:

\1 | \2 | \3 | WHO:\4 | WHAT: \5 | ACTION: \6 | APPLICATION: \7 | WHEN: \8 | \9  $10 | $11  $12
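To try the pattern outside Spark, a minimal plain-Scala sketch follows. It runs the regex over a short excerpt of the sample log held in one string and pulls the capture groups into a case class. The names AuditRecordDemo, AuditRecord, parse and toDelimited are made up for this illustration. Note that in Scala and Java replacement strings, groups are written $1, $2 and so on rather than \1, so the sketch builds the delimited line from the groups directly instead of reusing the replacement pattern verbatim.

```scala
import scala.util.matching.Regex

object AuditRecordDemo {
  // Pattern from the answer, copied unchanged.
  val auditPattern: Regex = """(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+<[^>]*\s\=*\s+WHO\:\s+([^>\n]*)\s+WHAT\:\s+([^>\n]*)\s+ACTION\:\s+([^>\n]*)\s+APPLICATION\:\s+([^>\n]*)\s+WHEN\:\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+\=*\s\s>""".r

  // Excerpt of the sample data, joined with '\n' into a single string.
  val sample: String = Seq(
    "2017-02-04 04:48:11,123 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>",
    "2017-02-04 04:48:20,892 INFO [org.jasig.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - <Audit trail record BEGIN",
    "=============================================================",
    "WHO: audit:unknown",
    "WHAT: TGT-7d937-yRqp6ObM7JOtkUZ7Ff4yEo95-casino1.example.org",
    "ACTION: TICKET_GRANTING_TICKET_DESTROYED",
    "APPLICATION: CASINO",
    "WHEN: Sat Feb 04 04:48:20 AEDT 2017",
    "CLIENT IP ADDRESS: 160.50.201.557",
    "SERVER IP ADDRESS: login.cfu.asg",
    "=============================================================",
    "",
    ">",
    "2017-02-04 04:48:32,165 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - <Reloading registered services.>"
  ).mkString("\n")

  // Groups 9 and 11 hold the "CLIENT IP ADDRESS:" / "SERVER IP ADDRESS:"
  // labels, so the address values are groups 10 and 12.
  case class AuditRecord(timestamp: String, level: String, logger: String,
                         who: String, what: String, action: String,
                         application: String, when: String,
                         clientIp: String, serverIp: String) {
    def toDelimited: String =
      Seq(timestamp, level, logger, who, what, action, application, when,
          clientIp, serverIp).mkString(" | ")
  }

  def parse(text: String): List[AuditRecord] =
    auditPattern.findAllMatchIn(text).map { m =>
      AuditRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5),
                  m.group(6), m.group(7), m.group(8), m.group(10), m.group(12))
    }.toList

  def main(args: Array[String]): Unit =
    parse(sample).foreach(r => println(r.toDelimited))
}
```

The pattern expects plain \n line endings and exactly two whitespace characters between the closing row of = signs and the final >, so input with \r\n endings or trailing spaces would need to be normalised first.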