Python 3.x: How to handle a real-time log stream with dynamic structure in PySpark?
I am trying to build a Spark application to process a real-time stream of logs whose structure varies from line to line. Here is the log format:
2020-09-24S08:07:54.181Z ip1 Sep 08:07:54 region1 staus1: Deny tcp src outside-myip1/100 dst inside:mydstIP1/80
2020-09-24S08:07:55.181Z ip2 Sep 08:07:54 region2 staus2: Deny tcp src outside-myip2/101 dst inside:mydstIP2/80
2020-09-24S08:07:56.181Z ip3 Sep 08:07:54 region3 staus3: Deny tcp src outside-myip3/102 dst inside:mydstIP3/80
2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus4: other requested to drop TCP packet from outside-myip01/132 to dmz:myip02/443 by the IT Group
2020-09-24S08:07:58.181Z ip5 Sep 08:07:54 region5 staus5: Deny tcp src outside-myip4/103 dst inside:mydstIP4/80
2020-09-24S08:07:59.181Z ip6 Sep 08:07:54 region6 staus6: Deny tcp src outside-myip5/104 dst inside:mydstIP5/80
2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus04: other requested to drop TCP packet from outside-myip04/132 to dmz:myip02/443 by the IT Group
2020-09-24S08:08:00.181Z ip7 Sep 08:07:54 region7 staus7: Deny tcp src outside-myip6/105 dst inside:mydstIP6/80
I created the schema below to convert the logs above into a structured DataFrame:
from pyspark.sql.types import StructType, StructField, DateType, StringType

schemaDf = StructType([
    StructField("Date", DateType()),
    StructField("Source IP", StringType()),
    StructField("Month", StringType()),
    StructField("Time Stamp", StringType()),
    StructField("Region", StringType()),
    StructField("status", StringType()),
    StructField("Action", StringType()),
    StructField("Protocol", StringType()),
    StructField("From", StringType()),
    StructField("Source Value", StringType()),
    StructField("To", StringType()),
    StructField("Destination value", StringType()),
])
df = session.read.option("header", "true").option("delimiter", " ").csv("F:mypath\\firewall.txt", schema=schemaDf)
df.show()
Result:
+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
|      Date|Source IP|Month|Time Stamp| Region| status|Action| Protocol|From|     Source Value| To| Destination value|
+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
|2020-09-24| ip2| Sep| 08:07:54|region2| staus2:| Deny| tcp| src|outside-myip2/101|dst|inside:mydstIP2/80|
|2020-09-24| ip3| Sep| 08:07:54|region3| staus3:| Deny| tcp| src|outside-myip3/102|dst|inside:mydstIP3/80|
|2020-09-24| ip4| Sep| 08:07:54|region4| staus4:| other|requested| to| drop|TCP| packet|
|2020-09-24| ip5| Sep| 08:07:54|region5| staus5:| Deny| tcp| src|outside-myip4/103|dst|inside:mydstIP4/80|
|2020-09-24| ip6| Sep| 08:07:54|region6| staus6:| Deny| tcp| src|outside-myip5/104|dst|inside:mydstIP5/80|
|2020-09-24| ip4| Sep| 08:07:54|region4|staus04:| other|requested| to| drop|TCP| packet|
|2020-09-24| ip7| Sep| 08:07:54|region7| staus7:| Deny| tcp| src|outside-myip6/105|dst|inside:mydstIP6/80|
+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
The schema created here cannot handle the dynamic logs above, because the value in the "Action" column varies. For example, when "Action" is "other", the rest of the line is a long free-form string that does not fit the defined schema (as you can see in the rows where "requested", "to", "drop", etc. spill into the wrong columns).
So I would like to know the correct way to handle such cases.
Do I need to create a separate schema for the "other" records?
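To illustrate the kind of branching I have in mind, here is a plain-Python sketch (not Spark code; the regex patterns and field names are my own assumptions) that reads each line as raw text and picks a parser based on the message format, instead of forcing one flat CSV schema:

```python
import re

# Structured "Deny tcp" lines: every token has a fixed position.
DENY_RE = re.compile(
    r"^(?P<date>\S+) (?P<src_ip>\S+) (?P<month>\S+) (?P<time>\S+) "
    r"(?P<region>\S+) (?P<status>\S+): Deny tcp src (?P<src>\S+) dst (?P<dst>\S+)$"
)
# "other" lines: everything after the status is one free-form message.
OTHER_RE = re.compile(
    r"^(?P<date>\S+) (?P<src_ip>\S+) (?P<month>\S+) (?P<time>\S+) "
    r"(?P<region>\S+) (?P<status>\S+): (?P<message>other .*)$"
)

def parse_line(line):
    """Return a dict with a common set of keys for both formats.

    'Deny' records get src/dst fields and message=None; 'other' records
    keep the whole tail in 'message'. Unrecognized lines return None.
    """
    m = DENY_RE.match(line)
    if m:
        return {**m.groupdict(), "action": "Deny", "message": None}
    m = OTHER_RE.match(line)
    if m:
        return {**m.groupdict(), "action": "other"}
    return None

deny = parse_line("2020-09-24S08:07:54.181Z ip1 Sep 08:07:54 region1 staus1: "
                  "Deny tcp src outside-myip1/100 dst inside:mydstIP1/80")
other = parse_line("2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus4: "
                   "other requested to drop TCP packet from outside-myip01/132 "
                   "to dmz:myip02/443 by the IT Group")
print(deny["action"], deny["src"], deny["dst"])
print(other["action"], other["message"])
```

In Spark this logic could live in a UDF applied to lines read with `spark.read.text(...)`, or be expressed with `regexp_extract` per column, but I am not sure which approach is considered correct for streaming.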
Thanks in advance for any help or suggestions.