Python 3.x: How to process a real-time log stream with dynamic types in PySpark?

Tags: python-3.x, pyspark, apache-spark-sql, spark-streaming

I am trying to build a Spark application that processes real-time streaming logs whose structure is dynamic. Here is the log structure:

2020-09-24S08:07:54.181Z ip1 Sep 08:07:54 region1 staus1: Deny tcp src outside-myip1/100 dst inside:mydstIP1/80
2020-09-24S08:07:55.181Z ip2 Sep 08:07:54 region2 staus2: Deny tcp src outside-myip2/101 dst inside:mydstIP2/80
2020-09-24S08:07:56.181Z ip3 Sep 08:07:54 region3 staus3: Deny tcp src outside-myip3/102 dst inside:mydstIP3/80
2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus4: other requested to drop TCP packet from outside-myip01/132 to dmz:myip02/443 by the IT Group
2020-09-24S08:07:58.181Z ip5 Sep 08:07:54 region5 staus5: Deny tcp src outside-myip4/103 dst inside:mydstIP4/80
2020-09-24S08:07:59.181Z ip6 Sep 08:07:54 region6 staus6: Deny tcp src outside-myip5/104 dst inside:mydstIP5/80
2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus04: other requested to drop TCP packet from outside-myip04/132 to dmz:myip02/443 by the IT Group
2020-09-24S08:08:00.181Z ip7 Sep 08:07:54 region7 staus7: Deny tcp src outside-myip6/105 dst inside:mydstIP6/80
I created the schema below to convert the logs above into a structured DataFrame:

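from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Assumption: the original post does not show how `session` was created.
session = SparkSession.builder.getOrCreate()

# One column per space-separated token of a "Deny" log line.
schemaDf = StructType([
        StructField(" Date", DateType()),
        StructField("Source IP", StringType()),
        StructField("Month", StringType()),
        StructField("Time Stamp", StringType()),
        StructField("Region", StringType()),
        StructField("status", StringType()),
        StructField("Action", StringType()),
        StructField("Protocol", StringType()),
        StructField("From", StringType()),
        StructField("Source Value", StringType()),
        StructField("To", StringType()),
        StructField("Destincation value", StringType()),
    ])

# Note: header="true" treats the first log line as a header row,
# which is why the ip1 line is missing from the result below.
df = session.read.option("header", "true").option("delimiter", " ").csv("F:mypath\\firewall.txt", schema=schemaDf)
df.show()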
Result:


+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
|      Date|Source IP|Month|Time Stamp| Region|  status|Action| Protocol|From|     Source Value| To|Destincation value|
+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
|2020-09-24|      ip2|  Sep|  08:07:54|region2| staus2:|  Deny|      tcp| src|outside-myip2/101|dst|inside:mydstIP2/80|
|2020-09-24|      ip3|  Sep|  08:07:54|region3| staus3:|  Deny|      tcp| src|outside-myip3/102|dst|inside:mydstIP3/80|
|2020-09-24|      ip4|  Sep|  08:07:54|region4| staus4:| other|requested|  to|             drop|TCP|            packet|
|2020-09-24|      ip5|  Sep|  08:07:54|region5| staus5:|  Deny|      tcp| src|outside-myip4/103|dst|inside:mydstIP4/80|
|2020-09-24|      ip6|  Sep|  08:07:54|region6| staus6:|  Deny|      tcp| src|outside-myip5/104|dst|inside:mydstIP5/80|
|2020-09-24|      ip4|  Sep|  08:07:54|region4|staus04:| other|requested|  to|             drop|TCP|            packet|
|2020-09-24|      ip7|  Sep|  08:07:54|region7| staus7:|  Deny|      tcp| src|outside-myip6/105|dst|inside:mydstIP6/80|
+----------+---------+-----+----------+-------+--------+------+---------+----+-----------------+---+------------------+
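The schema created here is not suited to the dynamic logs above, because what follows the "Action" column varies from line to line. For example, when "Action" is "other", the rest of the line is a long free-form string that does not fit the defined schema.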
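
A quick plain-Python check makes the mismatch concrete. This is only an illustrative sketch and not part of the Spark job; the variable names (`deny_line`, `other_line`, `SCHEMA_COLUMNS`) are made up for the example. It splits one "Deny" line and one "other" line from the sample on single spaces, the same delimiter given to the CSV reader, and compares the token counts with the 12 columns of the schema.

```python
# Two lines copied from the sample log above.
deny_line = ("2020-09-24S08:07:54.181Z ip1 Sep 08:07:54 region1 staus1: "
             "Deny tcp src outside-myip1/100 dst inside:mydstIP1/80")
other_line = ("2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus4: "
              "other requested to drop TCP packet from outside-myip01/132 "
              "to dmz:myip02/443 by the IT Group")

# The schema in the question declares 12 columns.
SCHEMA_COLUMNS = 12

deny_tokens = deny_line.split(" ")
other_tokens = other_line.split(" ")

print(len(deny_tokens))   # 12 tokens: one per schema column
print(len(other_tokens))  # 20 tokens: more than the schema has columns

# Tokens of the "other" line that have no column to land in.
dropped = other_tokens[SCHEMA_COLUMNS:]
print(dropped)
```

The "Deny" line has exactly one token per column, while the "other" line has 20 tokens, so only the first 12 end up in columns, as the result table shows, and the source and destination addresses are lost.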

So I would like to know the right way to handle cases like this.

Do I need to create a new schema for the "other" values?

Thanks for your help.


Thanks

Can anyone offer some advice?
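
One possible direction, offered as a sketch under assumptions rather than a definitive answer: instead of splitting every line on spaces, parse only the fixed prefix (timestamp, source IP, month, time, region, status, action) with a regular expression and keep everything after the action as a single free-form `message` field. Lines whose remainder has the `<protocol> src <source> dst <destination>` shape can then be parsed further, while "other" lines simply keep the message. The names `PREFIX`, `DENY_DETAIL`, `parse_line` and the output field names are hypothetical, and the patterns are derived only from the eight sample lines in the question.

```python
import re
from typing import Optional

# Fixed prefix shared by every sample line:
# timestamp, source IP, month, time, region, "status:", action, then the rest.
PREFIX = re.compile(
    r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+):\s+(\S+)\s+(.*)$"
)

# Shape of the remainder on the "Deny" sample lines:
# "<protocol> src <source> dst <destination>".
DENY_DETAIL = re.compile(r"^(\S+)\s+src\s+(\S+)\s+dst\s+(\S+)$")


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line; return None if even the prefix does not match."""
    match = PREFIX.match(line.strip())
    if match is None:
        return None
    timestamp, source_ip, month, time_, region, status, action, rest = match.groups()
    record = {
        "timestamp": timestamp,
        "source_ip": source_ip,
        "month": month,
        "time": time_,
        "region": region,
        "status": status,
        "action": action,
        "message": rest,       # always kept, whatever the action is
        "protocol": None,      # filled in only when the remainder is structured
        "source": None,
        "destination": None,
    }
    detail = DENY_DETAIL.match(rest)
    if detail is not None:
        record["protocol"], record["source"], record["destination"] = detail.groups()
    return record


deny = parse_line(
    "2020-09-24S08:07:54.181Z ip1 Sep 08:07:54 region1 staus1: "
    "Deny tcp src outside-myip1/100 dst inside:mydstIP1/80"
)
other = parse_line(
    "2020-09-24S08:07:57.181Z ip4 Sep 08:07:54 region4 staus4: "
    "other requested to drop TCP packet from outside-myip01/132 "
    "to dmz:myip02/443 by the IT Group"
)
print(deny["action"], deny["source"], deny["destination"])
print(other["action"], other["message"])
```

With this layout a single schema can cover both kinds of line: the structured columns are simply null for "other" rows, so a second schema may not be needed. In PySpark the same idea would presumably start from `session.read.text(...)`, which yields one string column named `value`, and apply `regexp_extract` from `pyspark.sql.functions` with the prefix pattern; that part is not shown or tested here.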