
Apache Spark flatMap creating multiple rows in a DataFrame

Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql


Based on a previous SO answer, it seems possible to "map and filter out the error cases" in a single operation.
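The idea is that flatMap's callback returns a sequence: a one-element sequence keeps a record and an empty sequence drops it, so parsing and error-filtering collapse into one pass. A minimal, self-contained sketch (the toy parse_or_skip function below is illustrative, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# flatMap flattens whatever the callback returns:
# a one-element list keeps the record, an empty list drops it.
def parse_or_skip(line):
    try:
        return [int(line)]   # parse succeeded: emit one record
    except ValueError:
        return []            # parse failed: emit nothing

rdd = spark.sparkContext.parallelize(["1", "oops", "2"])
print(rdd.flatMap(parse_or_skip).collect())   # [1, 2]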

Given this sample data:

spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log").show(4, False)

+---------------------------------------------------------------------------------------------+
|value                                                                                        |
+---------------------------------------------------------------------------------------------+
|2019-04-17 07:54:51.505: 2019-04-17 10:54:51 INFO [main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:52.271: 2019-04-17 10:54:52 INFO [app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720: 2019-04-17 11:05:10 INFO [app.py:166] Exiting event loop...         |
|2019-04-17 08:05:10.720: <_WindowsSelectorEventLoop running=False closed=False debug=False>  |
+---------------------------------------------------------------------------------------------+
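(Note: spark.read.text produces a single string column named value, one row per input line, which is why the parser below reads row.value.)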
The expected result is:

+-----------------------+-------------------+---------+-------------------------------------------+
|os_ts                  |log_ts             |log_level|message                                    |
+-----------------------+-------------------+---------+-------------------------------------------+
|2019-04-17 07:54:51.505|2019-04-17 10:54:51|INFO     |[main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:52.271|2019-04-17 10:54:52|INFO     |[app.py:93] Running web server on port 9090|
|2019-04-17 08:05:10.720|2019-04-17 11:05:10|INFO     |[app.py:166] Exiting event loop...         |
+-----------------------+-------------------+---------+-------------------------------------------+
The actual result was:

genee3_python_logs_text = spark.read.text("/mnt/seedx-ops-prod/genee-local-datasync/genee-3/genee/logs/genee_python-20190417T075453.005.log")

clean_genee3_python_logs = genee3_python_logs_text.rdd.flatMap(parseTheNonSuckingDaemonPythonLogs)

from pyspark.sql import Row

row = Row("val")
genee3_python_logs_df = clean_genee3_python_logs.map(row).toDF()
genee3_python_logs_df.select('*').show(truncate=False)

+-------------------------------------------+
|val                                        |
+-------------------------------------------+
|INFO                                       |
|2019-04-17 10:54:51                        |
|[main.py:64] Read machine_conf.ini         |
|2019-04-17 07:54:51.505                    |
|INFO                                       |
|2019-04-17 10:54:52                        |
|[app.py:93] Running web server on port 9090|
|2019-04-17 07:54:52.271                    |
|INFO                                       |
|2019-04-17 11:05:10                        |
|[app.py:166] Exiting event loop...         |
|2019-04-17 08:05:10.720                    |
+-------------------------------------------+
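The parser itself did not survive in the question as captured here, but the flattened output makes the cause clear: a pyspark Row subclasses tuple, so if the parser returns a bare Row, flatMap iterates over it and emits each of its four field values as a separate record. A hypothetical reconstruction of the original parser, under that assumption:

import re
from pyspark.sql import Row

# Hypothetical reconstruction -- the original parser was not shown above.
# It returns a bare Row; flatMap iterates over that Row's four field
# values and emits each one as its own record, producing the output above.
def parseTheNonSuckingDaemonPythonLogs(row):
  parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
  return Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3])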

I think I've managed to get it working, but I'm still not sure which functional-transformation semantics make it work.

Wrapping the Row into another Row in the parsing logic:
import re
from pyspark.sql import Row

def parseTheNonSuckingDaemonPythonLogs(row):
  try:
    # Note the escaped dot (\.) before the milliseconds, which the original pattern left unescaped.
    parts = re.findall(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,3}): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ([A-Za-z]{1,5}) (.*)', row.value)[0]
    # Wrap the parsed Row in an outer Row so flatMap emits it as a single
    # element instead of iterating over its four fields.
    return Row(Row(os_ts=parts[0], log_ts=parts[1], log_level=parts[2], message=parts[3]))
  except IndexError:
    # No regex match: an empty Row iterates to nothing, so flatMap drops the line.
    return Row()
Removing the Row adaptation in the DataFrame declaration:

genee3_python_logs_df = clean_genee3_python_logs.toDF()
genee3_python_logs_df.show(truncate=False)

Result:

+---------+-------------------+-------------------------------------------+-----------------------+
|log_level|log_ts             |message                                    |os_ts                  |
+---------+-------------------+-------------------------------------------+-----------------------+
|INFO     |2019-04-17 10:54:51|[main.py:64] Read machine_conf.ini         |2019-04-17 07:54:51.505|
|INFO     |2019-04-17 10:54:52|[app.py:93] Running web server on port 9090|2019-04-17 07:54:52.271|
|INFO     |2019-04-17 11:05:10|[app.py:166] Exiting event loop...         |2019-04-17 08:05:10.720|
+---------+-------------------+-------------------------------------------+-----------------------+
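Why the wrapping works: a pyspark Row subclasses tuple, so flatMap iterates over whatever the parser returns. A quick local check of the three cases (no Spark job needed):

from pyspark.sql import Row

inner = Row(os_ts='t1', log_ts='t2', log_level='INFO', message='m')
print(list(inner))       # four field values -> flatMap would emit four records
print(list(Row(inner)))  # [Row(...)] -> one record: the whole inner Row
print(list(Row()))       # [] -> the error case disappears from the output

This also explains the column order in the final table: Spark 2.x sorts a Row's keyword fields alphabetically (log_level, log_ts, message, os_ts), so a closing select('os_ts', 'log_ts', 'log_level', 'message') would restore the order shown in the expected result.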