How to process unstructured data in a text file with Python
I have a file in the following format:
# Jon Doe
# 27212000-C
# Calorina, 06/03 1993
# South Calorina Jaka Km 1
# Num 009.006
# Calorina. 11710, Tp.108437347343
# joe.st'a gmail.com
# 20-09-2016 Akn
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend, Soeprapto Gang Siaga
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
# Jenny Doe
# 5641141 2/E.15263
# Zimbabwe, 05/06/1993
# Mujair Street Iv No.185
# Mujair, 15116. Tp.04545454
# jenny@gmail.com
# 22-09-2016/T Info
# Igor Kart
# 36412777/E,15264
# Kongo, 30/10/1994
# Kp. Pintu Air Kel. Pabuaran Kec.Boj
# onggede Kab.Bogor RT 04/09
# Bogor, 16320. Tp,107262626
# igor.@gmail.com
# 22-09-2016T Info
How can I get well-structured data out of this?
I would like a result like the following, saved as good_format.csv:
Name | Code | Bday | Address | Phone | Email | Info
Jon Doe | 27212000-C | Calorina, 06/03 1993 | South Calorina Jaka Km 1 Num 009.006 Calorina. 11710 | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn
Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116. | 04545454 | jenny@gmail.com | 22-09-2016/T Info
Igor Kart | 36412777/E,15264 | Kongo, 30/10/1994 | Kp. Pintu Air Kel. Pabuaran Kec.Bojonggede Kab.Bogor RT 04/09 Bogor, 16320. | 107262626 | igor.@gmail.com | 22-09-2016T Info
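Records with a bad format should be written to log.txt.
I need the malformed records so that I can fix them and process them again, for example: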
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend,
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
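One way to separate usable records from malformed ones is sketched below. It rests on assumptions inferred only from the sample: every record ends with an Info line that starts with a `dd-mm-yyyy` date, a complete record starts with a name line that contains no digits, and the email line contains `@` or `gmail`. The helper names `split_records`, `looks_complete` and `write_log` are hypothetical.

```python
import re

# Assumption: the last line of each record starts with a dd-mm-yyyy date.
RECORD_END = re.compile(r"^\d{2}-\d{2}-\d{4}")

def split_records(text):
    """Group the '#'-prefixed lines into records, one list of lines per record."""
    records, current = [], []
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()
        if not line:
            continue
        current.append(line)
        if RECORD_END.match(line):
            records.append(current)
            current = []
    if current:  # trailing lines that never reached a closing Info line
        records.append(current)
    return records

def looks_complete(record):
    """Heuristic check: name line without digits first, plus an email-like line."""
    has_name = not any(ch.isdigit() for ch in record[0])
    has_email = any("@" in line or "gmail" in line for line in record)
    return has_name and has_email

def write_log(records, path="log.txt"):
    """Write malformed records to a log file, separated by blank lines."""
    with open(path, "w", encoding="utf-8") as log:
        for record in records:
            log.write("\n".join(record) + "\n\n")

sample = """\
# Jon Doe
# 27212000-C
# Calorina, 06/03 1993
# South Calorina Jaka Km 1
# Num 009.006
# Calorina. 11710, Tp.108437347343
# joe.st'a gmail.com
# 20-09-2016 Akn
# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend,
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
"""

records = split_records(sample)
good = [r for r in records if looks_complete(r)]
bad = [r for r in records if not looks_complete(r)]
```

Calling `write_log(bad)` would then put the record that starts with `36412506/E.15262` into log.txt, while `good` holds the records that can be parsed further. The completeness check is deliberately crude and would need tightening for real data.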
Output:
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
| | Name | Code | Bday | Address | Phone | Email | Info |
|----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------|
| 0 | Jon Doe | 27212000-C | Calorina, 06/03 1993 | South Calorina Jaka Km 1 Num 009.006 Calorina. 11710, | 108437347343 | joe.st'a gmail.com | 20-09-2016 Akn |
| 2 | Jenny Doe | 5641141 2/E.15263 | Zimbabwe, 05/06/1993 | Mujair Street Iv No.185 Mujair, 15116. | 04545454 | jenny@gmail.com | 22-09-2016/T Info |
| 3 | Igor Kart | 36412777/E,15264 | Kongo, 30/10/1994 | Kp. Pintu Air Kel. Pabuaran Kec.Boj onggede Kab.Bogor RT 04/09 Bogor, 16320. | 107262626 | igor.@gmail.com | 22-09-2016T Info |
+----+-----------+-------------------+----------------------+------------------------------------------------------------------------------+--------------+--------------------+-------------------+
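The only approach I can think of is to use regular expressions to match a pattern for each column.
Thanks, my friend. Could you give me some sample code?
@Hendra We are not here to do your work for you. Please at least make an attempt yourself.
Hi @mck, sorry to bother you with my limited skills, and sorry for not including what I tried. Thanks for the suggestion, my friend.
No problem!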
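The regex idea from the comments could look roughly like the sketch below. This is not the code that produced the table above. The patterns are assumptions inferred from the sample records: the code line starts with a digit, the phone number follows `Tp.` or `Tp,` on the last address line, the email line contains `@` or `gmail`, and the Info line starts with a `dd-mm-yyyy` date. `parse_record` is a hypothetical helper that takes one record as a list of lines with the leading `#` already stripped.

```python
import re

# Assumed patterns, inferred from the sample records only.
CODE = re.compile(r"^\d")                   # code line starts with a digit
PHONE = re.compile(r"Tp\s*[.,]\s*(\d*)")    # "Tp." or "Tp," followed by digits
EMAIL = re.compile(r"@|gmail")              # loose: some sample emails lack "@"
INFO = re.compile(r"^\d{2}-\d{2}-\d{4}")    # closing line starts with dd-mm-yyyy

def parse_record(lines):
    """Return a dict of fields, or None when the record does not fit the layout."""
    if len(lines) < 6 or CODE.match(lines[0]) or not CODE.match(lines[1]):
        return None
    if not INFO.match(lines[-1]) or not EMAIL.search(lines[-2]):
        return None
    address_lines = lines[3:-2]  # everything between Bday and Email
    phone = PHONE.search(address_lines[-1])
    if phone is None or not phone.group(1):
        return None
    # Text before "Tp" on the last address line still belongs to the address.
    address_lines[-1] = address_lines[-1][:phone.start()].rstrip(" ,")
    return {
        "Name": lines[0],
        "Code": lines[1],
        "Bday": lines[2],
        "Address": " ".join(address_lines),
        "Phone": phone.group(1),
        "Email": lines[-2],
        "Info": lines[-1],
    }

jon = [
    "Jon Doe", "27212000-C", "Calorina, 06/03 1993",
    "South Calorina Jaka Km 1", "Num 009.006",
    "Calorina. 11710, Tp.108437347343", "joe.st'a gmail.com", "20-09-2016 Akn",
]
malformed = [
    "36412506/E.15262", "Jakarta, 13/10/1994", "II, Let.jend,",
    "V RT 005/03", "Jakarta, 10640. Tp.", "22-09-2016/T Info",
]

row = parse_record(jon)
rejected = parse_record(malformed)  # None: no name line, no email, no phone
```

The dicts returned for valid records can be collected in a list and passed to `pandas.DataFrame(rows)`, then saved with `to_csv`; records for which `parse_record` returns `None` are the ones to log. Joining address lines with a space cannot repair a word that was split across lines, such as `Kec.Boj` / `onggede`, so those still need manual correction.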