Python: find the line containing a specific string and read the text file after that line

python regex pandas

I have a text file (~20 MB) from which I want to extract some information. The information I am interested in looks like this:

   Generate :
 MESH :     Cartesian
   1.00000   0.00000   0.00000
   0.00000   0.84680   0.00000
   0.00000   0.00000   0.80724
 MESH : 4 unique points
               x           y           z        Weight
    1      0.000000    0.000000    0.000000     0.3906
    2      0.125000    0.000000    0.000000     0.7812
    3      0.250000    0.000000    0.000000     0.7812
    4      0.375000    0.000000    0.000000     0.7812

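I want to save the x, y, z columns that follow the second occurrence of the string "MESH" into arrays. I tried using regex, but my solution stores the result as a list, which makes it too complicated to access the values later. Here is my attempt: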

import re

line_number = 0
mesh_list = []
Qp = []
with open('out.test','r') as f:
    for line in f:
        line_number +=1
        if 'MESH' in line:
            mesh_list.append([(line_number),line.rstrip()])

point_info = mesh_list[1]
output_line = point_info[0]             ## Line number where MESH appears the second time.
point_list = point_info[1].split()
num_of_points = int(point_list[1])      ## Get number of unique points.

with open('out.test','r') as f:
    for i, line in enumerate(f):
        if output_line+1 <= i <= output_line+num_of_points:
            Qp.append([line])

print(Qp)

You can use pd.read_csv with custom skiprows= and sep= parameters:
import re
import pandas as pd

r = re.compile(r"MESH : \d+ unique points")

line_counter = 0
with open("your_file.txt", "r") as f_in:
    for l in f_in:
        line_counter += 1
        if r.search(l):
            break

df = pd.read_csv("your_file.txt", skiprows=line_counter, sep=r"\s+")
print(df)

Prints:
       x    y    z  Weight
1  0.000  0.0  0.0  0.3906
2  0.125  0.0  0.0  0.7812
3  0.250  0.0  0.0  0.7812
4  0.375  0.0  0.0  0.7812
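
Since the question asks for arrays, the DataFrame produced this way can be converted to a NumPy array. The following is a minimal sketch, not part of the original answer: the helper name read_mesh_points and the in-memory SAMPLE text are assumptions made for illustration, and nrows is added on the assumption that the real file continues after the point table.

```python
import io
import re

import pandas as pd

# Sample text copied from the question; a real file would be read from disk.
SAMPLE = """\
   Generate :
 MESH :     Cartesian
   1.00000   0.00000   0.00000
   0.00000   0.84680   0.00000
   0.00000   0.00000   0.80724
 MESH : 4 unique points
               x           y           z        Weight
    1      0.000000    0.000000    0.000000     0.3906
    2      0.125000    0.000000    0.000000     0.7812
    3      0.250000    0.000000    0.000000     0.7812
    4      0.375000    0.000000    0.000000     0.7812
"""

PATTERN = re.compile(r"MESH : (\d+) unique points")


def read_mesh_points(text):
    """Hypothetical helper: parse the table after 'MESH : N unique points'."""
    for line_counter, line in enumerate(text.splitlines(), start=1):
        match = PATTERN.search(line)
        if match:
            break
    else:
        # Without this check, pandas would skip every line and raise
        # EmptyDataError instead of a clear message.
        raise ValueError("marker line not found")
    return pd.read_csv(
        io.StringIO(text),
        skiprows=line_counter,        # skip everything up to the marker line
        nrows=int(match.group(1)),    # read only the N data rows
        sep=r"\s+",
    )


df = read_mesh_points(SAMPLE)
xyz = df[["x", "y", "z"]].to_numpy()  # 2-D float array, one row per point
```

Individual columns are then available as xyz[:, 0], xyz[:, 1] and xyz[:, 2].
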
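To get x, y, z (plus the line numbers and weights) from Qp, which is a list of single-element lists, as tuples (converting the tuples to lists is trivial), you can try: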
>>> Qp
[['    1      0.000000    0.000000    0.000000     0.3906\n'], ['    2      0.125000    0.000000    0.000000     0.7812\n'], ['    3      0.250000    0.000000    0.000000     0.7812\n'], ['    4      0.375000    0.000000    0.000000     0.7812\n']]
>>> lno, x, y, z, Weight = zip(*(line[0].split() for line in Qp))
>>> lno
('1', '2', '3', '4')
>>> x
('0.000000', '0.125000', '0.250000', '0.375000')
>>> y
('0.000000', '0.000000', '0.000000', '0.000000')
>>> z
('0.000000', '0.000000', '0.000000', '0.000000')
>>> Weight
('0.3906', '0.7812', '0.7812', '0.7812')

For floats instead of strs:
>>> lno, x, y, z, Weight = zip(*([float(a) for a in line[0].split()] for line in Qp))
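
If a single numeric array is more convenient than separate tuples, the same Qp list can be passed to NumPy. This is a sketch under the assumption that NumPy is available and that every row has the five whitespace-separated fields shown in the question; the names table, xyz and weight are illustrative.

```python
import numpy as np

# Qp as produced by the code in the question: a list of one-element lists.
Qp = [
    ['    1      0.000000    0.000000    0.000000     0.3906\n'],
    ['    2      0.125000    0.000000    0.000000     0.7812\n'],
    ['    3      0.250000    0.000000    0.000000     0.7812\n'],
    ['    4      0.375000    0.000000    0.000000     0.7812\n'],
]

# Split each row on whitespace and let NumPy convert the strings to floats.
table = np.array([row[0].split() for row in Qp], dtype=float)

xyz = table[:, 1:4]   # columns x, y, z
weight = table[:, 4]  # column Weight
```
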

To get x, y, z (and Weight) as columns of a DataFrame:
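>>> import pandas as pd
>>> import re
>>>
>>> with open('out.test', 'r') as f:
...     for i, line in enumerate(f, 1):
...         m = re.search(r'MESH : (\d+) unique points', line)
...         if m:
...             break
...
>>> i
6
>>> m.group(1)
'4'
>>> # nrows counts data rows only; the header line is not included
>>> df = pd.read_csv('out.test', skiprows=i, nrows=int(m.group(1)), sep=r"\s+")
>>> df
       x    y    z  Weight
1  0.000  0.0  0.0  0.3906
2  0.125  0.0  0.0  0.7812
3  0.250  0.0  0.0  0.7812
4  0.375  0.0  0.0  0.7812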

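Hi, this looks very elegant, but I get the following error: pandas.errors.EmptyDataError: No columns to parse from file

@user175924 Make sure the regex is correct (right number of spaces, etc.). Probably the regex "MESH : \d+ unique points" is not found, so pandas skips all lines.

The problem comes from skiprows=line_counter. When the for loop reaches the string defined in r, it does not break, so it reads all lines, and then in df all lines are skipped and there is nothing left to read. When I use the part of my code that gets the line number (output_line), it gives the correct result. What is wrong?

@user175924 The break happens when the regex finds a successful match. Apparently the regex MESH : \d+ unique points is having some trouble. I don't have your exact file, so I cannot pinpoint the problem. But you can use your solution without the regex: just find the second "MESH" in the file and break.

Ah! Sorry, my bad. That line is slightly different in my file. Thanks!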
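
The regex-free approach suggested in the comments can be sketched as follows. It is an illustration only: the function name points_after_second_mesh is hypothetical, and the code assumes that the second line containing "MESH" holds the point count as its only integer token and is followed by one header line.

```python
import io
import itertools

# Sample text copied from the question; a real file would be opened with open().
SAMPLE = """\
   Generate :
 MESH :     Cartesian
   1.00000   0.00000   0.00000
   0.00000   0.84680   0.00000
   0.00000   0.00000   0.80724
 MESH : 4 unique points
               x           y           z        Weight
    1      0.000000    0.000000    0.000000     0.3906
    2      0.125000    0.000000    0.000000     0.7812
    3      0.250000    0.000000    0.000000     0.7812
    4      0.375000    0.000000    0.000000     0.7812
"""


def points_after_second_mesh(lines):
    """Return (x, y, z) float tuples that follow the second 'MESH' line."""
    lines = iter(lines)
    seen = 0
    for line in lines:
        if "MESH" in line:
            seen += 1
            if seen == 2:
                break
    else:
        raise ValueError("fewer than two lines containing 'MESH'")

    # Assumption: the point count is the only integer token on this line.
    counts = [int(tok) for tok in line.split() if tok.isdigit()]
    if not counts:
        raise ValueError("no point count on the second 'MESH' line")

    next(lines, None)  # skip the 'x y z Weight' header line
    points = []
    for row in itertools.islice(lines, counts[0]):
        _, x, y, z, _ = row.split()  # index, x, y, z, weight
        points.append((float(x), float(y), float(z)))
    return points


with io.StringIO(SAMPLE) as f:  # stands in for open('out.test')
    points = points_after_second_mesh(f)
```

Because the search only tests for the substring "MESH", small spacing differences in the marker line, which caused the regex to miss in the thread above, do not matter here.
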