Python: parsing a mixed text file with batch header information


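I'm an amateur astronomer (retired) and I'm just tinkering with an idea. I want to scrape data from a NASA website, which is a text file, and extract specific values to help me decide when to observe. The text file is updated automatically every 60 seconds. It has some header information before the actual rows and columns of data that I need to process. The actual data is numeric. For example:

Prepared by NASA

Please send comments and suggestions to xxx.com

Date Time Year

yr mo da hhmm day value1 value2

2019 03 31 1933 234 6.00e-09 1.00e-09

I want to access the numeric data, which arrives as strings, and convert it to doubles.

As far as I can see, the file is space-delimited.

I want to poll the website every 60 seconds, and if value 1 and value 2 are above a specific threshold, trigger PyAutoGUI to automate a software application that takes an image.

After reading the file from the website, I tried converting it into a dictionary, thinking I could map keys to values, but I can't predict the exact positions I need. My thought was that once I had extracted the values I need, I would write them to a file and then try to convert the strings to double or float.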

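I tried using

import re
re.split

to read each line and split up the information, but because of the header information I got a huge mess.

I wanted a simple way to open the file, and this works: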


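import urllib.request
import csv

# In Python 3, urlopen() lives in urllib.request and returns bytes, so decode to text.
url = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
data = urllib.request.urlopen(url).read().decode("utf-8")

print(data)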

I found this on Stack Overflow, but I don't know how to use it:

with open('abc.txt', 'r') as f:        # the file is closed automatically on exit
    for line in f:
        a = line.split()               # this creates a list of the fields on the line
        if len(a) < 2:
            continue                   # skip blank or too-short lines
        name = a[0]
        value = int(a[1])              # or value = float(a[1]), whatever you want
        # use the name and value however you like

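What I want is to extract value 1 and value 2 as double or float values. Then, in the second part (which I haven't started yet), I will compare value 1 and value 2, and if they are above a specific threshold, that will trigger PyAutoGUI to interact with my imaging software and take an image.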

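Here's a simple example using regular expressions. It assumes you read the whole file into memory with a single f.read() instead of processing it line by line, which is often the simpler route with regular expressions (and I'm lazy and didn't want to create a test file):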

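import re

data = """
blah
blah
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
blah
"""

# Seven whitespace-separated fields; the last two are the values of interest.
pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")

def main():
    m = pattern.search(data)
    if m:
        # Do whatever processing you want to do here. You have access to all 7 input
        # fields via m.group(1) through m.group(7).
        d1 = float(m.group(6))
        d2 = float(m.group(7))
        print(">{}< >{}<".format(d1, d2))

main()

This was a fun little challenge. I've pulled all the data out of the table for you, mapped it to a class, and converted the values to int and Decimal where appropriate. Once the data is populated, you can read everything you need from it.

To fetch the data I used the requests library instead of urllib; that's purely personal preference. If you want to use it too, install it with pip install requests. It has a method, iter_lines, that iterates over the lines of the response.

This may be overkill for what you need, but since I'd written it anyway I thought I'd post it for you.
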
import re
from datetime import datetime
from decimal import Decimal

import requests


class SolarXrayFluxData:
    """One data row: timestamp, day counters, and the two flux readings."""

    def __init__(
            self,
            year,
            month,
            day,
            time,
            modified_julian_day,
            seconds_of_the_day,
            short,
            long
    ):
        # The time field is HHMM, so slice it into hour and minute.
        self.date = datetime(
            int(year), int(month), int(day), hour=int(time[:2]), minute=int(time[2:])
        )
        self.modified_julian_day = int(modified_julian_day)
        self.seconds_of_the_day = int(seconds_of_the_day)
        self.short = Decimal(short)
        self.long = Decimal(long)


class GoesXrayFluxPrimary:
    def __init__(self):
        self.created_at = ''
        self.data = []

    def extract_data(self, url):
        response = requests.get(url)
        response.raise_for_status()  # fail early on an HTTP error
        for i, line in enumerate(response.iter_lines(decode_unicode=True)):
            if not line:
                continue  # skip blank lines instead of indexing into them
            # Header lines start with ':' or '#'; the second line holds the creation time.
            if line[0] in (':', '#'):
                if i == 1:
                    self.set_created_at(line)
                continue

            row_data = re.findall(r"(\S+)", line)
            self.data.append(SolarXrayFluxData(*row_data))

    def set_created_at(self, line):
        date_str = re.search(r'\d{4}\s\w{3}\s\d{2}\s\d{4}', line).group(0)
        self.created_at = datetime.strptime(date_str, '%Y %b %d %H%M')


if __name__ == '__main__':
    goes_xray_flux_primary = GoesXrayFluxPrimary()
    goes_xray_flux_primary.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    print("Created At: %s" % goes_xray_flux_primary.created_at)
    for row in goes_xray_flux_primary.data:
        print(row.date)
        print("%.12f, %.12f" % (row.short, row.long))


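The purpose of the SolarXrayFluxData class is to store each row's data and make sure it is in a good, usable format. The GoesXrayFluxPrimary class populates a list of SolarXrayFluxData and stores any other data you might want to pull out. For example, I grabbed the Created date and time. You could also get the Location and Source from the header data.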
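
The last remark, that Location and Source could also be read from the header, might look like the sketch below. The sample header text, the HEADER_FIELD pattern, and the parse_header_fields function are all assumptions made for illustration; the exact wording of the real file's header lines may differ.

```python
import re

# Assumed sample header, modeled on the ':' and '#' prefixed lines that the
# answer's code skips. The real file's wording may differ.
SAMPLE_HEADER = """\
:Created: 2019 Mar 31 1935 UTC
# Source: GOES-15
# Location: W135
# YR MO DA  HHMM    Day     Day       Short       Long
2019 03 31  1933   58573  70380     6.00e-09    1.00e-09
"""

# Matches header lines shaped like "# Key: value" or ":Key: value".
HEADER_FIELD = re.compile(r"^[:#]\s*(\w+):\s*(.+?)\s*$")


def parse_header_fields(text):
    """Collect 'Key: value' pairs from lines that start with ':' or '#'."""
    fields = {}
    for line in text.splitlines():
        m = HEADER_FIELD.match(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


header = parse_header_fields(SAMPLE_HEADER)
print(header.get("Source"), header.get("Location"))
```

Using a dictionary keeps any header field available by name, and header lines that don't follow the "Key: value" shape, such as the column titles, are simply ignored.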
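I think your general approach should be to read past the header data before you start the actual processing. For example, you could read lines until you reach the "Date Time Year" line, if you can count on it always being there.

This is a great use case for regular expressions. But maybe you don't know enough about them and don't want to use them here. That's the fun of programming. There are many ways to skin a cat!

Or I mean look for "yr mo da hhmm day VALUE 1 VALUE 2", if that is literally what's in the data.

re.findall(pattern, string, flags=0) may be a better choice, since it will "return all non-overlapping matches of pattern in string".

In this case that's mostly a style choice. I've handled the overlap issue by moving the start position past each match. The downside of findall is that you run into trouble with more complex expressions where not all of the groups match. I use search as a general rule, so I handle things the same way no matter how complex the expression gets.

Not to mention that with findall you lose access to the match object itself.

I agree that in this case findall would work just as well and might even make the post-processing simpler.

Hey @Walter, did this give you what you need? Do you need any more help? I don't know whether you're one of the upvoters here. The checkmark is yours to give. If this answered your question, you should click the checkmark to turn it green so everyone can see you got an answer you like. (By the way, I don't care about the extra 5 points. I'm really just interested in whether the OP got what he wanted out of this.)

Hello Steve, yes, this helped me a great deal! I had to go through what you wrote line by line with my Python cookbook in hand. I didn't want what you did to be a "black magic" box of code to me. I took a Fortran class in college 42 years ago, and I've been playing with Python for the past month. All the Python references have been very helpful as I learn to code simple projects. Now I have to rack my brain over how to apply all the discrete things I've learned to a larger application. Thanks again! Walter

You're very welcome. I hope the only tricky part here is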
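
The line-by-line approach suggested in the comments (skip the header until the column-title row, then split each data row) could be sketched roughly as follows. The SAMPLE text reuses the example from the question, and COLUMN_HEADER and parse_rows are hypothetical names; the sketch assumes the column-title row appears literally as shown.

```python
SAMPLE = """\
Prepared by NASA
Please send comments and suggestions to xxx.com
Date Time Year
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
2019 03 31 1934 234 7.50e-09 1.00e-09
"""

# Assumed to appear literally in the file as the last header line.
COLUMN_HEADER = "yr mo da hhmm day value1 value2"


def parse_rows(text, column_header=COLUMN_HEADER):
    """Return (value1, value2) floats for each data row after the header row."""
    rows = []
    in_data = False
    for line in text.splitlines():
        if not in_data:
            # Skip everything up to and including the column-title row.
            in_data = line.strip() == column_header
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or too-short line
        # The two values of interest are the last two fields on the row.
        rows.append((float(fields[-2]), float(fields[-1])))
    return rows


print(parse_rows(SAMPLE))
```

float() already understands scientific notation such as 6.00e-09, so no extra conversion step is needed once the fields are split.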
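Output from the first regular-expression example, which uses the inline test data:

>6e-09< >1e-09<

The same approach, extended to download the live file and loop over every data row: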

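import re
import urllib.request

# Eight whitespace-separated fields per data row; the last two are the flux values.
pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")

def main():

    # In Python 3, urlopen() returns bytes, so decode before applying a str pattern.
    url = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
    data = urllib.request.urlopen(url).read().decode("utf-8")

    pos = 0
    while True:
        # Resume the search where the previous match ended.
        m = pattern.search(data, pos)
        if not m:
            break
        pos = m.end()
        # Do whatever processing you want to do here.  You have access to all 8 input
        # fields via m.group(1) through m.group(8)
        f1 = float(m.group(7))
        f2 = float(m.group(8))
        print(">{}< >{}<".format(f1, f2))

main()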

>9.22e-09< >1e-09<
>1.06e-08< >1e-09<
...
>8.99e-09< >1e-09<
>1.01e-08< >1e-09<
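
To connect this to the polling goal in the question, here is a minimal sketch that uses re.findall (the alternative raised in the comments) to take the newest row and compare it against thresholds. The names latest_values, above_threshold, and poll, the threshold numbers, and the sample row are all hypothetical; fetch is any function that returns the file's text, and on_trigger is where a PyAutoGUI step would go.

```python
import re
import time

# Same eight-field row pattern as the search loop above, used with findall.
ROW = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)")


def latest_values(text):
    """Return the last row's two flux values as floats, or None if no row matches."""
    rows = ROW.findall(text)  # list of 8-tuples of strings
    if not rows:
        return None
    *_, short, long_ = rows[-1]
    return float(short), float(long_)


def above_threshold(values, short_limit, long_limit):
    """True when both values exceed their limits."""
    short, long_ = values
    return short > short_limit and long_ > long_limit


def poll(fetch, on_trigger, short_limit, long_limit,
         interval=60, max_polls=None, sleep=time.sleep):
    """Call fetch() every `interval` seconds and fire on_trigger when limits are exceeded."""
    count = 0
    while max_polls is None or count < max_polls:
        values = latest_values(fetch())
        if values is not None and above_threshold(values, short_limit, long_limit):
            on_trigger(values)
        count += 1
        if max_polls is None or count < max_polls:
            sleep(interval)  # wait before the next poll


# One poll against a canned row, so nothing is downloaded and nothing sleeps.
sample = "2019 03 31  1933   58573  70380     6.00e-09    1.00e-09\n"
poll(lambda: sample, print, short_limit=1e-09, long_limit=5e-10, max_polls=1)
```

Passing fetch and sleep in as arguments keeps the loop testable without a network connection; in real use fetch would wrap the download shown above and max_polls would be left as None.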
