Python: determining the actual file position of the start of a line

Tags: python, xml, python-2.7, file, iteration

I'm writing a Python script that iterates over a large XML file of articles and writes each article out to a new file. I also want to keep an index of each article's starting position in the file, so that if the script fails while processing it (which takes many hours) I can pick up where it left off with something like file.seek().

Here is my current code:

with open(inputPath, 'rb') as inputfile: # Iterate over the dump
    for line in inputfile:
        # do stuff...
I believe that, used this way, this won't give me the correct file position, and in any case I want the position where the line starts, not where it ends.
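For example (a minimal sketch, assuming a small test.xml in the format shown below), tell() during line iteration reports the end of Python 2's internal read-ahead buffer, not the start of the line that was just returned:

# For a small file the whole thing fits in the read-ahead buffer, so tell()
# returns the end-of-file position on every iteration instead of a per-line offset.
with open('test.xml', 'rb') as inputfile:
    for line in inputfile:
        print inputfile.tell(), repr(line)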

The XML file looks like this:

<pages>
  <page>
  This is an article.  Article-ly stuff goes here.
  </page>
  <page>
  ... x5,000,000
  </page>
</pages>


As I iterate over the file, I'd like to capture the position of the start of each line that begins with the <page> tag.

You can keep track of the byte pointer yourself. In this example I use a dict of byte pointers and store it in a shelf, then export html files from that xml.

By default, this script prints a status report every 1000 pages exported. It also writes a graph image to the project root ('xml-split-performance.jpg' by default, like the one at the end of this post), so you can check that the process is working.

By default I use 8 workers: I keep 8 pages in memory, then hand those 8 pages to the 8 workers to write. I'm not sure it's the best approach, but it worked well for my 50k htmls.
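The core idea, reduced to a minimal sketch (file and key names here are illustrative, and it deliberately ignores a </page> tag straddling two read() calls, which the full script handles with chunk_tail):

import shelve

# A shelf mapping "pages exported so far" -> byte offset reached in the dump,
# so an interrupted run can seek() straight back instead of re-reading.
checkpoints = shelve.open('chunk_pointers_sketch.db')

with open('sample.xml', 'rb') as dump:
    # Resume from the highest checkpoint left by a previous run, if any.
    done = [int(k) for k in checkpoints.keys()]
    page_count = max(done) if done else 0
    byte_count = checkpoints['%s' % page_count] if done else 0
    dump.seek(byte_count)

    while True:
        chunk = dump.read(10000)
        if not chunk:
            break
        # ... split the chunk on '</page>' and hand each page to a worker ...
        page_count += chunk.count('</page>')
        byte_count += len(chunk)
        checkpoints['%s' % page_count] = byte_count  # shelve keys must be strings
        checkpoints.sync()

checkpoints.close()

The full script: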

from multiprocessing.pool import ThreadPool as Pool
import matplotlib.pyplot as plt
import os, sys, datetime, math
from urlparse import urljoin
import shelve

class xml_split():

    # Flags
    track_line = False
    last_line = False
    output_file = False
    all_pages_extracted = False

    # Modes available: 
    # - Read output folder and calculate missing files: c (default)
    # - Start from scratch: s
    resume_modes = {
        'complete_file_count': 'c',
        'start_from_scratch': 's',
    }

    # Queue of pages in memory, before writing to file
    pages_queue = []

    def __init__(
            self,

            # Number of workers
            pool_size = 8,

            # Read x pages at a time
            page_memory = 30,

            # Path to the input xml file
            input_xml_file_path = "sample.xml",

            # File name prefix and extension
            output_file_prefix = 'page_',   
            output_file_extension = '.html',

            # Path to the output folder: will be created under the script path
            page_files_folder = 'pages',   

            # Path to the cache folder: will be created under the script path
            cache_files_folder = 'cache',  

            # filename of a graph that shows the average performance
            graph_file_path = 'xml-split-performance.jpg',

            # update graph each x number of pages extracted
            update_graph_each = 1000,

            # print status on stdout each x number of pages extracted
            show_status_each = 1000,

            # group files in folders of x pages
            batch_size = 1000,

            # tags to track
            start_string = '<page>',
            end_string = '</page>',
            start_doc = '<pages>',
            end_doc = '</pages>',

            # A little template to output the status
            log_template = """
            Page:                                  {page}
            Time:                                  {exported}
            Time since the beginning:              {queue}
            Reporting each n exports:              {show_status_each}
            Time since last report:                {process}
            Average entry export time:             {average}
            """,
        ):

        self.pool_size = pool_size
        self.pool = Pool(pool_size)

        self.input_xml_file_path = input_xml_file_path
        self.input_file = open(input_xml_file_path, 'rb')  # binary mode so byte offsets stay exact

        self.output_file_prefix = output_file_prefix
        self.output_file_extension = output_file_extension
        self.page_files_folder = page_files_folder
        self.cache_files_folder = cache_files_folder

        self.page_memory = page_memory
        self.show_status_each = show_status_each
        self.update_graph_each = update_graph_each
        self.batch_size = batch_size

        self.graph_file_path = graph_file_path

        self.start_string = start_string
        self.end_string = end_string
        self.end_doc = end_doc
        self.start_doc = start_doc
        self.log_template = log_template
        self.chunk_tail = ''

        # Set project path to the current script path
        self.project_path = os.getcwd() + os.sep

        # Folder to place output files
        self.page_files_path = urljoin(self.project_path, self.page_files_folder) + os.sep

        # Folder to place cache files
        self.cache_files_path = urljoin(self.project_path, self.cache_files_folder) + os.sep

        self.create_folder(self.page_files_path)
        self.create_folder(self.cache_files_path)

        # keep track of time, to calculate time spent
        self.main_start = datetime.datetime.now()
        self.start = self.main_start

        # by default, set the resume mode to check output folder
        self.resume_mode = self.resume_modes['complete_file_count']

        # Uncomment this line to ask for user input, on the shell.
        # self.resume_mode = raw_input("s) Start from scratch c) Resume from missing files:")

        # Create or open a shelf to keep a cache of line number, page number, and performance stats
        self.chunk_pointers = shelve.open(self.cache_files_path + 'chunk_pointers.log')
        self.persistent_performance_tracker = shelve.open(self.cache_files_path + 'persistent_performance_tracker.log')


        # Init shelf counter


        # *** Resume from missing files on the output folder
        # (Resume an interrupted operation by checking the existing files on the output folders)
        if self.resume_mode == self.resume_modes['complete_file_count']:
            previously_existent_file_count = 0
            for output_root, output_dirnames, output_filenames in os.walk(self.page_files_path):
                for dirname in output_dirnames:
                    for root, dirnames, filenames in os.walk(self.page_files_path + dirname):
                        for filename in filenames:
                            if filename.endswith(self.output_file_extension) and filename.startswith(self.output_file_prefix):
                                previously_existent_file_count += 1

            resume_from_page = int(math.floor(previously_existent_file_count / self.pool_size) * self.pool_size)
            if '%s' % (resume_from_page) in self.chunk_pointers:
                self.page_count = resume_from_page
                self.byte_count = self.chunk_pointers['%s' % self.page_count]
            else:
                self.byte_count = 0
                self.page_count = 0

        # *** Do not resume
        elif self.resume_mode == self.resume_modes['start_from_scratch']:
            self.byte_count = 0
            self.page_count = 0

    # Create folder if doesn't exist
    def create_folder(self, path):
        if not os.path.exists(path):
            os.makedirs(path)

    # Get 30 pages a time and store them in memory
    def slice_file(self, start=0, end=30):
        max_pages = end - start
        chunk = self.chunk_tail
        pages_stored = 0
        while True and max_pages:
            new_chunk = self.input_file.read(10000)
            if new_chunk:

                chunk += new_chunk
                pages_stored = len(chunk.split(self.end_string))

                if pages_stored > max_pages:
                    pages_for_next_slice = max_pages - pages_stored
                    if pages_for_next_slice == 0:
                        pages_for_next_slice = -1
                    self.chunk_tail = ''.join(chunk.split(self.end_string)[pages_for_next_slice:])
                    return ''.join(chunk.split(self.end_string)[0:max_pages])

            else:
                return ''.join(chunk.split(self.end_string))


    def get_folder_name(self):
        folder_name = int(math.floor(self.page_count / self.batch_size) * self.batch_size)
        folder_name = '%s%s' % (folder_name, os.sep)
        return folder_name

    def save_page(self, path, file_contents):
        with open(path, 'w') as output_file:
            output_file.write(file_contents)

    def process_queue(self):
        for page in self.pages_queue:
            self.pool.apply_async(
                self.save_page, 
                args = (page[0], page[1])
            )

    def save_graph(self):

        performance_seconds = []
        performance_page_count = []
        vals = self.persistent_performance_tracker.items()
        ordered_vals = sorted(vals, key=lambda i: int(i[0]))

        for val in ordered_vals:
            performance_seconds += [val[1]]
            performance_page_count += [int(val[0])]  # shelf keys are stored as strings


        plt.clf()
        plt.plot(performance_page_count, performance_seconds)
        plt.ylabel('Task duration progress')

        plt.savefig(self.graph_file_path)

    def handle_status_reports(self):

        # Update graph
        if self.page_count % self.update_graph_each == 0:

            self.end = datetime.datetime.now()
            average = (self.end - self.start) / self.show_status_each
            average = average.total_seconds()


            self.persistent_performance_tracker['%s' % self.page_count] = average
            self.persistent_performance_tracker.sync()

            self.save_graph()

        # Print status to stdout
        if self.page_count % self.show_status_each == 0:

            self.end = datetime.datetime.now()
            average = (self.end - self.start) / self.show_status_each
            average = average.total_seconds()

            log = self.log_template.format(
                    page= self.page_count,
                    exported = self.end,
                    average = average,
                    show_status_each = self.show_status_each,
                    process = self.end - self.start,
                    queue = self.end - self.main_start
                )


            self.persistent_performance_tracker['%s' % self.page_count] = average
            self.persistent_performance_tracker.sync()

            sys.stdout.write(log)
            sys.stdout.flush()
            self.start = datetime.datetime.now()



    # Go through xml file lines and output data to html files
    def read_xml(self):

        tag_contents = ''

        # Seek page where to pick up from
        self.slice_file(0, self.page_count)


        # self.slice_file(0, self.page_memory)

        while self.all_pages_extracted == False:
            # check if there are still bytes to read
            try:
                chunk = self.slice_file(0, self.page_memory)
            except:
                break

            if not chunk:
                break

            pages_in_chunk = chunk.split(self.start_string)[1:]


            for page_i, page_contents in enumerate(pages_in_chunk):

                # new page start was found, count
                self.page_count += 1

                # create batch folder
                if self.page_count % self.batch_size == 0 or self.page_count == 1:
                    self.create_folder(self.page_files_path + self.get_folder_name())


                output_file_name = '{pre}{page_count}{ext}'.format(
                    pre = self.output_file_prefix, 
                    page_count = self.page_count, 
                    ext = self.output_file_extension
                )

                # if it's the last page, set the flag and ignore closing tag
                if self.end_doc in page_contents:
                    page_contents = page_contents.split(self.end_doc)[0]
                    self.all_pages_extracted = True

                self.pages_queue += [(
                    self.page_files_path + self.get_folder_name() + output_file_name,
                    tag_contents + page_contents
                )]

                self.handle_status_reports()

                if self.page_count % self.pool_size == 0:
                    # keep track of byte pointers where worker pools start
                    self.chunk_pointers['%s' % self.page_count] = self.byte_count
                    self.chunk_pointers.sync()
                    self.process_queue()
                    self.pages_queue = []

                self.byte_count += len(page_contents)


        self.close()


    def close(self):
        self.process_queue()
        self.save_graph()
        self.chunk_pointers.close()
        self.persistent_performance_tracker.close()
        self.pool.close()
        self.pool.join()

if __name__ == '__main__':

    xso = xml_split()

    xso.read_xml()
The script generates a matplotlib graph to track the average entry extraction time, and it looks fairly stable. Here is the graph for 50,000 entries on my machine (entry export duration in seconds on the y axis, number of entries exported on the x axis):


Feel free to cancel the process and start it again: the default behaviour is to check the files that were already exported and resume from there.
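To run it against your own dump, only the input path needs to change (the file name below is a placeholder; the keyword arguments just override defaults from __init__). Re-running the same command after an interruption resumes from the files already written:

xso = xml_split(
    input_xml_file_path = 'enwiki-dump.xml',  # placeholder path to the 57 GB dump
    pool_size = 8,
    show_status_each = 5000,
)
xso.read_xml()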

Here is a solution based on the linked answer:

offsets = []
with open('test.xml', 'rb') as inputfile: # Iterate over the dump
    # Save 1st line offset
    current = inputfile.tell()
    for line in iter(inputfile.readline,''):
        if line.lstrip().startswith('<page>'):
            offsets.append(current)
        # Update to the current offset (line about to be read)
        current = inputfile.tell()

# Demo the offsets are lines with <page>
with open('test.xml', 'rb') as inputfile: # Iterate over the dump
    for offset in offsets:
        inputfile.seek(offset)
        print offset,inputfile.readline()
Output:

9   <page>

82   <page>
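If the index needs to survive a crash (the original goal), the offsets list can be written to disk and used with seek() on the next run. A minimal sketch, assuming a hypothetical page_offsets.json index file:

import json

# Persist the offsets collected above so a later run can resume without
# re-scanning the 57 GB dump.
with open('page_offsets.json', 'w') as index_file:  # hypothetical index file
    json.dump(offsets, index_file)

# Later (e.g. after a crash), jump straight to any recorded article:
with open('page_offsets.json') as index_file:
    offsets = json.load(index_file)

with open('test.xml', 'rb') as inputfile:
    inputfile.seek(offsets[1])   # second recorded <page> line (offset 82 above)
    print inputfile.readline()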

Could you add a portion of the XML file so we can get a better idea of it? Thanks.

I've updated the question with a sample of the XML format I'm using. Thanks.

Why not just update a counter as you read each line?

I'm not familiar with things like lxml, and it looks as though that would load the whole file into memory, which won't work because my file is a 57 GB Wikipedia dump. Right now I'm running through the file line by line and splitting whenever it finds the <page> tag; my mistake was writing line numbers into the index instead of file positions. You can see my current script here:

Sorry, I haven't been able to go through your code yet. I've updated the answer with another approach. Let me know what you think?

Hey, I really appreciate the effort you've put into my question. I spent a while with your code but was never able to get it to run correctly; I think it has to do with my directory tree being odd. I assume your program loops over every line of the file to find where it left off, is that right? I'm already doing that in my script, and if the process got far into the file it takes hours to loop back to where it stopped. I'm really hoping for a way to determine the position in the file so I can call file.seek() and jump straight to where it left off. I'm working with a 57 GB xml file, and I think loading it one line at a time would be too slow.

@miller9904, that wasn't stated in the question, and reading one line at a time is what you do in the question. I answered the question as asked.