Processing a (binary?) file in Python and the Linux shell

python file text character-encoding

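I recently wrote a Python script that processes a Microsoft Windows DHCP server dump file and generates an XML file of the current reservations in Spreadsheet XML format.

The script basically opens a file with Python's open(), iterates over each line (for line in file), and looks for the keyword reservedip. If the keyword is found, it splits the line into fields with shlex.split().

However, when I run the script against the default dump file from the Microsoft DHCP server, I get no results. Note also that I can't search the file with Linux's grep command.

I then tried opening the file in gedit and saving it as a Unix text file. After that I got results and was able to grep the file. However, that approach defeats the whole purpose of writing a script to automate my work.

I have been googling but haven't found what I'm looking for. I also tried opening the file in binary mode, but that didn't help either.

I hope someone can help me.

As requested, here is what the script does (at least the loop part) and a sample of the DHCP server output:

Script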

import shlex

# Set up an empty dictionary to store the extracted records
records = {}

# Open the DHCP dump file
f = open("dhcp.txt", "r")

# Iterate over the file line by line
for line in f:

  # Only use lines with the word "reservedip" in them
  if "reservedip" in line:

    # Split the line into fields by spaces (excluding quoted substrings)
    field = shlex.split(line)

    # Add a new entry for each record, using the 32-bit IP address int as its key
    records[addr_to_int(field[7])] = [field[7], field[8], field[9], field[10]]

*Note: addr_to_int is a function I wrote that converts a dotted IPv4 address to an integer.*
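The post does not show addr_to_int itself. A minimal sketch of such a helper, assuming Python 3 and the standard ipaddress module (the author's actual implementation may well differ), could look like this:

```python
import ipaddress


def addr_to_int(addr):
    """Convert a dotted IPv4 address such as '172.16.104.207' to an integer."""
    # IPv4Address validates the string; int() yields the 32-bit value.
    return int(ipaddress.IPv4Address(addr))
```

Using the integer as the dictionary key makes it easy to sort the records numerically by address later on.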

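DHCP dump

Unfortunately, due to company policy, I can't include the real DHCP server dump. But the lines I'm trying to extract from the file look like this:

Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 003386dd00gg "hostname.company.local" "Host Description" "BOTH"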
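To make the field indices used in the script concrete, here is a small sketch of how shlex.split() tokenizes the sample line. It assumes the whole record sits on a single line, as shown in the question; the variable names are illustrative only.

```python
import shlex

# Sample record from the post (raw strings so the backslashes survive as typed).
line = (r'Dhcp Server \\servername.company.local Scope 172.16.104.0 '
        r'Add reservedip 172.16.104.207 003386dd00gg '
        r'"hostname.company.local" "Host Description" "BOTH"')

fields = shlex.split(line)

# Quoted substrings stay together as one field and lose their quotes.
ip, mac, hostname, description = fields[7:11]
print(ip, mac, hostname, description)
```

With this input, indices 7 through 10 are the reserved IP, the MAC-like identifier, the host name, and the description, which matches the fields the script stores.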

Thanks in advance,

Pascal

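One way to rule out a line-ending problem is to use re:

import re

# path_to_dhcp_file is a placeholder for the path to your dump file
with open(path_to_dhcp_file, 'r') as dhcp_file:
    for line in dhcp_file:
        # Change the end-of-line characters to UNIX style
        line = re.sub("\r\n", "\n", line)

        # now do your things on line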

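Based on the two lines you provided as a sample of the DHCP dump file's contents, I made the following test case (for clarity, in this example I added l1, l2, l3, ... at the start of each line as line-number references).

Here is the dump file data.txt that I created on Linux Fedora Core 17 (x86_64):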

l1: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l2: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l3: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l4: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l5: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add  172.16.104.207 
l6: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l7: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add  172.16.104.207 
l8: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"
l9: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l10: 003386dd00gg "hostname.company.local" "Host Description" "BOTH"

You said:

Note also that I can't search the file with Linux's grep command

Here is what I get when I run grep on the sample file above:

$ cat data.txt | grep reservedip
l1: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l3: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
l9: Dhcp Server \\servername.company.local Scope 172.16.104.0 Add reservedip 172.16.104.207 
$

Here is the test I ran with a Python script to check whether it can find the keyword "reservedip" in the sample file:

lineNumber = 0
with open("./data.txt") as dhcpDumpFile:
    for line in dhcpDumpFile:
        lineNumber += 1
        if "reservedip" in line:
            print("Found 'reservedip' at the line: ", lineNumber)

The result I get is:

$ python -tt myscript.py
("Found 'reservedip' at the line: ", 1)
("Found 'reservedip' at the line: ", 3)
("Found 'reservedip' at the line: ", 9)
$

So, it works for me.

Regards,

Dariyoosh

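The strings in this file may be stored in a character encoding that is not ASCII-compatible. UTF-8 and Latin-1 should be compatible, because they use a single byte for each ASCII character. UTF-16 and UTF-32 are not compatible: they always use more than one byte per character. UTF-16 is not uncommon in MS files, and sometimes files even mix encodings.

The dump may use 2 bytes even for ASCII characters. In that case the file contains

r~e~s~e~r~v~e~d~i~p

where ~ stands for the other byte of each character (depending on the byte order it can also come first, as in ~r).

This is only a guess, since you are not allowed to post the actual file and I know nothing about MS DHCP server dumps.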
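The effect described above can be illustrated with a short sketch. It assumes the dump is little-endian UTF-16, which at this point in the thread is only a guess:

```python
keyword = "reservedip"

utf8_bytes = keyword.encode("utf-8")
utf16_bytes = keyword.encode("utf-16-le")

# In little-endian UTF-16 every ASCII character is followed by a NUL byte.
print(utf8_bytes)
print(utf16_bytes)

# A byte-wise search for the ASCII keyword therefore fails on UTF-16 data.
print(utf8_bytes in utf16_bytes)
```

This would explain why both grep and a plain substring test come up empty even though the text looks normal in an editor.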

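What does

file file.txt

give you?

And what about

file --mime-type --mime-encoding

If the file has a "mixed" encoding (binary data with embedded strings), this will not necessarily tell you the encoding, but if it is plain UTF/ASCII text, it should.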
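If the file utility is not at hand, a rough equivalent check can be done in Python by looking for a byte order mark (BOM) at the start of the file. This is only a sketch: guess_bom_encoding is a hypothetical helper name, and a file without a BOM simply yields None rather than a real guess.

```python
import codecs

# BOMs to check; UTF-32 comes first because its little-endian BOM
# starts with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_bom_encoding(path):
    """Return a codec name based on the file's BOM, or None if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the BOM themselves, so the result can be passed on as an encoding when the file is opened for reading.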

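So is it a binary file or a text file?

As far as I understand, the generated DHCP file's encoding differs from what the Python script expects. Also, the end-of-line characters differ between Windows and Unix, which may change how the for loop splits the file into lines, because according to what you said, you saved the very same file as a Unix text file and then read its contents successfully. So I think you have to produce a "Unix-compatible" output file.

@LevLevitsky, I'm not sure; the file contains text (somewhat like a DHCP config file, produced by the PowerShell command: netsh dhcp server scope dump > file.txt)

@dariyoosh, I also ran some tests with the script. As far as I can tell, it can iterate over the file line by line, but the problem is searching for the keyword in the string or line. The statement I use is: if "reservedip" in line:

Thanks, but I already tried that and it doesn't work. The problem seems to be not the line endings but the string search. (I use if "reservedip" in line: and it doesn't return any lines.)

@Pascal Van Acker, if possible, post your script code here so the cause of the problem can be understood better. A sample of your DHCP file might also help with some testing.

@Nemelis, I've added some information to the post. Hope it helps.

Thanks for your effort, @Dariyosh. As I said, I'm sure it has to do with the way PowerShell redirects the DHCP dump information to a text file, because when I open the file with gedit and save it in Unix format, it works for me too. But I'm trying to automate the whole process so that it just shows a web page or spreadsheet with the latest DHCP information.

server1.txt: UTF-8 Unicode text; server2.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators. (The first txt file is the one I converted with gedit, so it looks like you're right and the encoding of the file is the problem. Any idea how to fix it?)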
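One possible fix for the question raised in the last comment, given that file reports the dump as little-endian UTF-16, is to tell Python which encoding to decode. The sketch below assumes Python 3 (on Python 2, io.open accepts the same encoding argument) and a BOM at the start of the file; find_reservations is a hypothetical helper name, and the path is whatever your dump file is called.

```python
import shlex


def find_reservations(path, encoding="utf-16"):
    """Yield the shlex-split fields of every line containing 'reservedip'."""
    # The 'utf-16' codec uses the BOM to pick the byte order; text mode
    # also normalizes CRLF and CR line endings while reading.
    with open(path, "r", encoding=encoding) as dump:
        for line in dump:
            if "reservedip" in line:
                yield shlex.split(line)
```

The rest of the original loop can stay as it is, for example by iterating over find_reservations("server2.txt") and indexing fields 7 to 10. On the shell side, converting the file first with iconv -f UTF-16 -t UTF-8 should let grep find the keyword as well.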