Linux: parsing a file that contains non-printable ASCII characters

linux bash parsing shell


I have a file (possibly binary) that consists mostly of non-printable ASCII characters, as shown by the output of the octal dump utility below:

od  -a MyFile.log 
0000000  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
0000020 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
0000040 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000100 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
0000160 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000200 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
0000220 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
0000240 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
0000260 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx
0000300 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul

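I would like to do the following:

Parse or split the file into paragraph-like sections that begin with the characters esc, fs, gs and us (ASCII codes 27, 28, 29 and 31)

Make the output file contain human-readable ASCII characters, as in the octal dump

Store the result in a file

What is the best way to do this? I would prefer to use UNIX/Linux shell utilities (e.g. grep) for this task rather than a C program.

Thanks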

Edit: I used the octal dump utility command

od -A n -a -v MyFile.log

to remove the offsets from the output, as follows:

  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx

I would like to start from this file, or pipe this output into some other utility such as awk.

If you have access to an awk that supports regular expressions in RS (e.g. gawk), you can do:

awk 'BEGIN{ ORS = ""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" } { print | cmd; close( cmd ) }' MyFile.log > output

This dumps all of the output into a single file. If you want each "paragraph" in a separate output file, you can do:

awk 'BEGIN{ ORS=""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" } { print | cmd "> output"NR }' MyFile.log

which writes to the files output1, output2, etc.

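Note that the awk standard says the behavior is unspecified if RS contains more than one character, but many awk implementations support regular expressions like this.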
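For readers whose awk does not accept a regular expression in RS, the same idea can be sketched in Python. This is only an illustration, not part of the answer above: the function names are made up, the name table imitates what `od -a` prints, and each separator byte is assumed to start a new section (as the question asks) rather than being consumed the way RS consumes it.

```python
# Bytes that start a new section: ESC, FS, GS, US (27, 28, 29, 31).
SEPARATORS = {0x1B, 0x1C, 0x1D, 0x1F}

# Names in the style of `od -a` for codes 0-32; 127 is handled separately.
NAMES = ["nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel",
         "bs", "ht", "nl", "vt", "ff", "cr", "so", "si",
         "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb",
         "can", "em", "sub", "esc", "fs", "gs", "rs", "us", "sp"]


def name_of(byte):
    """Return a readable name for one byte value (0-255)."""
    byte &= 0x7F  # imitate od -a, which ignores the high-order bit
    if byte < len(NAMES):
        return NAMES[byte]
    if byte == 127:
        return "del"
    return chr(byte)


def split_sections(data):
    """Split bytes so that every separator byte begins a new section."""
    sections, current = [], bytearray()
    for b in data:
        if b in SEPARATORS and current:
            sections.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        sections.append(bytes(current))
    return sections


def render(data):
    """Render each section as one line of names, sections separated by a blank line."""
    lines = [" ".join(name_of(b) for b in sec) for sec in split_sections(data)]
    return "\n\n".join(lines)
```

To use it on the file from the question, one could read the file in binary mode and write the result out, for example `open("output", "w").write(render(open("MyFile.log", "rb").read()))`.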

I think a simpler approach is to use a flex program:

/*
 * This file is part of flex.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 
 * Neither the name of the University nor the names of its contributors
 * may be used to endorse or promote products derived from this software
 * without specific prior written permission.
 * 
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE.
 */

    /************************************************** 
        start of definitions section

    ***************************************************/

%{
/* A template scanner file to build "scanner.c". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
/*#include "parser.h" */

//put your variables here
char FileName[256];
FILE *outfile;
char inputName[256];


// flags for command line options
static int output_flag = 0;
static int help_flag = 0;

%}


%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section

    *************************************************/


    /* these flex patterns will eat all input */ 
\x1B { fprintf(yyout, "\n\n"); }
\x1C { fprintf(yyout, "\n\n"); }
\x1D { fprintf(yyout, "\n\n"); }
\x1F { fprintf(yyout, "\n\n"); }
[[:alnum:]] { ECHO; }
.  { }
\n { ECHO; }


%%
    /**************************************************** 
        start of code section


    *****************************************************/

int main(int argc, char **argv);

int main (argc,argv)
int argc;
char **argv;
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */

            {"useStdOut", no_argument,       0, 'o'},
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "ho",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case 'o':
               output_flag = 1;
               break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: cleaner [OPTIONS]... INFILE OUTFILE\n");
        printf("Strips non printable chars from input, adds line breaks on esc fs gs and us\n\n");
        printf("Option list: \n");
        printf("-o                      sets output to stdout\n");
        printf("--help                  print help to screen\n");
        printf("\n");
        printf("If infile is left out, then stdin is used for input.\n");
        printf("If outfile is a filename, then that file is used.\n");
        printf("If there is no outfile, then infile-EDIT is used.\n");
        printf("There cannot be an outfile without an infile.\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin


    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "rb");
        if (!file) {
            fprintf(stderr, "Flex could not open %s\n",argv[optind]);
            exit(1);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //increment current place in argument list
    optind++;


    /********************************************
        if no input name, then output set to stdout
        if no output name then copy input name and add -EDIT.csv
        otherwise use output name

    *********************************************/
    if (optind > argc) {
        yyout = stdout;
    }   
    else if (output_flag == 1) {
        yyout = stdout;
    }
    else if (optind < argc){
        outfile = fopen(argv[optind], "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",argv[optind]);
                exit(1);
            }
        yyout = outfile;
    }
    else {
        strncpy(FileName, argv[optind-1], strlen(argv[optind-1])-4);
        FileName[strlen(argv[optind-1])-4] = '\0';
        strcat(FileName, "-EDIT");
        outfile = fopen(FileName, "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",FileName);
                exit(1);
            }
        yyout = outfile;
    }

    yylex();
    if (output_flag == 0) {
        fclose(yyout);
    }
    printf("Flex program finished running file %s\n", inputName);
    return 0;
}

After compiling it and putting it on your path, simply use

od -A n -a -v MyFile.log | cleaner
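To make the intent of the seven pattern rules easier to see without the getopt scaffolding, here is a rough Python equivalent. It is a sketch based on my reading of the rules section (separators become a blank line, alphanumerics and newlines are echoed, everything else is discarded), and it assumes the input is the raw bytes of the file. The function name `clean` is hypothetical.

```python
import string

# ESC, FS, GS, US: each one is replaced by two newlines.
SEPARATORS = {0x1B, 0x1C, 0x1D, 0x1F}

# ASCII letters and digits, i.e. what the alnum rule is meant to echo.
KEEP = set((string.ascii_letters + string.digits).encode("ascii"))


def clean(data):
    """Apply the scanner's rules to a bytes object and return the result."""
    out = bytearray()
    for b in data:
        if b in SEPARATORS:
            out += b"\n\n"          # paragraph break
        elif b in KEEP or b == 0x0A:
            out.append(b)           # echo alphanumerics and newlines
        # any other byte is silently dropped, like the "." rule
    return bytes(out)
```

Because every other byte is dropped, this loses the control characters entirely; it only helps when the readable text between separators is what matters.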

I wrote a simple program
main.c:

#include <stdio.h>

char *human_ch[]=
{
"NILL",
"EOL"
};
char code_buf[3];

// you can implement whatever you want for conversion to human-readable format
const char *human_readable(int ch_code)
{
    switch(ch_code)
    {
    case 0:
        return human_ch[0];
    case '\n':
        return human_ch[1];
    default:
        sprintf(code_buf,"%02x", (0xFF&ch_code) );
        return code_buf;
    }
}

int main( int argc, char **argv)
{
    int ch=0;
    FILE *ofile;
    if (argc<2)
        return -1;

    ofile=fopen(argv[1],"w+");
    if (!ofile)
        return -1;

    while( EOF!=(ch=fgetc(stdin)))
    {

        fprintf(ofile,"%s",human_readable(ch));
        switch(ch)
        {
            case 27:
            case 28:
            case 29:
            case 31:
                fputc('\n',ofile); //paragraph separator
                break;
            default:
                fputc(' ',ofile); //characters separator
                break;
        }
    }

    fclose(ofile);
    return 0;
}

Here is a small Python program that does what you want (at least the splitting part):
e.g.
on the bash command line:
python split.py input.txt $'\x1B'$'\x1C'

After splitting on whichever codes you specify (27 and 28 in this example), this produces the files input.txt.out.0001, input.txt.out.0002
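The script itself is not reproduced in this copy of the thread. The following is a guess at what such a split.py could look like, not the answerer's code. It assumes each separator byte is kept at the start of the chunk it opens and that output files are numbered from 0001; the helper names are hypothetical.

```python
import os
import sys


def split_bytes(data, separators):
    """Cut data into chunks; each byte found in separators starts a new chunk."""
    chunks, current = [], bytearray()
    for b in data:
        if b in separators and current:
            chunks.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        chunks.append(bytes(current))
    return chunks


def write_chunks(path, chunks):
    """Write chunks to path.out.0001, path.out.0002, ... and return the names."""
    names = []
    for i, chunk in enumerate(chunks, start=1):
        name = "%s.out.%04d" % (path, i)
        with open(name, "wb") as fh:
            fh.write(chunk)
        names.append(name)
    return names


def main(argv):
    path = argv[1]
    # Recover the raw bytes that the shell passed as $'\x1B'$'\x1C'.
    separators = os.fsencode(argv[2])
    with open(path, "rb") as fh:
        data = fh.read()
    for name in write_chunks(path, split_bytes(data, separators)):
        print(name)


# Only run when called as: python split.py INPUT SEPARATORS
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```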

You can then iterate over these files and convert them to a printable format by passing them to od:
for f in input.txt.out.*; do od "$f" > "$f.od"; done

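@NiklasB. 90% of the main method is GNU getopt parsing code that I never touch. The relevant part is the 7 lines below the "flex patterns" comment. Everything else is scaffolding from my skeleton project file. The same goes for the makefile; I just took it straight from my skeleton project.

@ninjalj Please post your comment as an answer and I will mark it as such. It is not pretty, but it works well. Unlike a Swiss Army chainsaw :)

od -a -An -v file | perl -0777ne 's/\n//g,print "$_\n " for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs'

od -a -An -v file → octal dump of the file with named characters (-a), no addresses (-An), and no suppressed duplicate lines (-v).

-0777 → slurp the whole file (the line separator is the nonexistent character 0777).

-n → read the input with an implicit loop (one record at a time; with -0777 the record is the whole file).

for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs → for each (/g) section that optionally starts with esc, fs, gs or us and contains a maximal sequence of characters (including newlines: /s) that does not contain esc, fs, gs or us.

s/\n//g → remove the newlines that come from the od output.

print "$_\n " → print the section and a newline (and a space, to match the od format).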
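The grouping step of the one-liner can also be written without a regular expression, by treating the `od -a` output as whitespace-separated tokens. The sketch below is a different technique from the Perl one (token matching rather than a lookahead regex) and the function name is hypothetical. It relies on the fact that `od -a` prints ordinary characters one per column, so a token such as `esc` or `fs` can only be a control-character name.

```python
# Tokens that open a new section in `od -a` output.
STARTERS = {"esc", "fs", "gs", "us"}


def group_tokens(od_text):
    """Group the tokens of `od -a -An -v` output into sections.

    Returns a list of strings, one per section, with the tokens of each
    section joined by single spaces.
    """
    sections, current = [], []
    for token in od_text.split():
        if token in STARTERS and current:
            sections.append(current)
            current = []
        current.append(token)
    if current:
        sections.append(current)
    return [" ".join(section) for section in sections]
```

Fed with `sys.stdin.read()`, this could sit at the end of a pipeline that starts with `od -a -An -v MyFile.log`, with the returned sections written one per line to the result file.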