Unicode-aware strings(1) program


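Does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that essentially does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English characters and punctuation. (I only care about English characters, not any other alphabet.)

Thanks!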

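Do you just want to use it, or do you for some reason insist on having the code?

On my Debian system, it seems the strings command can do this out of the box. See this excerpt from the man page: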

  --encoding=encoding
       Select the character encoding of the strings that are to be found.  Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO  8859,
       etc.,  default),  S  = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
       for finding wide character strings.

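Edit: OK. I don't know C#, so this may get a bit hairy, but basically you need to search for sequences of alternating English characters and zero bytes.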

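byte b;
int i = 0;                      // length of the current run, in characters
while (!endOfInput()) {
  b = getNextByte();
LoopBegin:
  if (!isEnglish(b)) {
    if (i > 0) report(i);       // report successful match of length i (report() is a placeholder)
    i = 0;
    continue;
  }
  if (endOfInput()) break;
  if ((b = getNextByte()) != 0) {
    if (i > 0) report(i);       // high byte is not zero, so the run ends here
    i = 0;
    goto LoopBegin;             // re-examine this byte as a possible run start
  }
  i++;                          // found another character
}
if (i > 0) report(i);           // flush a run that reaches the end of input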

This should work for little endian, where the English byte comes first and the zero byte second.
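The loop above can be made concrete. What follows is only a minimal Python sketch of the same idea, assuming the whole input fits in memory. The function name utf16le_strings and the min_len parameter are made up for illustration and are not part of the original answer.

```python
def utf16le_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable-ASCII characters stored as UTF-16LE."""
    found = []
    run = bytearray()            # low bytes of the current run
    pos = 0
    while pos < len(data):
        low = data[pos]
        is_english = 0x20 <= low <= 0x7E
        if is_english and pos + 1 < len(data) and data[pos + 1] == 0:
            run.append(low)      # printable byte followed by a zero byte
            pos += 2
            continue
        # The run is broken here: keep it only if it is long enough.
        if len(run) >= min_len:
            found.append(run.decode("ascii"))
        run.clear()
        pos += 1                 # advance one byte to resynchronize
    if len(run) >= min_len:      # flush a run that reaches the end of input
        found.append(run.decode("ascii"))
    return found
```

Advancing a single byte after a failed match lets the scan pick up strings that start at an odd offset. For big endian, the same loop would check for the zero byte first and the printable byte second.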

I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings. (UTF-8 is a variable-width encoding.)

Remember that, by default, characters outside ASCII need extra strings options. This covers almost all strings in languages other than English.

However, the output of "-e S" (single 8-bit characters) includes UTF-8 characters.

I wrote a very simple Perl script that applies "strings -e S ... | iconv ..." to the input files.

I believe it is easy to tune it to your specific restrictions. Usage:

utf8strings [options] file*

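#!/usr/bin/perl -s

our ($all, $windows, $enc);   ## -all: ignore the "3-letter word" restriction
use strict;
use utf8::all;                ## CPAN module: treat source, I/O and @ARGV as UTF-8

$enc = "ms-ansi" if     $windows;  ## -windows: decode the input as ms-ansi
$enc = "utf8"    unless $enc;      ## default encoding = utf8
my $iconv = "iconv -c -f $enc -t utf8 |";

## Turn each file argument into a pipeline that <> opens and reads from.
## Note: file names containing a single quote will break the shell command.
for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/;}

my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i;   # adapt this to your case

while(<>){
   # next if /regular expressions for common garbage/;
   print    if ($all or /$word/);
}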


In some situations, this approach produces a bit of extra garbage.
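The "3-letter word" restriction in the script is what keeps most of that garbage out. As a rough illustration, here is the same filter as a Python sketch. The names WORD and keep_wordlike are hypothetical, and the character class is reduced to plain ASCII letters on the assumption that only English matters.

```python
import re

# At least three consecutive ASCII letters; widen the class for other alphabets.
WORD = re.compile(r"[a-z]{3}", re.IGNORECASE)

def keep_wordlike(lines, keep_all=False):
    """Yield only the lines that contain something resembling a word."""
    for line in lines:
        # keep_all plays the role of the script's -all switch.
        if keep_all or WORD.search(line):
            yield line
```

A stricter pattern (longer words, or a dictionary lookup) would cut more garbage at the cost of dropping short identifiers.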

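For plain English and UTF-8, strings(1) should already work.

If the language doesn't matter, why not check the source code of the strings utility itself?

I need the code... I need to integrate it into a system I'm writing (in C, if it matters).

Thanks, this is exactly what I needed. It's obvious, now that I think of it; just skip the null bytes.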
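The first comment holds because English text in UTF-8 is byte-for-byte identical to ASCII, and multi-byte UTF-8 sequences only use bytes of 0x80 and above. A minimal Python sketch of that case, with the hypothetical names ASCII_RUN and ascii_strings and a minimum run length of 4 to match the default of strings(1):

```python
import re

# Four or more printable ASCII bytes in a row.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def ascii_strings(data: bytes) -> list[str]:
    """Return the printable-ASCII runs found in raw bytes."""
    return [match.decode("ascii") for match in ASCII_RUN.findall(data)]
```

Any non-English character splits a run, so "naïve café" encoded as UTF-8 yields only the fragment "ve caf".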