Twitter text compression challenge (tags: unicode, twitter, compression, code-golf)

Rules

Your program must have two modes: encoding and decoding.


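When encoding:

Your program must take as input some human-readable Latin-1 text, presumably English.

- It doesn't matter if you ignore punctuation.
- You only need to worry about actual English words, not L337.
- Any accented letters may be converted to simple ASCII.
- You may choose how you want to deal with numbers:
  - 123 may be handled as any of:
    - one two three
    - one hundred twenty three
    - 123
    - 1 2 3
  - one hundred twenty three may be handled as any of:
    - one two three
    - one hundred twenty three
    - 123
    - 1 2 3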

Your program must output a message that can be represented in 140 code points in the range `U+0000`–`U+10FFFF`, excluding non-characters:

- `U+FFFE`
- `U+FFFF`
- `U+nFFFE`, `U+nFFFF`, where `n` is `1`–`10` hexadecimal
- `U+FDD0`–`U+FDEF`
- `U+D800`–`U+DFFF` (surrogate code points)

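It may be output in any reasonable encoding of your choice; any supported encoding will be considered reasonable, and your platform's native encoding or locale encoding is likely to be a good choice.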
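The exclusions above can be checked mechanically. The following is a minimal sketch, not part of the challenge itself; the function names are made up for illustration, and the 140 code point limit is taken from the tweet length mentioned elsewhere in this thread.

```python
def is_allowed_code_point(cp: int) -> bool:
    """Return True if cp may appear in an encoded message under the rules above."""
    if not 0x0000 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:               # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF:               # non-character block
        return False
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):    # U+FFFE, U+FFFF, U+nFFFE, U+nFFFF
        return False
    return True


def is_valid_message(message: str, limit: int = 140) -> bool:
    """Check the length limit (in code points) and the excluded ranges."""
    return len(message) <= limit and all(
        is_allowed_code_point(ord(ch)) for ch in message
    )
```

Counting with this predicate leaves 1,111,998 usable code points, which is a little over 20 bits of information per code point.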

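When decoding:

Your program should take as input the output of your encoding mode.

The text output should be an approximation of the input text.

- The closer you can get to the original, the better.
- It doesn't need to have any punctuation.

The output text should be readable by a human, again presumably English.

- It may be L337 or lol.

The decoding process may not have access to any output of the encoding process other than the output specified above; that is, you can't upload the text somewhere and output a URL for the decoding process to download, or anything silly like that.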

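For the sake of a consistent user interface, your program must behave as follows:

Your program must be a script that can be set to executable on a platform with the appropriate interpreter, or a program that can be compiled into an executable.

Your program must take either encode or decode as its first argument to set the mode.

Your program must take input in at least one of the following ways:

- Take input from standard input and produce output on standard output.

      my-program encode <input.txt >output.utf
      my-program decode <output.utf >output.txt

- Take input from a file named in the second argument, and produce output in the file named in the third argument.

      my-program encode input.txt output.utf
      my-program decode output.utf output.txt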
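A minimal sketch of this interface follows, supporting both input styles. The `encode` and `decode` bodies are placeholders, and the choice of Latin-1 for plain text files and UTF-8 for encoded files is an assumption of this sketch, not something the rules mandate.

```python
import sys


def encode(text: str) -> str:
    """Placeholder: a real entry would compress the text here."""
    return text


def decode(text: str) -> str:
    """Placeholder: a real entry would reverse encode() here."""
    return text


def run(argv, stdin=sys.stdin, stdout=sys.stdout) -> int:
    """argv is the argument list without the program name."""
    if len(argv) not in (1, 3) or argv[0] not in ("encode", "decode"):
        print("usage: my-program encode|decode [INFILE OUTFILE]", file=sys.stderr)
        return 2
    mode = argv[0]
    transform = encode if mode == "encode" else decode
    if len(argv) == 1:
        # Style 1: standard input to standard output.
        stdout.write(transform(stdin.read()))
        return 0
    # Style 2: files named in the second and third arguments.
    # Assumption: plain text is Latin-1, encoded messages are UTF-8.
    in_enc, out_enc = ("latin-1", "utf-8") if mode == "encode" else ("utf-8", "latin-1")
    with open(argv[1], encoding=in_enc) as src:
        text = src.read()
    with open(argv[2], "w", encoding=out_enc) as dst:
        dst.write(transform(text))
    return 0
```

Ending the script with `sys.exit(run(sys.argv[1:]))` would wire this up to the command line.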

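For your solution, please post:

- Your code, in full, and/or a link to it hosted elsewhere (if it is very long, or requires many files to compile, or something).
- An explanation of how it works, if that is not obvious from the code, or if the code is long and people would be interested in a summary.
- An example text, with the original text, the text it compresses down to, and the decoded text.
- If you are building on an idea someone else had, please attribute them. It is fine to try to refine someone else's idea, but you must attribute them.

These rules are a variation on the rules of an existing challenge.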
PAQ8O10T

Not sure if I'll have the time or energy to follow this up with actual code, but here's my idea:

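Any arbitrary Latin-1 string below a certain length can simply be encoded (not even compressed) into the 140 characters with no loss. The naive estimate is 280 characters, although with the code point restrictions in the contest rules it may be a little shorter than that.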
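One way to reach the naive 280-character figure is to pack two Latin-1 bytes into each code point. The layout below is an assumption made for illustration, not the answer author's scheme: pairs go into plane 1, the two non-characters at the end of that plane are skipped, and an odd trailing byte gets its own code point in `U+0100`–`U+01FF`.

```python
def pack_latin1(text: str) -> str:
    """Losslessly pack Latin-1 text at two bytes per code point."""
    data = text.encode("latin-1")
    out = []
    for i in range(0, len(data) - 1, 2):
        value = (data[i] << 8) | data[i + 1]   # 0 .. 0xFFFF
        cp = 0x10000 + value                   # place the pair in plane 1
        if value >= 0xFFFE:                    # skip U+1FFFE and U+1FFFF
            cp += 2                            # lands on U+20000 or U+20001
        out.append(chr(cp))
    if len(data) % 2:                          # odd trailing byte
        out.append(chr(0x100 + data[-1]))      # U+0100 .. U+01FF
    return "".join(out)


def unpack_latin1(packed: str) -> str:
    """Reverse pack_latin1()."""
    data = bytearray()
    for ch in packed:
        cp = ord(ch)
        if cp < 0x10000:                       # trailing single byte
            data.append(cp - 0x100)
        else:
            value = cp - 0x10000
            if value >= 0x10000:               # undo the skip over U+1FFFE/U+1FFFF
                value -= 2
            data += bytes((value >> 8, value & 0xFF))
    return data.decode("latin-1")
```

Under this layout 280 Latin-1 characters fit in exactly 140 code points. Because far more than 65,536 code points remain allowed, a denser base conversion could in principle hold somewhat more than two bytes per code point.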

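Strings slightly longer than that (let's guesstimate between 280 and 500 characters) can most likely be shrunk with standard compression techniques into a string short enough to allow the above encoding.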

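Anything longer than that, and we start losing information in the text. So perform the minimum number of the following steps needed to reduce the string to a length that can then be compressed or encoded using the above methods. Also, don't perform these replacements on the entire string if performing them on just a substring would make it short enough (I would probably walk through the string backwards).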

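- Replace all Latin-1 characters above 127 (mainly accented letters and funky symbols) with their closest unaccented alphabetic equivalent, or possibly with a generic symbol such as "#".
- Replace all uppercase letters with their lowercase equivalents.
- Replace all non-alphanumerics (any remaining symbols or punctuation) with a space.
- Replace all digits with 0.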
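The four steps above can be sketched as follows. This version applies all of them to the whole string for simplicity, whereas the idea described here is to apply only as many as needed. Using Unicode NFKD decomposition to find the "closest unaccented equivalent" and collapsing runs of punctuation into a single space are assumptions of this sketch.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Replace characters above 127 with an ASCII equivalent, or '#'."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # NFKD splits e.g. an accented "e" into "e" plus a combining mark;
        # keep only the ASCII part of the decomposition.
        ascii_part = "".join(
            c for c in unicodedata.normalize("NFKD", ch) if ord(c) < 128
        )
        out.append(ascii_part or "#")  # '#' is the generic fallback symbol
    return "".join(out)


def simplify(text: str) -> str:
    """Apply the four lossy steps in order."""
    text = strip_accents(text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)  # assumption: runs collapse to one space
    text = re.sub(r"[0-9]", "0", text)
    return text
```

For example, `simplify("Café No. 42!")` returns `"cafe no 00 "`.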
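OK, so now we have eliminated as many superfluous characters as we can. Now we are going to make some more drastic cuts:

- Replace all double letters (balloon) with a single letter (balon). It will look odd, but hopefully the reader will still be able to decipher it.
- Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc.).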
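A minimal sketch of these two cuts, assuming the text has already been lowercased. The replacement table is hypothetical and contains only the two pairs named above; a real entry would presumably use a longer list.

```python
import re

# Hypothetical replacement table: only the two pairs named in the text.
DIGRAPHS = {"ck": "k", "wr": "r"}


def squeeze(text: str) -> str:
    """Collapse repeated letters, then shorten common letter combinations."""
    text = re.sub(r"([a-z])\1+", r"\1", text)  # balloon -> balon
    for long_form, short_form in DIGRAPHS.items():
        text = text.replace(long_form, short_form)
    return text
```

With this table, `squeeze("balloon")` gives `"balon"` and `squeeze("wreck")` gives `"rek"`. Neither step can be undone exactly, so the decoder would simply pass such words through.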
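OK, that is about as far as we can go and still have the text be readable. Beyond this, let's see if we can come up with a method so that the text still resembles the original, even if it is ultimately not decipherable (again, perform this operation)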
```
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;

use Getopt::Long;
use Pod::Usage;
use autodie;

my %opts = (
  infile  => '-',
  outfile => '-',
);

GetOptions(
  'encode|e'    => \$opts{encode},
  'decode|d'    => \$opts{decode},
  'infile|i=s'  => \$opts{infile},
  'outfile|o=s' => \$opts{outfile},
  'help|h'      => \&help,
  'man|m'       => \&man,
);

unless(
  # exactly one of these should be set
  $opts{encode} xor $opts{decode}
){
  help();
}

{
  # plain string comparisons are used here instead of smartmatch (~~),
  # which is experimental or deprecated in newer versions of Perl
  my $infile;
  if( $opts{infile} eq '-' or $opts{infile} eq '&0' ){
    $infile = *STDIN{IO};
  }else{
    open $infile, '<', $opts{infile};
  }

  my $outfile;
  if( $opts{outfile} eq '-' or $opts{outfile} eq '&1' ){
    $outfile = *STDOUT{IO};
  }elsif( $opts{outfile} eq '&2' ){
    $outfile = *STDERR{IO};
  }else{
    open $outfile, '>', $opts{outfile};
  }

  if( $opts{decode} ){
    # decoding just copies the input to the output
    while( my $line = <$infile> ){
      chomp $line;
      say {$outfile} $line;
    }
  }elsif( $opts{encode} ){
    # encoding replaces each run of non-word characters with one space
    while( my $line = <$infile> ){
      chomp $line;
      $line =~ s/[\W_]+/ /g;
      say {$outfile} $line;
    }
  }else{
    die 'How did I get here?';
  }
}

sub help{
  pod2usage();
}

sub man{
  # verbose level 2 prints the full manual page
  pod2usage(-exitval => 0, -verbose => 2);
}

__END__

=head1 NAME

sample.pl - Using Getopt::Long and Pod::Usage

=head1 SYNOPSIS

sample.pl [options] [file ...]

 Options:
   --help     -h  brief help message
   --man      -m  full documentation
   --encode   -e  encode text
   --decode   -d  decode text
   --infile   -i  input filename
   --outfile  -o  output filename

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=item B<--encode>

Removes any character other than /\w/.

=item B<--decode>

Just reads from one file, and writes to the other.

=item B<--infile>

Input filename. If this is '-' or '&0', then read from STDIN instead.
If you use '&0', you must pass it in with quotes.

=item B<--outfile>

Output filename. If this is '-' or '&1', then write to STDOUT instead.
If this is '&2', then write to STDERR instead.
If you use '&1' or '&2', you must pass it in with quotes.

=back

=head1 DESCRIPTION

B<This program> will read the given input file(s) and do something
useful with the contents thereof.

=cut
```
```
$ echo Hello, this is, some text | perl sample.pl -e
Hello this is some text
```
A white dwarf is a small star composed mostly of electron-degenerate matter. Because a white dwarf's mass is comparable to that of the Sun and its volume is comparable to that of the Earth, it is very dense.

A white dwarf be small star composed mostly electron degenerate matter because white dwarf mass be comparable sun IT volume be comparable earth IT be very dense