C++: Predicting the next character in 'random' text generation based on some input file


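I am writing a program that generates random text based on a Markov model. I have run into a problem: some files have many spaces between words, and once the initial seed consists of spaces, every following character is also chosen to be a space. The generated random text is then just a blank document, because nextChosenChar is always a space.

Can anyone suggest a way to solve this?

I tried to come up with a solution, shown in the second half of the code below, but without success.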

char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {    
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info 
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);

    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
     if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}

One solution is to stream the characters from the file one at a time, so that your read loop looks more like this:

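// order is only known at run time, so allocate the buffer dynamically.
char* buffer = new char[order];
// Prime the buffer with the first `order` characters of the file.
inputStream.read(buffer, order);

char next_char;
while ( inputStream.get(next_char) )
{
   string key(buffer, order);
   if (equalsIgnoreCase(key, seed)) {
      // only insert key if not present otherwise overwriting old info
      if (!nextCharMap.containsKey(seed)) {
         nextCharMap.put(seed, Vector<char>());
      }
      // next_char is the character directly following the seed
      nextCharMap[seed].push_back(next_char);
   }
   // Update the buffer: shift left by one and append the new character.
   for (int i = 1; i < order; ++i) buffer[i-1] = buffer[i];
   buffer[order-1] = next_char;
}
delete[] buffer;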
....
while ( inputStream.get(next_char) )
{
   //Remove multiple spaces from input.
   if( next_char==' ' and buffer[order-1]==' ') continue;

   string key(buffer, order);
   ....
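
The two fragments above could be combined into a single pass over the input, roughly as in the following sketch. It is not the original program: buildModel is a hypothetical helper, it uses the standard containers instead of the Map and Vector classes from the question, and it compares keys case-sensitively (the question uses equalsIgnoreCase). It records the followers of every key of length order, so the file does not have to be read again for each generated character.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): builds an order-k
// character model in one pass over the stream. Runs of consecutive
// spaces in the input are collapsed into a single space.
std::map<std::string, std::vector<char>>
buildModel(std::istream& in, int order)
{
    assert(order > 0);
    std::map<std::string, std::vector<char>> model;
    std::string window;  // the most recent characters kept, at most `order` of them
    char c;
    while (in.get(c)) {
        // Skip a space that directly follows another kept space.
        if (c == ' ' && !window.empty() && window.back() == ' ')
            continue;
        if (window.size() == static_cast<std::size_t>(order)) {
            model[window].push_back(c);  // c follows the current key
            window.erase(0, 1);          // slide the window by one character
        }
        window.push_back(c);
    }
    return model;
}
```

With an input stream containing "a  b a   c" and order 2, the key "a " ends up with the followers 'b' and 'c', because the repeated spaces are dropped before the keys are formed.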

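If you really want a character-based model, you will not get very natural text as output, but it is definitely possible, and the model will basically be able to handle sequences of space characters as well. If you consider them a natural part of the text, there is no need to remove them from the input.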

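What matters is that a Markov model does not always fall back to predicting the single character that has the highest probability at a given stage. Instead, it has to look at the whole probability distribution of the possible characters and choose one at random.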

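Here, random means that it picks a character that is not predetermined by the programmer. The random distribution is not uniform, however, i.e. not all characters are equally likely. It has to take the relative probabilities of the possible characters into account. One way to do this is to build a cumulative probability distribution of the characters. For example, if the probabilities are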

p('a') == 0.2
p('b') == 0.4
p('c') == 0.4
we represent them by the cumulative values P:

P('a') == p('a') == 0.2
P('b') == p('a') + p('b') == 0.6
P('c') == p('a') + p('b') + p('c') == 1.0
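Then, to generate a random character, we first generate a uniformly distributed random number N between 0 and 1, and then choose the first character whose cumulative probability is not less than N.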
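
The selection rule can be written down in a few lines. This is a minimal sketch using the cumulative values 0.2, 0.6 and 1.0 from the example; the names Cumulative, pick and pickRandom are made up for illustration and are not part of the code further down.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Each entry holds a character and the running total of the
// probabilities up to and including that character.
typedef std::vector<std::pair<char, double>> Cumulative;

// Returns the first character whose cumulative probability is not
// less than n.
char pick(const Cumulative& dist, double n)
{
    assert(!dist.empty());
    for (const auto& entry : dist)
        if (entry.second >= n)
            return entry.first;
    // Only reached if rounding left the last total slightly below n.
    return dist.back().first;
}

// Draws N uniformly from [0, 1) and applies the rule above.
template <typename Generator>
char pickRandom(const Cumulative& dist, Generator& gen)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return pick(dist, uniform(gen));
}
```

For the distribution {('a', 0.2), ('b', 0.6), ('c', 1.0)}, N = 0.5 selects 'b' and N = 0.95 selects 'c'. Over many draws 'b' and 'c' should each come up about twice as often as 'a'.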

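I have implemented this in the example code below. The train() procedure generates, for each character in the training input, the cumulative probability distribution of the characters that follow it. The predict() procedure applies this to generate random text.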

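For a full implementation, this still lacks:

  • A representation of the probability distribution of the initial character. As you can see in the main() function, my output always starts with 't'.
  • A representation of the length of the output string, or of a final character. main() always generates a string of the same fixed length (101 characters, as the loop is written).
The code was tested on Linux with GCC 4.7.0 (C++11 option). Example output is shown further below.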

#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>

template <typename Char>
class Markov
{
public:
  /* Data type used to count the frequencies (integer!) of
     characters. */
  typedef std::map<Char,unsigned>            CharDistributionMap;

  /* Data type used to represent a cumulative probability (float!)
     distribution. */
  typedef std::vector<std::pair<Char,float>> CharDistribution;

  /* Data type used to represent the Markov model. Each character is
     mapped to a probability distribution of the characters that follow
     it. */
  typedef std::map<Char,CharDistribution>    MarkovModel;


  /* The model. */
  MarkovModel  _model;

  /* Training procedure. */
  template <typename Iterator>
  void train(Iterator from, Iterator to)
  {
    _model = {};
    if (from == to)
      return;

    std::map<Char,CharDistributionMap> proto_model {};

    /* Count frequencies. */
    Char current = *from;
    while (true) {
      ++from;
      if (from == to)
        break;
      Char next = *from;
      proto_model[current][next] += 1;
      current = next;
    }

    /* Transform into probability distribution. */
    for (const auto &entry : proto_model) {
      const Char current              = entry.first;
      const CharDistributionMap &freq = entry.second;

      /* Calculate total frequency of current character. */
      unsigned total =
         std::accumulate(std::begin(freq),std::end(freq),0,
           [](unsigned res,const std::pair<Char,unsigned> &p){
                   return res += p.second;
               });

      /* Determine the probability distribution of characters that
         follow the current character. This is calculated as a cumulative
         probability. */
      CharDistribution dist {};
      float probability { 0.0 };
      std::for_each(std::begin(freq),std::end(freq),
             [total,&probability,&dist](const std::pair<Char,unsigned> &p){
                   // using '+=' to get cumulative probability:
                   probability += static_cast<float>(p.second) / total; 
                   dist.push_back(std::make_pair(p.first,probability));
             });

      /* Add probability distribution for current character to the model. */
      _model[current] = dist;
    }
  }


  /* Predict the next character, assuming that training has been
     performed. */
  template <typename RandomNumberGenerator>
  Char predict(RandomNumberGenerator &gen, const Char current)
  {
    static std::uniform_real_distribution<float> generator_dist { 0, 1 };

    /* Assume that the current character is known to the model. Otherwise,
       an std::out_of_range exception will be thrown. */
    const CharDistribution &dist { _model.at(current) };

    /* Generate random number between 0 and 1. */
    float random { generator_dist(gen) };

    /* Identify the first character whose cumulative probability is
       not less than the random number generated. */
    auto res =
         std::lower_bound(std::begin(dist),std::end(dist),
                          std::make_pair(Char(),random),
             [](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
                    return (p1.second < p2.second);
             });
    if (res == std::end(dist)) {
      if (dist.empty())
        throw "Empty probability distribution. This should not happen.";
      /* Rounding can leave the last cumulative value slightly below 1;
         fall back to the last character in that case. */
      --res;
    }
    return res->first;
  }

};

int main()
{
  /* Initialize random-number generator. */
  std::random_device rd;
  std::mt19937 gen(rd());


  std::string input { "this   is    some   input text   with   many spaces." };

  if (input.empty())
    return 1;

  /* We append the first character to the end, to ensure that even the
     last character of the text gets a non-empty probability
     distribution. A more proper way of dealing with characters that
     have empty distributions would be _smoothing_. */
  input += input[0];

  Markov<char> markov {};
  markov.train(std::begin(input),std::end(input));

  /* We set the initial character. In a real stochastic model, there
     would have to be a separate probability distribution for the
     initial character and we would choose the initial character
     randomly, too. */
  char current_char { 't' };

  for (unsigned i = 0 ; i < 100 ; ++i) {
    std::cout << current_char;
    current_char = markov.predict(gen,current_char);
  }
  std::cout << current_char << std::endl;
}

As you can see, the distribution of space characters follows the distribution in the input text in a fairly natural way.
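
The first missing piece noted above, a distribution for the initial character, could be approximated by weighting each character by how often it occurs in the training text. The sketch below is one possible way to do that, not part of the tested code: pickInitialChar is a hypothetical helper, and using overall character frequencies as the starting distribution is an assumption (a real model might only count characters that begin a word or sentence).

```cpp
#include <cassert>
#include <map>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: picks a starting character at random, weighted
// by how often each character occurs in the training text.
template <typename Generator>
char pickInitialChar(const std::string& text, Generator& gen)
{
    assert(!text.empty());

    // Count how often each character occurs.
    std::map<char, unsigned> counts;
    for (char c : text)
        ++counts[c];

    // Split the counts into parallel vectors of characters and weights.
    std::vector<char> chars;
    std::vector<double> weights;
    for (const auto& entry : counts) {
        chars.push_back(entry.first);
        weights.push_back(entry.second);
    }

    // std::discrete_distribution returns index i with probability
    // weights[i] / (sum of all weights).
    std::discrete_distribution<int> index(weights.begin(), weights.end());
    return chars[index(gen)];
}
```

In main(), the fixed 't' could then be replaced by something like pickInitialChar(input, gen), called with the same string that is passed to train(), so that the chosen character is known to the model.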

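Most people who use Markov models for natural language processing and the like apply a tokenizer before the training procedure, so whitespace of any kind never appears in the model. Do you want to build a character-based (rather than token-based) model for a particular purpose?

I am doing it character-based; this is more of a personal project. Is there any way to solve this kind of problem other than using a tokenizer?

Kernighan & Pike's excellent book has a whole chapter devoted to the implementation of a Markov model program that generates semi-plausible language based on an input text.

Another thing I only noticed now is that your program always generates the single character with the highest probability. The way I understand Markov chains, in each state (i.e. after each character) the model produces the whole probability distribution of the possible characters and then picks one of them at random according to that distribution.

One obvious solution is not to use files that contain a lot of whitespace. Bad training data = bad model.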
t  mext s.t th   winy  iny  somaces      sputhis inpacexthispace te  iny            me   mext mexthis

tes    is  manputhis.th is  wis.th with it    is  is.t  s   t   winy    it mext    is        ispany

this  maces      somany  t    s        it this  winy sputhisomacext manput    somanputes  macexte iso

t   wispanpaces maces  tesomacexte s  s  mes.th     isput t wit   t   somanputes   s  withit  sput ma