C++: finding frequent words in a text file, wrong order and counts


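I am trying to find the top k most frequent words in a csv file. The original file is a csv with more than 1M lines, so I have skipped some stages to focus on the problem area.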

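Before this stage I removed the punctuation and converted everything to lowercase, so the test text contains only words and numbers. I also skip numbers when I parse the data.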
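The preprocessing described above could be sketched roughly as follows. This is only an illustration under assumptions: `normalize_token` and `is_all_digits` are hypothetical helper names that do not appear in the posted code, and the sketch treats only ASCII letters and digits as word characters.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keeps ASCII letters and digits, lowercases the letters,
// and drops everything else (punctuation, quotes, and so on).
std::string normalize_token(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (std::isalnum(c))
            out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Hypothetical helper: true only for a non-empty token made entirely of digits.
bool is_all_digits(const std::string& token) {
    if (token.empty())
        return false;
    for (unsigned char c : token) {
        if (!std::isdigit(c))
            return false;
    }
    return true;
}
```

A caller would skip a token when `normalize_token` returns an empty string or `is_all_digits` returns true. Mixed tokens such as kindle2 are kept, which matches the result lists in this post.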

Here is my code: header -->


When a site is removed from the top ten list, its flag first_ten_rank must be reset to zero; otherwise it will never enter the top ten again.

void site::check_ten(int place)
{
  if (site_placement[place].first_ten_rank)
  {
      sort_rank();
      return;
  }
  else if (rankings[0]->count > site_placement[place].count)
      return;

  //When a site is removed from the top ten list, its first_ten_rank must be set to zero.
  //otherwise, a removed one from topten list will never enter the topten again
  rankings[0]->first_ten_rank = 0;

  rankings[0] = &site_placement[place];
  site_placement[place].first_ten_rank = 1;
  sort_rank();
}
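To see why the reset matters, here is a small self-contained sketch of the same idea, separate from the posted class. `Entry`, `TopTracker`, `in_top` and `update` are hypothetical names introduced for illustration; `in_top` plays the role of first_ten_rank, and the list size is a constructor parameter instead of the fixed ten.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Entry {
    std::string name;
    int count = 0;
    bool in_top = false;  // same role as first_ten_rank in the posted code
};

// Hypothetical tracker: pointers to the highest counts, sorted lowest first,
// so top[0] is always the next candidate for eviction.
struct TopTracker {
    std::vector<Entry*> top;
    std::size_t capacity;

    explicit TopTracker(std::size_t n) : capacity(n) { assert(n > 0); }

    // Call after changing e.count. The caller owns the entries and must keep
    // them alive and at stable addresses.
    void update(Entry& e) {
        if (!e.in_top) {
            if (top.size() < capacity) {
                top.push_back(&e);
            } else if (top[0]->count > e.count) {
                return;                  // not high enough to enter the list
            } else {
                top[0]->in_top = false;  // reset the evicted entry's flag
                top[0] = &e;
            }
            e.in_top = true;
        }
        std::sort(top.begin(), top.end(),
                  [](const Entry* a, const Entry* b) { return a->count < b->count; });
    }
};
```

Without the line that clears `in_top`, an evicted entry would still look like a member of the list, so later calls would only re-sort and never put it back. That is consistent with the posted output, where frequent words are missing from the top ten.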

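Move the code onto Stack Overflow. Links disappear over time, so you should show your code here.

I would expect the task at hand, "find frequent words in a text file", to take at most a dozen lines of code, rather than this long detour through a land of buggy hashes and magic constants; and it still does not meet all of the posting requirements explained on stackoverflow.com. It would be faster to throw all of this away and start from scratch. The end result should be about a dozen lines of code: a std::map, and a loop that reads each word, lowercases it, and increments its counter in the map. After EOF, invert the map and sort by highest count. The End.

At this point you might as well post site.h too. After all, how are we supposed to know how big your array declarations are?

@A.Sky Use std::map for this. The OP is using raw arrays and trying to do the memory management by hand, which is not worth it unless you are 100% sure that your implementation outperforms what the C++ standard library provides.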
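The std::map approach suggested in the comments could look roughly like the sketch below. It is one possible reading of that suggestion, not the commenters' code: `top_words` and `top_words_from_string` are hypothetical names, the input is assumed to be whitespace-separated with punctuation already removed, and ties are broken alphabetically, which the comments do not specify.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Hypothetical helper: counts whitespace-separated words from `in` and
// returns at most k (word, count) pairs, highest count first.
std::vector<WordCount> top_words(std::istream& in, std::size_t k) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        ++counts[word];  // operator[] creates a missing key with count 0
    }

    // "Invert" the map: copy the pairs out and order them by count.
    std::vector<WordCount> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second)
                      return a.second > b.second;  // higher count first
                  return a.first < b.first;        // assumed tie-break: alphabetical
              });
    if (sorted.size() > k)
        sorted.resize(k);
    return sorted;
}

// Convenience wrapper for trying the function on an in-memory string.
std::vector<WordCount> top_words_from_string(const std::string& text, std::size_t k) {
    std::istringstream in(text);
    return top_words(in, k);
}
```

For a file, the same function can be called with a `std::ifstream` opened on it. Sorting every distinct word costs more than keeping a running top ten, but it removes the ranking flags and the hand-written hash table altogether.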
site::site()
{}
int site::dh(string n, int i) const
{
  return abs(dhash1(n) + i * dhash2(n)) % size;
}
int site::dhash1(string name) const
{
  int site_res = 7;
  for (int i = 0; i < name.length(); i++)
      site_res = (site_res * 31 + name[i]) % 1000000;
  return abs(site_res) % size;
}
int site::getSize()
{
  return size;
}
int site::dhash2(string name) const
{
  int site_res = 7;
  for (int i = 0; i < name.length(); i++)
      site_res = (site_res * 31 + name[i]) % 1000000;
  return 1 + (abs(site_res) % (size - 1));
}
int site::find(string name) const
{
  int i = 0;
  int check_pl = dh(name, i);
  // Stop at an empty slot, or after probing `size` slots (table full).
  while (site_placement[check_pl].count != 0 && i < size)
  {
      if (site_placement[check_pl].name == name)
          return check_pl;
      i++;
      check_pl = dh(name, i);
  }
  return -1;
}
void site::add(string name)
{
  int first_check = find(name);
  if (first_check == -1)
  {
      int i = 0;
      int place = dh(name, i);
      while (site_placement[place].count != 0)
      {
          i += 1;
          place = dh(name, i);
      }
      site_placement[place].name = name;
      site_placement[place].count = 1;
      check_ten(place);
  }
  else
  {
      site_placement[first_check].count++;
      check_ten(first_check);
  }
}
void site::check_ten(int place)
{
  if (site_placement[place].first_ten_rank)
  {
      sort_rank();
      return;
  }
  else if (rankings[0]->count > site_placement[place].count)
      return;
  rankings[0] = &site_placement[place];
  site_placement[place].first_ten_rank = 1;
  sort_rank();
}
void site::print_ten() const
{
  cout << "RANKINGS" << "- - -" << "SITE" << "- - -" << "HIT" << endl;
  for (int i = 9; i > -1; i--)

      cout << 10 - i << "-)" << "- - -" << rankings[i]->name << "- - -" << rankings[i]->count << "- - -" << endl;
}
void site::sort_rank()
{
  sitecount* temp;
  for (int i = 1; i < 10; i++)
  {
      int j = i;
      while (j > 0 && (rankings[j - 1]->count) > (rankings[j]->count))
      {
          temp = rankings[j];
          rankings[j] = rankings[j - 1];
          rankings[j - 1] = temp;
          j--;
      }
  }
}
site::site(string file_name)
{
  for (int i = 0; i < 10; i++)
      rankings[i] = &site_placement[i];
  ifstream a;
  string s;
  s.clear();
  a.open(file_name.c_str());
  assert(a.is_open() == 1 && "File could not be found");
  string one, two, three, four, five, six, seven, eight, nine, zero;
  one = "1"; two = "2"; three = "3"; four = "4"; five = "5";
  six = "6"; seven ="7"; eight = "8"; nine = "9"; zero = "0";
  while (a>>s ) {
      if(!(s.length()==0)&& s.compare(one)&& s.compare(two)&& s.compare(three)&& s.compare(four)&& s.compare(five)
          && s.compare(six)&& s.compare(seven)&& s.compare(eight)&& s.compare(nine)&&s.compare(zero))
      add(s);
  }
  a.close();
}
#include "site.h"
#include <time.h>
using namespace std;
int main()
{
  const clock_t begin_time = clock();
  site my_site("output.txt");
  my_site.print_ten();
  clock_t end_time2 = clock();
  cout << "It took : " << end_time2 - begin_time << " milliseconds" << endl;
  system("PAUSE");

  return 0;

}
My results
1) love--31
2) kindle2--20
3) latex--10
4) tek--8
5) lt3--5
6) cool--4
7) lot--4
8) blah--3
9) card--3
10) favorite--2

True results

time     48
night    37
good     34
warner   34
love     31
museum   26
nike     26
im       26
gm       22
jquery   21
twitter  20
lebron   20
great    20
google   20
safeway  20
kindle2  20
hate     19
rt       19
today    19
watch    18
api      16
day      15
amp      15
atampt   15
work     14