How to instruct Hive to use the partition/primary key when querying Cassandra

Tags: hadoop, cassandra, hive, apache-pig, datastax-enterprise


We are running DataStax Enterprise 4.0.1 and trying to run various MapReduce jobs against a column family in Cassandra. We have set up the column family as follows:

CREATE TABLE pageviews (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
) WITH bloom_filter_fp_chance=0.001000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};
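The upside of Hive is that it handles the CQL3 "flattening", abstracting away Cassandra's underlying column/row storage mechanism. The downside is that it does not use Cassandra's partition key or primary key to perform fast lookups, for example: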

SELECT COUNT(1) FROM pageviews WHERE website = "blah" AND date = "blah";
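Running that MR job appears to perform a full table scan instead of first narrowing down the range of keys it has to parse. Given the obvious benefit of filtering on the partition/primary key, is there a way to tell Hive not to perform a full table scan?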


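Side note: when using Pig, it seems that it can and does use Cassandra's partition/primary key to perform fast lookups. The downside of Pig is that we have to do all the filtering and flattening ourselves, which greatly slows down writing new jobs.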

Your best bet is to use Pig with cql:// and CqlStorage(), which does the heavy lifting of flattening the Cassandra data for you, e.g.:

grunt> pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();
grunt> describe pageviews;
pageviews: {website: chararray,date: chararray,created: long,browser_id: chararray,ip: chararray,referer: chararray,user_agent: chararray}
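
As a rough sketch, the count from the question could then be expressed on top of that relation using only standard Pig Latin operators. The keyspace name ks and the 'blah' values are placeholders carried over from above, and the aliases matching, grouped and total are hypothetical names chosen for illustration. Whether the FILTER ends up as a partition-key lookup in Cassandra or is evaluated after the rows have been read depends on the Cassandra/DSE version, so this sketch does not assume either.

```pig
-- Load the CQL3 table; CqlStorage() exposes each CQL row as a flat tuple.
pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();

-- Keep only the rows for one (website, date) partition.
-- 'blah' is the placeholder value used in the question.
matching = FILTER pageviews BY (website == 'blah') AND (date == 'blah');

-- Collapse everything into a single group so the rows can be counted.
grouped = GROUP matching ALL;

-- COUNT skips tuples whose first field is null; website is part of the
-- partition key, so it is always set.
total = FOREACH grouped GENERATE COUNT(matching);

DUMP total;
```

Comparing the input record count reported for this job with the total size of the table is one way to see whether the whole column family was scanned.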