OperationTimeoutException on a Cassandra cluster on AWS/EMR


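I have a Java application running on Amazon, against a Cassandra cluster managed by Priam.

We use Amazon's Elastic MapReduce service, and at some point, when I run an EMR job and try to insert some data into Cassandra, I get an OperationTimeoutException.

These are the configuration parameters I pass when creating the Cassandra connection pool with Astyanax: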

ConnectionPoolConfigurationImpl conPool = new ConnectionPoolConfigurationImpl(getConecPoolName())
    .setMaxConnsPerHost(20)
    .setSeeds("ec2-xx-xxx-xx-xx.compute-1.amazonaws.com")
    .setMaxOperationsPerConnection(100)
    .setMaxPendingConnectionsPerHost(20)
    .setConnectionLimiterMaxPendingCount(20)
    .setTimeoutWindow(10000)
    .setConnectionLimiterWindowSize(1000)
    .setMaxTimeoutCount(3)
    .setConnectTimeout(5000)
    .setMaxFailoverCount(-1)
    .setLatencyAwareBadnessThreshold(20)
    .setLatencyAwareUpdateInterval(1000)
    .setLatencyAwareResetInterval(10000)
    .setLatencyAwareWindowSize(100)
    .setLatencyAwareSentinelCompare(100f);


AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster("clusterName")
    .forKeyspace("keyspaceName")
    .withAstyanaxConfiguration(
        new AstyanaxConfigurationImpl().setDiscoveryType(NodeDiscoveryType.NONE))
    .withConnectionPoolConfiguration(conPool)
    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
    .buildKeyspace(ThriftFamilyFactory.getInstance());
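
So I don't know which direction to take, because the problem could be in the Astyanax pool configuration, the EC2 machine configuration (more memory?), the Priam configuration, or some other setting my code needs for Cassandra or the EMR service on AWS... Any hints?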
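
Two of the settings above, setTimeoutWindow(10000) and setMaxTimeoutCount(3), work together. Our reading (an assumption, not confirmed in this thread) is that a host is marked down once the maximum number of timeouts is reached inside the timeout window. The stdlib-only sketch below models that rule with a sliding window; TimeoutWindowTracker is a hypothetical name and this is not Astyanax's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical model of a "timeout window" policy: a host counts as bad once
 * maxTimeoutCount timeouts fall within timeoutWindowMs milliseconds.
 * Illustrates the idea only; it is not Astyanax code.
 */
final class TimeoutWindowTracker {
    private final int maxTimeoutCount;
    private final long timeoutWindowMs;
    // Timestamps (ms) of recent timeouts, oldest first.
    private final Deque<Long> timeouts = new ArrayDeque<>();

    TimeoutWindowTracker(int maxTimeoutCount, long timeoutWindowMs) {
        this.maxTimeoutCount = maxTimeoutCount;
        this.timeoutWindowMs = timeoutWindowMs;
    }

    /** Records a timeout at nowMs; returns true if the host should be marked down. */
    boolean recordTimeout(long nowMs) {
        timeouts.addLast(nowMs);
        // Drop timeouts that have slid out of the window.
        while (!timeouts.isEmpty() && nowMs - timeouts.peekFirst() >= timeoutWindowMs) {
            timeouts.removeFirst();
        }
        return timeouts.size() >= maxTimeoutCount;
    }

    public static void main(String[] args) {
        // Same values as the question: setMaxTimeoutCount(3), setTimeoutWindow(10000).
        TimeoutWindowTracker tracker = new TimeoutWindowTracker(3, 10_000);
        System.out.println(tracker.recordTimeout(0));     // false: 1 timeout in the window
        System.out.println(tracker.recordTimeout(4_000)); // false: 2 timeouts
        System.out.println(tracker.recordTimeout(9_000)); // true: 3 timeouts within 10 s
    }
}
```

Under this model a single slow call, like the latency=10004 in the trace below, does not trip the rule on its own; it takes three timeouts inside any 10-second span.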


Stack trace:

ERROR com.s1mbi0se.dg.input.service.InputService (main): EXCEPTION:OperationTimeoutException: [host=ec2-xx-xxx-xx-xx.compute-1.amazonaws.com(10.100.6.242):9160, latency=10004(10004), attempts=1]TimedOutException()

com.netflix.astyanax.connectionpool.exceptions.OperationTimeoutException: OperationTimeoutException: [host=ec2-54-224-65-18.compute-1.amazonaws.com(10.100.6.242):9160, latency=10004(10004), attempts=1]TimedOutException()
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:171)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:61)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1$2.execute(ThriftColumnFamilyQueryImpl.java:206)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1$2.execute(ThriftColumnFamilyQueryImpl.java:198)
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69)
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:253)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1.execute(ThriftColumnFamilyQueryImpl.java:196)
    at com.s1mbi0se.dg.input.service.InputService.searchUserByKey(InputService.java:833)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7874)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:594)
    at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:578)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1$2.internalExecute(ThriftColumnFamilyQueryImpl.java:211)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$1$2.internalExecute(ThriftColumnFamilyQueryImpl.java:198)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:56)
INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
WARN org.apache.hadoop.mapred.Child (main): Error running child
java.lang.RuntimeException: InvalidRequestException(why:Start key's token sorts after end token)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator.maybeInit(ColumnFamilyRecordReader.java:453)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator.computeNext(ColumnFamilyRecordReader.java:459)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator.computeNext(ColumnFamilyRecordReader.java:406)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:522)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:547)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: InvalidRequestException(why:Start key's token sorts after end token)
    at org.apache.cassandra.thrift.Cassandra$get_paged_slice_result.read(Cassandra.java:14168)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_paged_slice(Cassandra.java:769)
    at org.apache.cassandra.thrift.Cassandra$Client.get_paged_slice(Cassandra.java:753)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator.maybeInit(ColumnFamilyRecordReader.java:438)
    ... 
INFO org.apache.hadoop.mapred.Task (main): Runnning cleanup for the task

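So what happens if you set the timeout to -1? Personally, I would dig into the Astyanax code and try to figure out how to disable the timeout. Then run your job again; it should keep running, although if you are hitting timeouts your cluster may be getting hammered... I assume you're OK with that.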

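EDIT (after your edit): Oops, I forgot to ask which version of Cassandra you are using. I'm looking at this code, but its line 346 corresponds to your line 438. You are using the wide-row iterator, which most likely means a row scan (i.e., slices of a row).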
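
To picture what a wide-row iterator does, here is a hypothetical in-memory model built on plain Java collections (it is not Cassandra's ColumnFamilyRecordReader): the columns of one row are read in fixed-size pages, and each page resumes strictly after the last column returned by the previous page. WideRowPager and the batch size are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical model of paging through one wide row; not Cassandra code. */
final class WideRowPager {
    /** Returns the row's column names split into pages of at most batchSize columns. */
    static List<List<String>> pageColumns(NavigableMap<String, String> row, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        List<List<String>> pages = new ArrayList<>();
        String lastColumn = null;
        while (true) {
            // Resume strictly after the last column already returned (or from the start).
            NavigableMap<String, String> rest =
                    lastColumn == null ? row : row.tailMap(lastColumn, false);
            List<String> page = new ArrayList<>();
            for (String column : rest.keySet()) {
                if (page.size() == batchSize) break;
                page.add(column);
            }
            if (page.isEmpty()) return pages; // nothing left to read
            pages.add(page);
            lastColumn = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < 7; i++) row.put("col" + i, "value" + i);
        System.out.println(pageColumns(row, 3)); // [[col0, col1, col2], [col3, col4, col5], [col6]]
    }
}
```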

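In it, we can at least see that this is a key range, but paged, because the rows may be too wide (there is another iterator for rows that are not too wide to fit in memory). I believe you are right that you cannot use two partitioner types. To get more information, I strongly suggest modifying ColumnFamilyRecordReader.java to log the ColumnFamilySplit. You can log it in the initialize method, and you can also log the JobRange (it also has a toString).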
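
The second error in the trace fits the partitioner theory. As we understand it (an assumption worth checking against your Cassandra version), after the first page the wide-row iterator asks for a range that starts at the last key it saw and ends at the split's end token, and the server rejects the request when that key's token, computed with the cluster's partitioner, sorts after the end token. If the Hadoop job computes splits with a different partitioner than the cluster uses, keys and split boundaries stop agreeing. The sketch below models this with a RandomPartitioner-style token (the absolute value of the key's MD5 digest read as a signed BigInteger). TokenRangeCheck, MIN_TOKEN and the example key are hypothetical, and validate() is a simplified model, not Cassandra's validation code.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical, simplified model of a token-range check; not Cassandra code. */
final class TokenRangeCheck {
    // Assumed "minimum token" of this model: an end token equal to it means "wrap to the end of the ring".
    static final BigInteger MIN_TOKEN = BigInteger.valueOf(-1);

    /** RandomPartitioner-style token: |MD5(key)| read as a signed big-endian BigInteger. */
    static BigInteger md5Token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    /** Rejects a (startKey, endToken] request whose start key hashes past the end token. */
    static void validate(String startKey, BigInteger endToken) {
        BigInteger startToken = md5Token(startKey);
        if (startToken.compareTo(endToken) > 0 && !endToken.equals(MIN_TOKEN)) {
            throw new IllegalArgumentException("Start key's token sorts after end token");
        }
    }

    public static void main(String[] args) {
        String key = "user:42"; // hypothetical row key
        BigInteger token = md5Token(key);
        validate(key, token.add(BigInteger.ONE)); // accepted: key lies inside the range
        validate(key, MIN_TOKEN);                 // accepted: range wraps to the end of the ring
        try {
            // An end token computed some other way (e.g. by a different partitioner).
            validate(key, token.subtract(BigInteger.ONE));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```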

Your version has a lot in common with this code. Which version are you using?

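Besides the split, I would also log the key slices, just in case; if I remember correctly, either one could be causing the error. Let me know your version and add some logging so we get more information about your situation. (Their code usually compiles easily, without problems.)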
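
The logging Dean suggests would be a log call in ColumnFamilyRecordReader.initialize (for the ColumnFamilySplit and the JobRange) plus one wherever a key slice comes back. Since Cassandra's classes are not reproduced here, this is a stdlib-only sketch of the kind of line worth emitting: split bounds next to the first and last keys of the slice. SplitLogging, describe() and the sample values are hypothetical, and the (start, end] notation assumes a start-exclusive, end-inclusive token range.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical logging helper; the names are illustrative, not Cassandra's API. */
final class SplitLogging {
    private static final Logger LOG = Logger.getLogger(SplitLogging.class.getName());

    /** Builds one log line with the split's token bounds and the slice's boundary keys. */
    static String describe(String startToken, String endToken, List<String> sliceKeys) {
        String first = sliceKeys.isEmpty() ? "<none>" : sliceKeys.get(0);
        String last = sliceKeys.isEmpty() ? "<none>" : sliceKeys.get(sliceKeys.size() - 1);
        return String.format("split=(%s, %s] slice: %d keys, first=%s, last=%s",
                startToken, endToken, sliceKeys.size(), first, last);
    }

    static void logSlice(String startToken, String endToken, List<String> sliceKeys) {
        LOG.info(describe(startToken, endToken, sliceKeys));
    }

    public static void main(String[] args) {
        // Hypothetical token strings and row keys standing in for real split/slice data.
        logSlice("0", "1000", Arrays.asList("user:1", "user:7"));
        logSlice("0", "1000", Collections.emptyList());
    }
}
```

Comparing the logged keys' tokens against the split bounds is what would reveal a key that falls outside its own split.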


Dean

We solved the problem. (Dean, I answered this in the Cassandra user group, but I'll restate here what we did to fix it.)

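  • First, we upgraded Cassandra to version 1.2.3.
  • After the upgrade, a new exception appeared: "No hosts to borrow from". We found that the call ConnectionPoolConfigurationImpl(…).setConnectTimeout(-1) was causing it.
  • We changed it to .setConnectTimeout(2000).
  • We also increased other values in the Astyanax pool, and our application finally worked.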
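
One plausible reason -1 misbehaved: at the java.net.Socket level a negative timeout does not mean "no timeout"; it is an illegal argument (Socket treats 0 as an infinite timeout). Whether your Astyanax version hands the connect timeout straight to a Socket is an assumption we have not verified. If it does, every connection attempt would fail, which would fit a pool that ends up with no usable hosts. NegativeTimeoutDemo is a hypothetical name, and 127.0.0.1:9160 is only a placeholder: the timeout is validated before any connection is attempted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical demo: java.net.Socket rejects negative timeouts instead of disabling them. */
final class NegativeTimeoutDemo {

    /** Returns the exception Socket.connect throws for a negative connect timeout. */
    static IllegalArgumentException negativeConnectTimeout() {
        try (Socket socket = new Socket()) {
            // Placeholder address; the negative timeout is rejected before any network I/O.
            socket.connect(new InetSocketAddress("127.0.0.1", 9160), -1);
        } catch (IllegalArgumentException rejected) {
            return rejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new AssertionError("negative connect timeout was accepted");
    }

    /** Returns the exception Socket.setSoTimeout throws for a negative read timeout. */
    static IllegalArgumentException negativeReadTimeout() {
        try (Socket socket = new Socket()) {
            socket.setSoTimeout(-1); // SO_TIMEOUT; 0 would mean "wait forever"
        } catch (IllegalArgumentException rejected) {
            return rejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new AssertionError("negative read timeout was accepted");
    }

    public static void main(String[] args) {
        System.out.println(negativeConnectTimeout().getMessage());
        System.out.println(negativeReadTimeout().getMessage());
    }
}
```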

Basically, I think our original problem was that latency on Amazon was too high, so we changed the pool configuration and everything worked.

Thanks, everyone, for the help (mostly Dean).

Here is the actual pool configuration we use on Amazon:

ConnectionPoolConfigurationImpl conPool = new ConnectionPoolConfigurationImpl(getConecPoolName())
    .setMaxConnsPerHost(CONNECTION_POOL_SIZE_PER_HOST)
    .setSeeds(getIpSeeds())
    .setMaxOperationsPerConnection(10000)
    .setMaxPendingConnectionsPerHost(20)
    .setConnectionLimiterMaxPendingCount(20)
    .setTimeoutWindow(10000)
    .setConnectionLimiterWindowSize(2000)
    .setMaxTimeoutCount(3)
    .setConnectTimeout(2000) // must be positive; -1 led to "No hosts to borrow from"
    .setMaxFailoverCount(-1)
    .setLatencyAwareBadnessThreshold(20)
    .setLatencyAwareUpdateInterval(1000) // 10000
    .setLatencyAwareResetInterval(10000) // 60000
    .setLatencyAwareWindowSize(100) // 100
    .setLatencyAwareSentinelCompare(100f)
    .setSocketTimeout(30000)
    .setMaxTimeoutWhenExhausted(10000)
    .setInitConnsPerHost(10);

AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster(clusterName)
    .forKeyspace(keyspaceName)
    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
        .setDiscoveryType(NodeDiscoveryType.NONE)
        .setConnectionPoolType(ConnectionPoolType.ROUND_ROBIN)
        .setDiscoveryDelayInSeconds(10000))
    .withConnectionPoolConfiguration(conPool)
    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
    .buildKeyspace(ThriftFamilyFactory.getInstance());