C# Azure Kubernetes .NET Core application to Azure SQL Database: intermittent error 258
We are running a .NET Core 3.1 application in a Kubernetes cluster. The application uses EF Core 3.1.7 and Microsoft.Data.SqlClient 1.1.3 to connect to an Azure SQL Database. At seemingly random times, we receive the following error:
---> System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
at System.Data.SqlClient.SqlDataReader.get_MetaData()
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
at System.Data.SqlClient.SqlCommand.ExecuteScalar()
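Although it appears random, it definitely happens more often under heavier load. From my research, this particular timeout seems to be related to the connection timeout rather than the command timeout. That is, the client cannot establish a connection at all. This is not a query that is timing out.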
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
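We have eliminated the following potential root causes:
- Azure SQL Server capacity: The behavior is observed whether we run on 4 or 16 vCPUs. Azure support also confirmed that nothing looks wrong in the logs. That includes the number of open connections, which is only around 50. We also ran load tests from other connections, and the server performed well.
- Microsoft.Data.SqlClient version: We have been running version 1.1.3 all along, and this behavior only started a week ago (2021-03-16).
- Network capacity: At this stage we peak at about 1-2 MB/s, which is quite ordinary.
- Kubernetes scaling: The incidents do not correlate with the times when we scale out to more pods.
- Connection string issues: Our system used to work fine, but we nevertheless changed some settings mentioned in other posts to see whether that would resolve the problem. MARS is disabled. We cannot disable connection pooling. We have set TrustServerCertificate to true. Here is the current connection string:
Server=tcp:**.database.windows.net,1433;Initial Catalog=***;Persist Security Info=False;User ID=***;Password=***;MultipleActiveResultSets=False;Encrypt=True;Connection Timeout=60;TrustServerCertificate=True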
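As a rough illustration of what the settings in this connection string control, the sketch below rebuilds it with SqlConnectionStringBuilder. The server, database, and credential values are hypothetical placeholders, and it assumes the Microsoft.Data.SqlClient package is referenced. It also shows that the connection timeout and the command timeout are separate settings.

```csharp
using System;
using Microsoft.Data.SqlClient;

public static class ConnectionStringSketch
{
    public static void Main()
    {
        // Placeholder values (hypothetical); substitute your own server, database, and credentials.
        var builder = new SqlConnectionStringBuilder
        {
            DataSource = "tcp:example.database.windows.net,1433",
            InitialCatalog = "ExampleDb",
            PersistSecurityInfo = false,
            UserID = "example-user",
            Password = "example-password",
            MultipleActiveResultSets = false, // MARS disabled
            Encrypt = true,
            TrustServerCertificate = true,
            ConnectTimeout = 60               // seconds allowed for opening a connection
        };

        Console.WriteLine($"ConnectTimeout = {builder.ConnectTimeout}s");

        // Creating a command does not open the connection, so this runs without a server.
        using var connection = new SqlConnection(builder.ConnectionString);
        using var command = connection.CreateCommand();

        // CommandTimeout is configured independently of Connection Timeout.
        Console.WriteLine($"CommandTimeout = {command.CommandTimeout}s");
    }
}
```

Raising Connection Timeout in the connection string therefore does not, by itself, change how long an individual command is allowed to run.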
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
at Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
at Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransaction(IsolationLevel isolationLevel)
at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
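Our database health checker also receives the error when it calls Microsoft.EntityFrameworkCore.Infrastructure.DatabaseFacade.CanConnect().
The stack trace above is the problem we are trying to solve, as opposed to the SQL query timeout stack trace below: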
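One way to see which phase a check is actually stuck in is to time the connection open and the first command separately. The following is a minimal sketch using raw SqlClient rather than EF Core; SqlPhaseProbe and ProbeAsync are hypothetical names, and the Microsoft.Data.SqlClient package is assumed.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class SqlPhaseProbe
{
    // Hypothetical helper: reports how long the open phase and the query phase take.
    public static async Task ProbeAsync(string connectionString)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            using var connection = new SqlConnection(connectionString);

            // Phase 1: obtain a connection (bounded by Connection Timeout).
            await connection.OpenAsync();
            Console.WriteLine($"Open took {stopwatch.ElapsedMilliseconds} ms");

            // Phase 2: run a trivial command (bounded by CommandTimeout).
            stopwatch.Restart();
            using var command = connection.CreateCommand();
            command.CommandText = "SELECT 1";
            command.CommandTimeout = 30; // seconds
            await command.ExecuteScalarAsync();
            Console.WriteLine($"SELECT 1 took {stopwatch.ElapsedMilliseconds} ms");
        }
        catch (SqlException ex)
        {
            // SqlClient reports timeouts with Number = -2; the elapsed time shows which phase failed.
            Console.WriteLine(
                $"Failed after {stopwatch.ElapsedMilliseconds} ms (Number={ex.Number}): {ex.Message}");
        }
    }
}
```

With pooling enabled, the open phase may simply hand back an existing pooled connection, so a fast open does not prove that a new physical connection could be established at that moment.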
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
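"From my research, this particular timeout seems to be related to the connection timeout rather than the command timeout."
I don't think so. The call stack goes through System.Data.SqlClient.SqlCommand.ExecuteScalar(), so it is running a query after a successful connection.
This is a command timeout, caused by the client giving up on a long-running command. The default CommandTimeout is 30 seconds.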
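For reference, a minimal sketch of how the command timeout can be set when EF Core is in use. AppDbContext is a hypothetical context name, 180 seconds is only an example value, and the Microsoft.EntityFrameworkCore.SqlServer package is assumed.

```csharp
using Microsoft.EntityFrameworkCore;

// AppDbContext is a hypothetical context used only for illustration.
public class AppDbContext : DbContext
{
    private readonly string _connectionString;

    public AppDbContext(string connectionString) => _connectionString = connectionString;

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseSqlServer(
            _connectionString,
            // Timeout in seconds for commands that EF Core executes through this context.
            sql => sql.CommandTimeout(180));
    }
}
```

The same value can also be changed on an existing context instance with context.Database.SetCommandTimeout(180). A longer timeout only hides a slow or blocked command, so it is a diagnostic aid rather than a fix.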
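To troubleshoot why the command is taking so long, start with [...] and the related [...].
There is some noise on GitHub about this error, but I have not seen any evidence that anything other than a normal command timeout is going on. If you run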
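// Requires: using System; using Microsoft.Data.SqlClient;
// constr is your connection string.
using (var con = new SqlConnection(constr))
{
    con.Open();
    // Ask the server to wait for one hour, far longer than the default CommandTimeout.
    var sql = @"waitfor delay '01:00:00'";
    using (var cmd = con.CreateCommand())
    {
        //cmd.CommandTimeout = 0; // uncomment to disable the command timeout
        cmd.CommandText = sql;
        try
        {
            Console.WriteLine(DateTime.Now);
            cmd.ExecuteNonQuery(); // throws SqlException once CommandTimeout elapses
        }
        catch (Exception ex)
        {
            Console.WriteLine(DateTime.Now);
            Console.WriteLine(ex);
        }
    }
}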
you will get (with Microsoft.Data.SqlClient):
or, with System.Data.SqlClient (which you appear to be using), something slightly different:
The difference between
System.ComponentModel.Win32Exception (258): The wait operation timed out.
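and
System.ComponentModel.Win32Exception (258): Unknown error 258
is probably just a matter of which Win32Exception descriptions are available on Windows versus Linux.
The problem was an infrastructure issue at the time:
"There is a known issue within Azure networking where the DHCP lease is lost whenever a disk attach/detach happens on some VM groups. A fix is currently being rolled out to regions. I will check when an Azure status update will be published for this."
The problem has since disappeared, so it looks as though the fix has been rolled out globally.
For anyone else who runs into this issue in the future, you can identify it by establishing [...] (not the pod). Run
ls -al /var/log/
and identify all the syslog files, then run the following grep on each file: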
cat /var/log/syslog | grep 'carrier'
If the log contains any "Lost carrier" and "Gained carrier" messages, there is some kind of network problem. In our case, it was the DHCP lease.
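Could be some transient errors. Have you implemented a retry policy?
We tried enabling RetryOnFailure() in the EF SQL Server configuration. It made no noticeable difference.
I would push Azure support harder about their Azure SQL DBs.
SNI is the Server Name Indication part of the TLS channel handshake and establishment, which happens before the SQL login phase is reached. So it is getting a TCP connection to the server and starting TLS, but not completing it to reach the SQL login phase. If you are not using a private endpoint for SQL Azure, the SQL connections have to go through the AKS public LB. So I suggest checking the AKS public LB SNAT connections, just in case (you can check them from the Azure portal: AKS infra resource group > kubernetes LB > Metrics). More info here.
The error code may be the same, but if you look at the stack traces, the point of failure is different. In prod, this usually happens on DbContext creation, which I believe only runs a SELECT 1 to see whether the server is available. The db server would have to be very busy before the command timeout (currently 3m) is hit on that query. Also, Azure db support has checked, and the server has plenty of resources.
Is there a stack trace that shows this? And a command can time out without any shortage of resources. For example, a long-running query may be caused by blocking.
See the updated post, where I have highlighted the different stack traces.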
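For readers unfamiliar with the retry policy mentioned in these comments, the following is a minimal sketch of enabling EF Core's built-in retrying execution strategy for SQL Server. RetryConfig and Build are hypothetical names, the retry count and delay are arbitrary example values, and the Microsoft.EntityFrameworkCore.SqlServer package is assumed. As noted above, enabling it did not change the behavior in this case.

```csharp
using System;
using Microsoft.EntityFrameworkCore;

public static class RetryConfig
{
    // Hypothetical helper: builds context options with the retrying execution strategy enabled.
    public static DbContextOptions<TContext> Build<TContext>(string connectionString)
        where TContext : DbContext
    {
        return new DbContextOptionsBuilder<TContext>()
            .UseSqlServer(connectionString, sql => sql.EnableRetryOnFailure(
                maxRetryCount: 5,                         // example value
                maxRetryDelay: TimeSpan.FromSeconds(10),  // example value
                errorNumbersToAdd: null))                 // no extra SQL error numbers treated as transient
            .Options;
    }
}
```

A retry policy can mask short transient faults, but it does not explain why a command or handshake stalls, so it complements the diagnosis rather than replacing it.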