Clustering

Generar Cluster Log de los ultimos 60 minutos de actividad.

Get-ClusterLog -Node <SQLServerNodeName> -TimeSpan 60 -UseLocalTime

El archivo cluster.log se genera en C:\WINDOWS\CLUSTER\REPORT

Para un failover automatico exitoso todas las Bases de daoto del Availability Group deben estar sincronizadas

SELECT database_name, is_failover_ready FROM sys.dm_hadr_database_replica_cluster_states WHERE replica_id IN (SELECT replica_id FROM sys.dm_hadr_availability_replica_states)

Un valor igual a 0 de la columna is_failover_ready para alguna de las bases de datos del Availabilty Group indicala que no están sincronizadas y evita un failover automático.

Cambiar el valor de la propiedad SESSION_TIMEOUT.

ALTER AVAILABILITY GROUP AccountsAG MODIFY REPLICA ON 'INSTANCE09' WITH (SESSION_TIMEOUT = 15);

Database-level health detection – SQL Server Always On | Microsoft Learn

The HealthCheckTimeout setting is used to specify the length of time, in milliseconds, that the SQL Server resource DLL should wait for information returned by the sp_server_diagnostics stored procedure before reporting the Always On Failover Cluster Instance (FCI) as unresponsive. Changes that are made to the timeout settings are effective immediately and do not require a restart of the SQL Server resource.

Health-Check Timeout Threshold

WSFC resource DLL of the availability group performs a health check of the primary replica by calling the sp_server_diagnostics stored procedure on the instance of SQL Server that hosts the primary replica. sp_server_diagnostics returns results at an interval that equals 1/3 of the health-check timeout threshold for the availability group. The default health-check timeout threshold is 30 seconds, which causes sp_server_diagnostics to return at a 10-second interval. If sp_server_diagnostics is slow or is not returning information, the resource DLL will wait for the full interval of the health-check timeout threshold before determining that the primary replica is unresponsive. If the primary replica is unresponsive, an automatic failover is initiated, if currently supported.

SameSubnetDelay

Controls the delay, in milliseconds, between netft heartbeats. This property applies to routes on the same network subnet.

SameSubnetThreshold

Controls how many heartbeats can be missed on the same subnet before the route is declared as unreachable. This property applies to routes on the same network subnet.

Get-Cluster | Format-List -Property *
Get-Cluster | Format-List -Property Quarantine*

Session Timeout

 The session-timeout period is a replica property that controls how many seconds (in seconds) that an availability replica waits for a ping response from a connected replica before considering the connection to have failed. By default, a replica waits 10 seconds for a ping response. This replica property applies only the connection between a given secondary replica and the primary replica of the availability group
The session-timeout period is an availability-replica property that determines how long connection with another availability replica can remain inactive before the connection is closed. The primary and secondary replicas ping each other to signal that they’re still active. Receiving a ping from the other replica during the timeout period indicates that the connection is still open, and that the server instances are communicating. On receiving a ping, an availability replica resets its session-timeout counter on that connection.

The session-timeout period prevents either replica from waiting indefinitely to receive a ping from the other replica. If no ping is received from the other replica within the session-timeout period, the replica times out. Its connection is closed, and the timed-out replica enters the DISCONNECTED state. Even if a disconnected replica is configured for synchronous-commit mode, transactions don’t wait for that replica to reconnect and resynchronize.

The default session-timeout period for each availability replica is 10 seconds. This value is user-configurable, with a minimum of 5 seconds. Generally, we recommend that you keep the time-out period at 10 seconds or greater. Setting the value to less than 10 seconds creates the possibility of a heavily loaded system declaring a false failure.

What do the connection time-out errors indicate?

The default value is 10 seconds for the availability group replica setting, SESSION_TIMEOUT. This setting is configured for each replica. It determines how long the replica waits to receive a response from its partner replica before it reports a connection time-out. If a replica gets no response from the partner replica, it reports a connection time-out in the Microsoft SQL Server error log and the Windows Application log. The replica that reports the time-out immediately tries to reconnect, and will continue to try every five seconds.

If you want to view the witness quorum vote, you can use Get-Cluster cmdlet to check the WitnessDynamicWeight. In a cluster with 4-nodes, the dynamic witness weight will be 1 so there is an odd number of quorum votes for the cluster.

Get-ClusterNode | ft ID, NodeName, NodeWeight, DynamicWeight, State -AutoSize
Get-Cluster | ft Name, WitnessDynamicWeight -AutoSize

Recopilación avanzada de datos: recopilación de seguimiento de red durante el tiempo de espera de conexión

Si el diagnóstico anterior de la aplicación de SQL Server no produjo una causa principal, debe comprobar la red. El análisis correcto de la red requiere que recopile un seguimiento de red que abarque el tiempo de espera de la conexión.

El siguiente procedimiento inicia un seguimiento de red de Windows netsh en las réplicas en las que se notifican los tiempos de espera de conexión en los registros de errores de SQL Server. Se desencadena una tarea de eventos programados de Windows cuando se registra uno de los errores de conexión de SQL Server en el registro de aplicaciones. La tarea programada ejecuta un comando para detener el netsh seguimiento de red para que los datos de seguimiento de red clave no se sobrescriban. Estos pasos también asumen una ruta de acceso de *F:* para los registros de seguimiento y lotes. Ajuste esta ruta de acceso al entorno.

Inicie un seguimiento de red, como se muestra en el siguiente fragmento de código, en las dos réplicas en las que se producen los tiempos de espera de conexión:

netsh trace start capture=yes persistent=yes overwrite=yes maxsize=500 tracefile=f:\trace.etl

Cree tareas programadas de Windows que detengan el netsh seguimiento en los eventos 35206 o 35267. Puede crear estas tareas en una línea de comandos administrativa:

schtasks /Create /tn Event35206Task /tr F:\stoptrace.bat /SC ONEVENT /EC Application /MO *[System/EventID=35206] /f /RL HIGHEST 
schtasks /Create /tn Event35267Task /tr F:\stoptrace.bat /SC ONEVENT /EC Application /MO *[System/EventID=35267] /f /RL HIGHEST

Una vez que se produzca el evento y se detengan y capturen los seguimientos de red, puede eliminar las ONEVENT tareas:

PS C:\Users\sqladmin> Schtasks /Delete /tn Event35206Task /F 
PS C:\Users\sqladmin> Schtasks /Delete /tn Event35267Task /F

El análisis del seguimiento de red está fuera del ámbito de este solucionador de problemas. Si no puede interpretar el seguimiento de red, póngase en contacto con el equipo de soporte técnico de Microsoft SQL Server y proporcione el seguimiento junto con otros archivos de registro solicitados para el análisis de la causa principal.

Start-ClusterNode –ClearQuarantine

HADR configuration best practices – SQL Server on Azure VMs | Microsoft Learn