replica_lag_troubleshoot_checklist.txt (1389B)
How would you approach troubleshooting replication lag in a MariaDB setup? Detection: -> use pt-heartbeat to be able to measure replica lag, and set up automated alerts (e.g. email) when lag exceeds thresholds, so I can c orrelate lag with application (or user) behaviors in real time -> set up periodic checks and e-mail alerts for any locking queries, locking issues on master often cascade down to replicas Diagnosis: -> make sure different applications/scripts have dedicated users on the DB, reduce the number of "power users" to help more easily iden tify where replica issues originate from -> analyze audit logs around time of peak replica spikes, look for big transactions that could be chunked, look for queries holding loc ks on tables, check that no large DDL happens during peak times Mitigation: -> Help developers by writing Stored procs that chunk queries (e.g. deletes), or offer them help via scripts if the inputs are data fil es (chunk into LOAD DATA statements) -> Check with developers/business whether certain tables can be ignored from replication, e.g. some tables might be used to help aggreg ate data, but their contents are effectively useless -> Check if certain processes could be moved to different times, to avoid death by a 1000 cuts scenarios -> If resources allow, move dwh, analytical and ad-hoc user read queries to dedicated replica server