Back (Current repo: scraps)

random scraps and notes that are useful to me
To clone this repository:
git clone https://git.viktor1993.net/scraps.git
Log | Download | Files | Refs

commit 8b5f26362930b13d9332335e2410d6e3426678d8
parent de7d1f8a6238e92da165625ec74725ec0028f278
Author: root <root>
Date:   Thu, 29 May 2025 16:42:08 +0200

add some new notes and snippets

Diffstat:
Anotes/sql/binary_and_relay_logs.txt | 19+++++++++++++++++++
Anotes/sql/clustered_vs_nonclustered_index.txt | 35+++++++++++++++++++++++++++++++++++
Anotes/sql/cte_vs_subquery.txt | 5+++++
Anotes/sql/replica_lag_troubleshoot_checklist.txt | 28++++++++++++++++++++++++++++
Anotes/sql/seconds_behind_master.txt | 11+++++++++++
Anotes/sql/virtual_columns.txt | 5+++++
Asql/snippets/read_binlog.sh | 1+
Asql/snippets/update_ranks.sql | 11+++++++++++
8 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/notes/sql/binary_and_relay_logs.txt b/notes/sql/binary_and_relay_logs.txt @@ -0,0 +1,19 @@ +What are the key differences between MariaDB's binlog and relay log? + +The master writes all transactions into its binary log. This is usually a sequential and very lightweight activity. + +The replica server reads the transactions from masters binary log and writes them to its relay log. +Only after that the replica executes the statements from its relay log. + +From the POV of Master: + +Whenever a replica server connects to a master, the master creates a new thread for the connection (similar to any other server client) and then it does whatever the replica user asks. Most of that is going to be a request a feed of events from the master's binary log and to get notified about newly written events to the master's binary log. + +Replica servers that are up to date will mostly be reading events that are still cached in OS cache on the master server, so usually, there shouldn't be too much physical disk reads on the master server in order to feed binary log events to the replica.. The most common exception is when reconnecting a replica server that may have been down for a couple of hours for one reason or another as it will start with reading binary logs that were written hours ago, which the master may no longer have cached. + +From the POV of the Replica: + +The I/O thread process connects to the designated master, reads its binary log events as they come in and just copies them over to a local log file called relay log. This leads us to seeing a second process, the The second process, the SQL thread, which reads events from the relay log stored locally on the replication slave (the file that was written by I/O thread) and then applies them as fast as possible. + +Basically, the relay log is just a local file buffering the master's binlog contents. Separately, the SQL thread reads the relay log and executes the events to replay the master's changes. This two-step model (I/O thread + SQL thread) decouples network transfer from SQL execution, helping replicas survive short disconnects from the master server. + diff --git a/notes/sql/clustered_vs_nonclustered_index.txt b/notes/sql/clustered_vs_nonclustered_index.txt @@ -0,0 +1,35 @@ +What is the difference between clustered and non-clustered index? + + +# Clustered Index + +Definition: A clustered index determines the physical order of data in a table. The rows of the table are stored in the same order as the clustered index. + +Properties: + +-> Only one per table. Each table can only support one clustered index, since the rows of the table can only be stored in one order. + +-> Often it is the PK. When a Primary Key is defined, it's created as a clustered index. + +-> The data and the index are stored together. + +-> Help speed up queries that inspect a range of values e.g. BETWEEN, ORDER BY. + +-> Slows down INSERT/DELETE/UPDATE queries, since these operations can cause a re-ordering of rows. + + +# Non-Clustered Index + +Definition: A non-clustered index creates a structure that is separate from the table data. This structure contains pointers (row locators) to the data's physical location. + +Properties: + +-> You can have as many as you want. Since this is a separate structure, you can attach as many as you want. + +-> Does not affect the physical order of rows in the table. + +-> Less helpful for range queries or sorting than a clustered index. Still very helpful for speeding up WHERE/JOINs in queries. + +-> Takes up additional space to store the structure of the index. More non-clustered indexes = more space used. + +-> Lower impact on INSERT/DELETE/UPDATE queries. diff --git a/notes/sql/cte_vs_subquery.txt b/notes/sql/cte_vs_subquery.txt @@ -0,0 +1,5 @@ +# Difference between CTE and sub-query + +-> A CTE can be named, a sub-query can only be aliased. This means that a CTE can be referenced many times within a complex SQL query, while a sub-query would have to be re-executed each time. So a CTE can allow for better code reuse. + +-> Code readability. As CTE's go to the front of the Query, it can be a lot less mentally taxing to understand compared to trying to read sub-queries that have 4-5 layers of nestedness. diff --git a/notes/sql/replica_lag_troubleshoot_checklist.txt b/notes/sql/replica_lag_troubleshoot_checklist.txt @@ -0,0 +1,28 @@ +How would you approach troubleshooting replication lag in a MariaDB setup? + +Detection: + +-> use pt-heartbeat to be able to measure replica lag, and set up automated alerts (e.g. email) when lag exceeds thresholds, so I can c +orrelate lag with application (or user) behaviors in real time + +-> set up periodic checks and e-mail alerts for any locking queries, locking issues on master often cascade down to replicas + +Diagnosis: + +-> make sure different applications/scripts have dedicated users on the DB, reduce the number of "power users" to help more easily iden +tify where replica issues originate from + +-> analyze audit logs around time of peak replica spikes, look for big transactions that could be chunked, look for queries holding loc +ks on tables, check that no large DDL happens during peak times + +Mitigation: + +-> Help developers by writing Stored procs that chunk queries (e.g. deletes), or offer them help via scripts if the inputs are data fil +es (chunk into LOAD DATA statements) + +-> Check with developers/business whether certain tables can be ignored from replication, e.g. some tables might be used to help aggreg +ate data, but their contents are effectively useless + +-> Check if certain processes could be moved to different times, to avoid death by a 1000 cuts scenarios + +-> If resources allow, move dwh, analytical and ad-hoc user read queries to dedicated replica server diff --git a/notes/sql/seconds_behind_master.txt b/notes/sql/seconds_behind_master.txt @@ -0,0 +1,11 @@ +In a heavily-loaded MariaDB replication setup, you notice that Seconds_Behind_Master is fluctuating wildly even though your pt-heartbeat shows consistent lag. What could cause this discrepancy, and which tool is more trustworthy? + +Seconds_Behind_Master -> shows how "old" the event currently being executed by the SQL thread is, compared to the master's event timestamp. + + long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master); + + +If the SQL thread stalls or catches up in bursts (due to I/O lag, locking, or massive transactions), Seconds_Behind_Master can jump around dramatically. + +pt-heartbeat can give a more accurate measurement, because it inserts current timestamps into a small table, therefore it is measuring "how far back in real time the replica is", and not "how old the binlog event is." + diff --git a/notes/sql/virtual_columns.txt b/notes/sql/virtual_columns.txt @@ -0,0 +1,5 @@ +Can you explain the concept of virtual columns in MariaDB and provide a use case? + +Virtual columns in MariaDB are columns that are not physically stored in the table (unless explicitly set to STORED), but are instead computed on the fly based on an expression. + +A use case of a virtual column could be to allow indexing to happen in situations where the WHERE clause needs to do typechecking. For example, a query's WHERE clause might do LENGTH(column_value) on column1, in which case an index on column1 would not help. However, if we generate the column with "GENERATED ALWAYS AS LENGTH(column_value), the virtual column can be indexed to boost the speeds of slow selects. diff --git a/sql/snippets/read_binlog.sh b/sql/snippets/read_binlog.sh @@ -0,0 +1 @@ +mysqlbinlog --base64-output=decode-rows -v bin-mariadb.000001 diff --git a/sql/snippets/update_ranks.sql b/sql/snippets/update_ranks.sql @@ -0,0 +1,11 @@ +SET @ROW := 1; + +UPDATE some_table t +JOIN ( + SELECT id + FROM some_other_table + WHERE some_value IN (111,123) -- Optional filtering + ORDER BY id -- Important: define row order + LIMIT 10000 +) ordered ON t.id = ordered.id +SET t.rank = (@ROW := @ROW + 1);