《MySQL Technical Insider InnoDB Storage Engine (Second Edition)》 reading notes.
Buffer Pool#
For databases, the efficiency of querying and reading is crucial, and generally, databases store data on disk. However, the read and write speeds of disks are relatively slow and cannot meet the high-speed read and write requirements of databases. Therefore, storage engines typically use buffer pool technology to read a portion of the data from the disk into memory for caching, referred to as the buffer pool.
- For read requests, first check if the corresponding data is in the buffer pool. If it is, read it directly from memory and return it. If not, read it from the disk, place it in memory, and then return it.
- For write requests, first check if the corresponding data is in the buffer pool. If it is, modify the memory data and return. If not, read the corresponding data from the disk into memory, modify the memory data, and return. The storage engine will later flush the dirty data back to the disk at certain points in time.
The most important parameter of the buffer pool is its size, which directly affects database performance: if the buffer pool is too small, the cache hit rate drops and frequent disk reads are required, which is very inefficient. InnoDB uses buffer pool technology as well, and its size is configured via `innodb_buffer_pool_size`. The InnoDB buffer pool caches not only data but also indexes, undo pages, the change buffer, the adaptive hash index, and so on. This memory is organized in pages, whose size is configured via `innodb_page_size` and defaults to 16KB.
InnoDB supports multiple buffer pool instances, with each page assigned to an instance by hash. The number of instances is configured via `innodb_buffer_pool_instances`, which defaults to 1. Multiple buffer pool instances reduce contention on internal resources, such as locks, and improve concurrency.
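As a quick illustration, these settings can be inspected and, on recent MySQL versions, adjusted at runtime. A minimal sketch (the 2 GB value is only an example):

```sql
-- Inspect the current buffer pool sizing.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_instances';

-- From MySQL 5.7 onwards the pool can be resized online; older versions
-- require setting this in my.cnf and restarting.
SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;  -- 2 GB
```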
LRU List#
The task of the buffer pool is to serve requested data pages from memory. If a page can be read from memory, it is called a hit, and the response is very fast. If it cannot, it is a miss, and the page must be read from disk, which is much slower. The most important metric for observing buffer pool performance is the buffer hit rate.
To improve the hit rate, an LRU (Least Recently Used) algorithm is generally used. The algorithm maintains a list of buffered pages, placing the most recently accessed pages at the front of the list and the least recently used at the back. On a cache miss, a new page is read from disk, the last page in the list is evicted, and the newly read page is inserted into the list; where exactly it is inserted depends on the implementation.
InnoDB adopts this algorithm but has made some optimizations. In InnoDB's LRU list, newly read pages are placed at the midpoint of the list. The midpoint is a percentage: the portion of the list after it is called the old list, and the portion before it is called the new list, so the InnoDB LRU is effectively composed of these two sublists. The midpoint value is configured via `innodb_old_blocks_pct`, which defaults to 37, meaning that newly read pages are placed 37% from the tail of the list.
Why not place it at the front? If it were placed at the front, it would be treated as the most recently used page. However, certain SQL operations, such as index or table scans, access large amounts of data that may only be used once during the scan. If such pages were always placed at the front, they could push the truly hot pages out of the LRU list and lower the hit rate. On the other hand, a newly read page may genuinely be hot data; how can this be judged?
First, it is necessary to know that pages that hit in the new list will be promoted to the front of the new list. Pages that hit in the old list will also be promoted to the front of the old list.
Now, suppose the newly read data page is hot data. For a period of time this page will be accessed frequently, and each access promotes it to the head of the old list, so it will not be evicted. If, however, it is not hot but only temporary data, it will soon be pushed down and may even be evicted from the old list. Thus, a page that manages to survive in the old list long enough can be considered hot. InnoDB therefore adds another parameter, `innodb_old_blocks_time`, which specifies how long (in milliseconds) a newly read page must remain in the old list before it can be moved to the new list. Moving a page from the old list to the new list is called "page made young"; being evicted from the old list before surviving long enough is called "page not made young."
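A minimal sketch of inspecting these midpoint-related settings; the "made young" / "not young" counters appear in the buffer pool section of the engine status output:

```sql
SHOW VARIABLES LIKE 'innodb_old_blocks_pct';   -- default 37 (percent from the tail)
SHOW VARIABLES LIKE 'innodb_old_blocks_time';  -- milliseconds a page must survive in the old list

-- "Pages made young X, not young Y" is reported in the BUFFER POOL AND MEMORY section of:
SHOW ENGINE INNODB STATUS\G
```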
Unzip LRU#
In the buffer pool, some pages (from compressed tables) are smaller than 16KB. Allocating a full 16KB frame for such a page would be wasteful, so InnoDB adds an unzip_LRU list to manage these non-16KB compressed pages.
Now suppose we need a 2KB page; unzip_LRU will first look for a 2KB page:
- If not found, continue to look for a 4KB page. If found, split it into two 2KB pages to complete the allocation.
- If not found, continue to look for an 8KB page. If found, split it into one 4KB and two 2KB pages to complete the allocation.
- If not found, continue to look for a 16KB page, splitting it into one 8KB, one 4KB, and two 2KB pages to complete the allocation.
Free List#
The discussion of the LRU list above assumes that the LRU is already full. However, in reality, when the database starts, the LRU is empty. At this time, all pages in the buffer pool are placed in the Free list. Each time a new page is read from the disk, it first checks the Free list for any free pages. If there are free pages, it takes one out and places it in the LRU list. If not, it follows the above LRU discussion to discard a page.
Flush List#
The above two lists discuss read situations, but normal database requests also involve many write operations. For write operations, it still follows the LRU rules to read the data page from memory, then modify this data page. This modified data page is called a dirty page, but dirty pages are not removed from the LRU; instead, they are also added to another Flush list. This way, the data modification is completed in memory, and subsequent reads of this dirty page are still handled by the LRU, ensuring that the latest data can be read. Meanwhile, the storage engine will flush all the data in the Flush list back to the disk through a certain mechanism. This flushing is not immediate; otherwise, a large amount of disk writing would significantly affect the database's performance. Of course, if the database crashes, it may lead to dirty pages not being written back in time. In this case, the durability of disk data is guaranteed by the implementation of transactions.
Checkpoint Mechanism#
Before discussing checkpoint technology, let's first look at the problem of dirty pages in the Flush list that have not yet been flushed. In InnoDB's transaction implementation, before each transaction commits, a redo log recording the transaction's complete modifications to the data is written: it goes first into the in-memory redo log buffer and is then flushed to disk at a certain frequency. Only after that are the in-memory pages modified, so when the database crashes and restarts it can recover the data from the redo log. Ideally, the database would not even need to flush dirty pages back to disk at all; it could simply replay the redo log on restart. Of course, that assumes memory and the log are large enough, which is unrealistic, so dirty pages must still be flushed back to disk, and after a crash only the redo log covering pages that had not yet been flushed needs to be replayed.
In addition to data modifications, if a page evicted from the LRU old list is a dirty page, it also needs to be flushed back to disk first; otherwise the next read of that page from disk would return stale data. The size of the in-memory redo log buffer is configured via `innodb_log_buffer_size`. The on-disk redo log files are also reused circularly: a section of the log can only be overwritten once the dirty pages it covers have been flushed, so dirty pages must keep being flushed to guarantee that there is always reusable space in the redo log.
InnoDB uses the checkpoint mechanism to flush dirty pages back to disk. Checkpoints are internally divided into two types:
One is the Sharp Checkpoint, which flushes all dirty pages back to disk when the database shuts down normally. Here, normal shutdown means that `innodb_fast_shutdown` is set to its default value of 1, in which case InnoDB flushes the dirty pages in the buffer pool and skips the rest (such as the full purge and change buffer merge). If the value is 0, the database performs a full purge and change buffer merge before shutting down, which is called a slow shutdown. If it is set to 2, InnoDB only writes its logs to the disk log files and stops; the consistency of table data then has to wait for recovery at the next startup. The exact behavior of this parameter is described in the official documentation, and there is a related configuration, `innodb_force_recovery`, that controls InnoDB's recovery.
The other is the Fuzzy Checkpoint, which continuously flushes a portion of the dirty pages to disk when certain conditions are met. These conditions include:
- The main thread asynchronously flushes dirty pages from the buffer pool to the disk at regular intervals.
- If a data page is evicted from the LRU list and it is a dirty page, it needs to be flushed back to the disk.
- When the available space in the redo log is low, if it reaches 75% of the maximum capacity of the redo log file, it needs to execute asynchronous flushing of dirty pages to the disk; otherwise, insufficient redo log space will affect transaction commits. If it exceeds 90%, it needs to be flushed to the disk synchronously.
- If the proportion of dirty pages is too high, they also need to be flushed back to disk (see the sketch after this list). `innodb_max_dirty_pages_pct_lwm` specifies a low-water-mark percentage of dirty pages relative to the buffer pool size; once it is reached, pre-flushing begins (TODO: What is pre-flushing?). When the percentage reaches `innodb_max_dirty_pages_pct`, normal flushing takes place.
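A small sketch of inspecting the dirty-page thresholds and the current number of dirty pages:

```sql
-- Dirty-page thresholds that drive fuzzy checkpoints.
SHOW VARIABLES LIKE 'innodb_max_dirty_pages_pct%';       -- both pct and pct_lwm
-- How many dirty pages are currently waiting on the Flush list:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
```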
Viewing Buffer Pool Status#
The command below displays the running status of the InnoDB storage engine. Several key fields relate to the buffer pool and memory, and most of them can be inferred from their names. Note that Free plus LRU does not add up to the total buffer pool size, because some buffer pool pages are allocated for other purposes, such as the adaptive hash index or the change buffer.
```
> show engine innodb status;
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137363456 // Total memory allocated to InnoDB
Buffer pool size 8192 // Number of pages in the buffer pool; multiply by page_size for bytes
Free buffers 7146 // Number of pages in the Free list
Database pages 1042 // Number of pages in the LRU list
Old database pages 402 // Number of pages in the old sublist of the LRU
Buffer pool hit rate 1000 / 1000 // Cache hit rate
```
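The hit rate shown above can also be derived from server-wide status counters; a minimal sketch:

```sql
-- read_requests = logical reads; reads = reads that missed the buffer pool and went to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
-- hit rate ≈ 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests
```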
InnoDB Engine Features#
Introduction to MySQL and InnoDB#
MySQL is a commonly used relational database, and one of its distinctions from other databases is that it provides a pluggable storage engine. According to the official statement, this pluggable feature allows you to write your own storage engine and add it to a running MySQL instance without needing to recompile or restart.
InnoDB has been the default storage engine since MySQL 5.5.8. It is designed primarily for OLTP workloads and is characterized by row-level locking, foreign key support, and non-locking reads by default. It can store each table's data in a separate `.ibd` file. InnoDB achieves high concurrency through MVCC (Multi-Version Concurrency Control). It implements all four isolation levels, with REPEATABLE READ as the default.
InnoDB stores table data in primary key order. If no primary key is explicitly specified in the table definition, InnoDB automatically generates a six-byte ROWID as the primary key. If the data volume exceeds the range of this ROWID (about 281 trillion rows), the earliest data will be overwritten.
Change Buffer#
Typically, each table in a database will have a clustered index (which we commonly refer to as the primary key) and may also have multiple auxiliary indexes. Suppose we need to insert a new record into the table; at this point, the index tree needs to be updated. For the case of having only one clustered index, it is straightforward because clustered indexes are generally sequentially increasing. When adding a new row, it only needs to be written sequentially, adding the new index to the end of the index tree. However, if there are auxiliary indexes, it is necessary to find the position where the new index should be placed in the auxiliary index tree. If this index page is not in the buffer pool, random disk read and write operations will be required. For a large number of write requests, such multiple random disk I/O operations significantly affect performance.
Therefore, InnoDB introduced the change buffer. When performing insert, update, or delete operations on secondary indexes, if the index page is already in the buffer pool, it is modified there directly. If it is not, the change is first recorded in the change buffer; later, when some other operation, such as a read request for that index page, brings the page from disk into the buffer pool, the change buffer merges its recorded modifications into it. This avoids a random disk I/O for every insert/update/delete.
Additionally, if an index page stays out of the buffer pool for a while but accumulates multiple modifications, those modifications are merged together in the change buffer entry for that page, so only a single write back to disk is needed later.
That said, the change buffer has certain restrictions. The index must be a secondary index, and it must not be unique: if the index had to be unique, every insert/update would have to read the index pages anyway to check for duplicates, which would make the change buffer pointless.
The change buffer is divided into three types:
- Insert buffer indicates a buffer for an insert operation.
- Delete buffer indicates marking a record for deletion.
- Purge buffer indicates the actual deletion of a record.
In InnoDB, an update is carried out as a combination of delete and purge operations. The configuration `innodb_change_buffering` specifies which types of operations the change buffer will buffer; the default is all, meaning every type is buffered.
Internally, InnoDB's change buffer is itself a B+ tree. This tree also has a size limit, configured via `innodb_change_buffer_max_size`, which is the maximum percentage of the total buffer pool that the change buffer may occupy, with a default of 25. If the change buffer grows too large, InnoDB will also force index pages to be read in and merged to avoid running out of change buffer space.
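A minimal sketch of inspecting and adjusting the change buffer settings (the 'inserts' value is only an example):

```sql
SHOW VARIABLES LIKE 'innodb_change_buffering';        -- which operations are buffered, default: all
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';  -- max share of the buffer pool, default: 25
-- Restrict buffering to inserts only, for example:
SET GLOBAL innodb_change_buffering = 'inserts';
```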
Double Write#
During database operation, data in the buffer pool is continuously written back to disk to ensure durability, and this can fail partway. For example, if a 16KB data page is being written to disk and only 4KB has been written when the database crashes, the on-disk page is now only partially written and corrupted (a torn page). Even with a redo log this is a problem, because the redo log stores physical log records that cannot reconstruct the whole page on their own (the detailed reason is discussed in the redo log section below). Therefore, before applying the redo log, a backup of the page is needed: if the write fails, the on-disk page is first restored from the backup, and only then is the redo log applied for recovery.
InnoDB uses doublewrite to solve this problem. There is a 2MB region in memory called the doublewrite buffer, and a corresponding 2MB area in the shared tablespace on disk. When dirty pages are flushed from the buffer pool, each dirty page is first written to the doublewrite area of the shared tablespace and `fsync`'d, and only then written to its own table data file. This way, even if an error occurs while writing a data file, an intact copy of the dirty page can be retrieved from the shared tablespace. (TODO: Why is the write to the shared tablespace done in two chunks of 1MB each?)
The doublewrite feature can be disabled via `innodb_doublewrite`. It is not needed on file systems that already provide protection against partial page writes.
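A small sketch for checking whether doublewrite is enabled and how much it is being used (note that `innodb_doublewrite` is a startup option in most versions, so it is set in my.cnf rather than at runtime):

```sql
SHOW VARIABLES LIKE 'innodb_doublewrite';   -- ON by default
-- Doublewrite activity counters:
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';    -- Innodb_dblwr_pages_written, Innodb_dblwr_writes
```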
Asynchronous I/O#
InnoDB uses asynchronous I/O to flush data to the disk, which prevents thread blocking, and AIO can also perform I/O merging to reduce the number of disk I/Os. For more information about AIO under Linux, you can refer to the man documentation.
You can enable or disable native AIO via `innodb_use_native_aio`.
Flushing Neighboring Pages#
When InnoDB flushes a dirty page back to disk, it can check the other pages in the same extent; if any of them are also dirty, they are flushed together, taking advantage of I/O merging to reduce the number of I/O operations. This behavior is toggled via the `innodb_flush_neighbors` configuration.
InnoDB Worker Threads#
Having looked at some of the runtime states of InnoDB, let's now introduce how InnoDB works.
Master Thread#
The master thread is the main thread of InnoDB's operation, responsible for most of the work. Its main tasks include:
- Flushing dirty pages to the disk.
- Flushing the redo log buffer to disk, even for transactions that have not yet committed, which speeds up transaction commits.
- Merging change buffers.
- Recycling undo pages.
I/O Thread#
The I/O threads are primarily responsible for handling AIO request callbacks. You can change the number of read and write threads via `innodb_read_io_threads` and `innodb_write_io_threads`.
Purge Thread#
During transaction execution, undo logs are recorded so that data can be rolled back if the transaction fails. The purge thread's job is to reclaim undo pages that are no longer needed once the transaction has completed. The number of these threads can be changed via `innodb_purge_threads`.
Log Files#
The logs discussed here are actually MySQL logs and are not related to InnoDB. Since they are often needed for viewing and processing, they are included here.
Error Log#
The error log records errors and warnings during MySQL startup, operation, and shutdown. Its location can be found through the `log_error` variable.
Slow Query Log#
The slow query log records SQL statements whose execution time exceeds `long_query_time`, which defaults to 10 seconds. It is enabled with `slow_query_log` (off by default), and the log location is specified by the `slow_query_log_file` configuration.
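A minimal sketch of turning the slow query log on at runtime (the 1-second threshold is only an example, and `long_query_time` set globally applies to new connections):

```sql
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;             -- seconds; default is 10
SHOW VARIABLES LIKE 'slow_query_log_file';  -- where the log is written
```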
General Query Log#
The general query log records all database requests in `general_log_file`. It can be enabled or disabled via the `general_log` configuration. It is usually kept off, since leaving it on hurts performance; it is generally only turned on for debugging.
Binary Log#
The binary log records all changes made to the database (excluding statements like SELECT and SHOW). It can be used for backup recovery and master-slave replication. The log files are written under `datadir` as `bin_log.xxxx` files, and `bin_log.index` is the binary log index file, which records the names of the bin log files that exist.
During InnoDB transaction execution, the binary log is first written to a buffer, and once the transaction commits, the buffered bin log is written out to the bin log file on disk. The `sync_binlog` configuration controls after how many writes the file is synced to disk. The default is 1, meaning the bin log is synced after every commit. Setting this parameter higher risks losing bin log entries that were written but not yet synced when a crash happens, which can leave the slave out of sync with the master.
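A small sketch of inspecting the binary log; the file name in the last statement is only an example and depends on the configured log basename:

```sql
SHOW VARIABLES LIKE 'log_bin%';
SHOW VARIABLES LIKE 'sync_binlog';
SHOW BINARY LOGS;                               -- the files listed in the index file
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 5;  -- file name here is hypothetical
```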
Storage Structure#
InnoDB stores all of its data logically in tablespaces. The organizational units of a tablespace, from largest to smallest, are segment, extent, and page, with each level composed of multiple units of the level below.
Table Structure File#
In MySQL, regardless of the storage engine, every table has a `.frm` file that stores the table's structure; the definitions of user-created views are also stored in `.frm` files.
Tablespace File#
By default, InnoDB keeps all tables in a single shared tablespace. You can enable `innodb_file_per_table` to place each table in its own tablespace. In that case each table's tablespace stores only that table's data, indexes, and change buffer bitmap pages; other information, such as undo data and the doublewrite buffer, remains in the original shared tablespace. Per-table tablespace files use the `.ibd` suffix, and the default shared tablespace file is `ibdata1`.
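A minimal sketch; `t_demo` is a hypothetical table used only to show where the per-table file ends up:

```sql
SHOW VARIABLES LIKE 'innodb_file_per_table';
SET GLOBAL innodb_file_per_table = ON;           -- affects tables created afterwards
-- A table created now gets its own file under datadir/<schema>/t_demo.ibd
CREATE TABLE t_demo (id INT PRIMARY KEY) ENGINE = InnoDB;
```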
Tablespace files are composed of various segments, such as data segments, index segments, undo segments, etc. A segment consists of multiple extents, and an extent consists of multiple pages, with pages being the smallest unit within InnoDB. Pages come in various types, such as data pages, undo pages, transaction data pages, etc.
Row Record Format#
InnoDB stores data row by row within data pages, which are the leaf nodes of a B+ tree. If a column has a variable-length type such as TEXT, its data may be stored in overflow pages, with the original row keeping a pointer to the overflow location. Even for fixed-length columns, if a row is so large that only one row would fit in a page (a B+ tree page must hold at least two rows), InnoDB will automatically move the row data into overflow pages.
Data Partitioning#
Data partitioning is a feature supported by MySQL that allows a table or index data to be physically divided into multiple parts. Currently, MySQL only supports horizontal partitioning, which assigns different rows of a table to different physical files. Moreover, it is a local partition index, meaning that each physical file contains both data and indexes.
MySQL supports the following types of partitioning:
- Range partitioning, which divides data into specified continuous row ranges, with each range stored in a partition (1, 2, 3, 4) (5, 6, 7, 8).
- List partitioning, similar to range, but the partitioned data can be discrete, such as (1, 3, 5, 7) (2, 4, 6, 8).
- Hash partitioning, which partitions data based on the return value of a user-defined expression, such as hash(id % 1000).
- Key partitioning, which partitions data using the hash function provided by the database, requiring only the field to be hashed to be specified.
On the basis of partitioning, MySQL can further partition, known as sub-partitioning. MySQL allows hash or key partitioning on range and list partitions. For specific partitioning syntax, you can refer to the official documentation.
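A hedged sketch of a range-partitioned table with hash sub-partitions; the `orders` table and its columns are made up for illustration (note that the partitioning column must be part of every unique key, hence the composite primary key):

```sql
CREATE TABLE orders (
    id         INT NOT NULL,
    created_at DATE NOT NULL,
    PRIMARY KEY (id, created_at)            -- partition column included in the unique key
) ENGINE = InnoDB
PARTITION BY RANGE (YEAR(created_at))       -- horizontal partitioning by year
SUBPARTITION BY HASH (id) SUBPARTITIONS 2 (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```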
Indexes#
Indexes are the key technology to enhance database query performance. InnoDB supports various types of indexes: B+ tree indexes, full-text indexes, hash indexes, etc. Generally, developers use B+ tree indexes more frequently, so this discussion will focus only on B+ tree indexes, which also come in various types.
Clustered Index#
Every InnoDB data table has a primary key, and the clustered index is a B+ tree constructed according to the table's primary key. The leaf nodes of this tree are the data pages of the table, storing complete row data. Each data page is connected through a doubly linked list, maintaining the same order as the data stored in the table by primary key. All queries for table data, except for cache hits and some special cases, will look up the corresponding data pages through the clustered index tree.
Clustered indexes also exist on disk as physical data, but they are not physically contiguous on disk; rather, they are logically contiguous. For example, the last item of one index node points to the offset of the next index node in the file. Otherwise, maintaining the clustered index would be very costly.
Auxiliary Index#
Each auxiliary index is also a separate B+ tree, but its leaf nodes do not store complete row data; they store only the index key and the table's primary key. When querying rows through an auxiliary index, the primary key is obtained first, and then the complete row is looked up in the clustered index tree. There is one special case: if all the queried columns are contained in the auxiliary index itself, there is no need to go back to the clustered index tree; this is called a covering index.
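A small, hypothetical example of a covering index; `t_order` and its columns are invented for illustration, and EXPLAIN should report "Using index" for the query:

```sql
CREATE TABLE t_order (
    id      INT PRIMARY KEY,
    user_id INT,
    amount  DECIMAL(10, 2),
    KEY idx_user (user_id)
) ENGINE = InnoDB;

-- The primary key is stored in every idx_user leaf entry, so this query never
-- touches the clustered index tree.
EXPLAIN SELECT id FROM t_order WHERE user_id = 42;
```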
Composite Index#
A composite index refers to indexing multiple columns on a table. Its usage is similar to that of a single key auxiliary index. The arrangement order of the index is based on the order of the index declaration. For example, for a composite index (a, b), the data is arranged by first sorting a and then sorting b, such as (1, 1) (1, 2) (2, 1) (2, 2), and so on for more keys.
For queries on multiple key values, such as `where a = xx and b = xx`, a composite index can be used. Additionally, because the first key of a composite index is sorted, a single-key query such as `where a = xx` can also use the composite index.
Another use case is a query like `where a = x group by b limit n`. With a single-column auxiliary index, the result would still need to be sorted on b after the lookup. With a composite index, once the first field a is fixed to a constant, the second field b is already in sorted order, which reduces the sorting work.
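A hedged sketch of this case, assuming a hypothetical table `t` with columns a and b:

```sql
-- Hypothetical composite index on (a, b).
ALTER TABLE t ADD INDEX idx_a_b (a, b);
-- With a fixed, the b values inside idx_a_b are already ordered,
-- so no extra sort pass is needed for the grouping on b.
EXPLAIN SELECT a, b FROM t WHERE a = 1 GROUP BY b LIMIT 10;
```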
Adaptive Hash Index#
InnoDB uses B+ trees for its indexes, so the number of lookups per query is tied to the depth of the B+ tree; for example, a tree of depth 3 may require 3 page reads per lookup. If certain rows are accessed very frequently, going through 3 index lookups every time is somewhat wasteful. InnoDB therefore internally builds an adaptive hash index (AHI) over hot index pages to cache hot lookups and speed up queries.
For instance, if we repeatedly issue `select a from t where id = 1;`, the AHI will build a hash entry for the access pattern `where id = 1`. Subsequent identical requests no longer need to walk the B+ tree index and can obtain the data page directly from the adaptive hash index. Note that hash indexes can only serve equality lookups; they cannot cache range queries.
There are specific conditions for building a hash index, but since the AHI is an internal optimization of InnoDB, they will not be detailed here. Users can enable or disable the AHI via the `innodb_adaptive_hash_index` configuration.
Locks and Transactions#
Database access is inherently concurrent, and locks are necessary to ensure data consistency.
InnoDB has two types of locks:
- Latch: A lightweight lock held for a very short time, generally used to protect in-memory data structures. Latches come in two kinds: `mutex` and `rwlock`.
- Lock: This type of lock acts on transactions, locking database objects such as tables and rows. These locks are held much longer and are released only when the transaction commits or rolls back.
Here, we mainly focus on locks related to transactions. In transactions, InnoDB locks table data at the row level.
Row Locks and Intention Locks#
InnoDB supports two types of row locks:
- Shared Lock (S): Allows a transaction to read a row of data. Shared locks are compatible with other shared locks.
- Exclusive Lock (X): Allows a transaction to delete or update a row of data. Exclusive locks are incompatible with any other locks.
Suppose there is a transaction T1 that wants to acquire data from a certain row; it needs to obtain an S lock on this row. If another transaction T2 also wants to access this row and requires an S lock, the two S locks are compatible, so there is no issue. However, if a transaction T3 wants to modify this row, it will need an X lock on this row, but X locks are incompatible with S locks, so T3 will be blocked, waiting for T1 and T2 to release their S locks.
In addition to row locks, InnoDB also provides intention locks at the table level.
- If modifications are needed on certain rows, an intention exclusive lock (IX) must be obtained on the table before acquiring the row's X lock, indicating that it will request exclusive locks on certain rows.
- If reading certain rows, an intention shared lock (IS) must be obtained on the table before acquiring the row's S lock, indicating that it will request shared locks on certain rows.
Intention locks primarily express the type of locks that a transaction intends to acquire on rows in advance. They do not conflict with table-level locks and are mainly used for comparison and testing with other table locks. For example, if transaction T1 has obtained an X lock on a certain row of data, it must have also acquired the IX lock. If there is a transaction T2 that wants to update the entire table, T2 will need to acquire an X lock on the entire table. T2 needs to know whether any rows are currently held by an X lock. Scanning row by row would be too slow, so it can check the intention locks on the table to see that there is an IX lock, indicating that some rows are currently held by an X lock, and it will have to block until the IX lock is released.
The following compatibility matrix compares table-level locks only: IS, IX, S, and X here all refer to locks taken at the table level. For example, "IS conflicts with X" means the intention shared lock and the table-level exclusive lock are incompatible; it is not a comparison between intention locks and row-level locks.

|    | IS | IX | S | X |
|----|----|----|---|---|
| IS | Compatible | Compatible | Compatible | Conflicts |
| IX | Compatible | Compatible | Conflicts | Conflicts |
| S  | Compatible | Conflicts | Compatible | Conflicts |
| X  | Conflicts | Conflicts | Conflicts | Conflicts |
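To make this concrete, here is a hedged two-session sketch (assuming MySQL 8.0, where `performance_schema.data_locks` exposes the held locks, and a hypothetical table `t`):

```sql
-- Session 1: take row locks inside a transaction.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;   -- IX lock on table t plus an X record lock

-- Session 2: inspect the locks held.
SELECT object_name, lock_type, lock_mode
FROM performance_schema.data_locks;        -- shows a TABLE/IX row and a RECORD/X row
```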
Consistent Non-Locking Reads#
Consistent non-locking reads refer to InnoDB reading rows in the database through MVCC. If the current row is locked with an X lock, it can avoid blocking by reading a snapshot of the row. The read snapshot is implemented using the transaction's undo log. This method can greatly enhance the concurrency performance of the database.
Under the RC and RR isolation levels, InnoDB uses consistent non-locking reads. Under RC, the snapshot read is the latest committed version of the row at that moment, while under RR, the snapshot is the version as of the start of the transaction.
In certain cases, if users need to ensure data consistency, they can use locking methods for consistent locking reads. InnoDB supports two select locking modes:
- `select ... for update` applies an X lock on the rows read.
- `select ... lock in share mode` applies an S lock on the rows read.
These two statements must be used within a transaction, and the locks will be automatically released when the transaction commits or rolls back.
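A minimal sketch of both locking reads, using a hypothetical `account` table:

```sql
-- Exclusive locking read: other transactions cannot lock or modify the row until commit.
BEGIN;
SELECT balance FROM account WHERE id = 1 FOR UPDATE;
UPDATE account SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Shared locking read: other S locks are allowed, but X locks are blocked until commit.
BEGIN;
SELECT balance FROM account WHERE id = 1 LOCK IN SHARE MODE;
COMMIT;
```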
Auto-Increment Lock#
For auto-increment columns, a locking mechanism is needed to ensure data consistency under concurrent writes. Each auto-increment column has a counter in memory used to allocate new values for inserted rows.
InnoDB has a special auto-increment lock, AUTO-INC Locking, to maintain this value. Each time a transaction inserts rows, the auto-increment lock in effect executes `select max(auto_inc_col) from t for update` to obtain the next auto-increment value. The lock is released immediately after the value is obtained, without waiting for the transaction to end. However, this method still locks the table and is inefficient under high concurrency, because transactions must wait for one another to finish using the auto-increment lock. Additionally, if a transaction rolls back, the auto-increment values it consumed are discarded, leading to gaps in the auto-increment column.
InnoDB allows the locking strategy for auto-increment columns to be controlled via the value of `innodb_autoinc_lock_mode`:
- 0: Uses the auto-increment lock, which is inefficient.
- 1: The default. For simple inserts (statements whose row count is known in advance, such as ordinary INSERT and REPLACE), new values are allocated under a lightweight mutex on the in-memory counter rather than a table lock, which is faster. Bulk inserts (such as INSERT ... SELECT or LOAD DATA) still use the auto-increment table lock.
- 2: All inserts use a mutex lock, which is highly efficient but may lead to discontinuities in the auto-increment values.
For more detailed information on auto-increment columns, you can refer to the official documentation.
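A small sketch of checking the current mode; since the variable is not dynamic, changing it requires setting it at startup:

```sql
SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';
-- Not changeable at runtime; set it in my.cnf, e.g.:
--   [mysqld]
--   innodb_autoinc_lock_mode = 2
```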
Lock Algorithms#
There are three algorithms for row locks:
- Record Lock: Locks a single row. For example, when we want to modify a certain row of data.
- Gap Lock: Locks a range, excluding the record itself. For example, if the current table contains the values 1, 3, 5, and 7, and a transaction wants to add the value 4, it will lock the range (3, 5) to prevent other transactions from inserting data in between. If another value, 8, needs to be inserted, it will also need to add a range lock on (7, +∞). The main purpose of Gap Lock is to prevent multiple transactions from inserting into the same range. Setting the isolation level to RC can disable Gap Lock.
- Next-Key Lock: This is essentially a combination of the previous two locks, locking both the record itself and a range. This lock is primarily used to solve the phantom read problem.
When the query uses an index with a unique attribute, InnoDB downgrades the Next-Key Lock to a Record Lock. For example, for a statement that operates on the single row where `id = 7`, with id being a unique index, only that record needs to be locked; no range lock is required.
Phantom reads refer to the situation where executing the same SQL statement twice within the same transaction may yield different results, with the second execution possibly returning rows that did not exist before.
For instance, suppose transaction T1 first executes `select id from t where id > 10;` and no rows are returned. Meanwhile, another transaction T2 executes `insert into t (id) values (11);` and commits first. When T1 runs `select id from t where id > 10;` again, it suddenly finds data. This behavior violates transaction isolation, as one transaction can perceive the results of another.
With Next-Key Lock, when we execute `select id from t where id > 10;`, locks are placed on the range (10, +∞), preventing T2 from inserting values into that range. When T1 executes the same statement again, the result cannot have changed, because the whole range is locked and no transaction can modify the data within it.
InnoDB's default transaction isolation level, RR, uses Next-Key Lock to protect transactions and avoid phantom reads.
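A hedged two-session sketch of the range being protected; `t` is a hypothetical table with an index on id, and a locking read is used so that the next-key locks are actually taken:

```sql
-- Session 1 (REPEATABLE READ): the locking read takes next-key locks covering (10, +∞).
BEGIN;
SELECT id FROM t WHERE id > 10 FOR UPDATE;

-- Session 2: blocked until session 1 commits or rolls back.
INSERT INTO t (id) VALUES (11);
```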
Using Locks to Solve Transaction Isolation Issues#
- Dirty Read: This occurs when a transaction can read modifications made by another transaction that has not yet been committed. Dirty reads only happen under the RU isolation level. At least under the RC level, uncommitted data in a transaction will not be visible to other transactions.
- Non-Repeatable Reads/Phantom Reads: The solutions have already been discussed above and will not be repeated.
- Lost Updates: This refers to two transactions updating the same row, where the result of the first committed transaction is overwritten by the second. This is not inherently a database error but a possible outcome of concurrent transactions. One solution is to use the consistent locking reads described above (for example, reading the row with `select ... for update` before modifying it) so that the competing updates are serialized; running at the SERIALIZABLE isolation level achieves a similar effect.
Transaction ACID and Isolation Levels#
With robust lock protection, InnoDB implements transactions that fully comply with ACID:
- Atomicity: A transaction is an indivisible unit; all operations within the transaction must succeed for the transaction to be considered successful. If any operation in the transaction fails, all previously executed operations must be rolled back.
- Consistency: The database's constraints must not be violated before and after the transaction begins and ends. For example, unique constraints.
- Isolation: The read and write operations of each transaction are invisible to other transactions until the transaction is committed.
- Durability: Once a transaction is committed, the result is permanent, even in the event of a crash.
The SQL standard defines four isolation levels for transactions:
- READ UNCOMMITTED: Allows a transaction to read uncommitted modifications from other transactions, leading to dirty reads.
- READ COMMITTED: Allows a transaction to read only modifications that have been committed by other transactions, but non-repeatable reads and phantom reads can still occur.
- REPEATABLE READ: A transaction can read new records inserted by other transactions but cannot read modifications made by other transactions. This is the default isolation level used by InnoDB, and it uses Next-Key Lock to solve the phantom read problem.
- SERIALIZABLE: InnoDB implicitly converts every plain read in the transaction into a locking read (lock in share mode), so reads and writes block one another and transactions effectively execute serially.
Generally speaking, the lower the isolation level, the less lock protection a transaction has, and the shorter the duration of locks.
InnoDB Transaction's Redo Log and Undo Log#
Before a transaction commits, all of its redo log must be written to the redo log file. This ensures that even after a crash, recovery can be performed from the redo log, guaranteeing durability. To make sure the redo log actually reaches disk, each write of the redo log to the file is accompanied by an `fsync`. The flushing strategy can be adjusted via the `innodb_flush_log_at_trx_commit` configuration. The default is 1, meaning an `fsync` is executed on every commit to guarantee the write. With 0, the redo log is not written to the file at commit time at all (the master thread flushes it roughly once per second). With 2, the redo log is written to the file system cache at commit but `fsync` is not called, so data can be lost if a crash happens before the operating system flushes it to disk. The chosen strategy significantly affects transaction commit speed.
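A minimal sketch of inspecting and relaxing the flush strategy (value 2 trades durability for speed, as described above):

```sql
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- A crash may lose the last second or so of transactions with this setting:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```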
During transaction execution, undo logs are generated in addition to the redo log. When a transaction rolls back, these undo logs are applied to revert the transaction's modifications. Undo logs are logical logs: they do not physically restore pages to their state at the start of the transaction, since other concurrent transactions may have modified the same pages in the meantime; instead, the changes are undone logically, row by row.
Transaction Control Statements#
By default, the MySQL command line runs with autocommit enabled: each SQL statement is committed as soon as it executes. You can disable auto-commit with `SET AUTOCOMMIT=0`, or explicitly start a transaction with the BEGIN command. Here is a summary of the transaction control statements (a short example follows the list):
- BEGIN: Starts a transaction.
- COMMIT: Commits the transaction.
- ROLLBACK: Rolls back the transaction.
- SAVEPOINT id: Creates a savepoint named id inside the transaction.
- RELEASE SAVEPOINT id: Deletes savepoint id.
- ROLLBACK TO id: Rolls back to savepoint id; only the operations performed after this savepoint are undone, and the transaction remains open.
- SET TRANSACTION: Sets the isolation level for the transaction.
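A short end-to-end example of these statements, using a hypothetical table `t`:

```sql
SET AUTOCOMMIT = 0;
BEGIN;
INSERT INTO t (id) VALUES (1);
SAVEPOINT sp1;
INSERT INTO t (id) VALUES (2);
ROLLBACK TO sp1;     -- undoes only the insert made after the savepoint
COMMIT;              -- the row with id = 1 is persisted
```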
References and Citations#
- MySQL Technical Insider: InnoDB Storage Engine (Second Edition)
- MySQL InnoDB Official Documentation