/** The asynchronous i/o array slot structure */
typedef struct os_aio_slot_struct   os_aio_slot_t;

/** The asynchronous i/o array slot structure */
struct os_aio_slot_struct{
    ibool       is_read;        /*!< TRUE if a read operation */
    ulint       pos;            /*!< index of the slot in the aio array */
    ibool       reserved;       /*!< TRUE if this slot is reserved */
    time_t      reservation_time;/*!< time when reserved */
    ulint       len;            /*!< length of the block to read or write */
    byte*       buf;            /*!< buffer used in i/o */
    ulint       type;           /*!< OS_FILE_READ or OS_FILE_WRITE */
    ulint       offset;         /*!< 32 low bits of file offset in bytes */
    ulint       offset_high;    /*!< 32 high bits of file offset */
    os_file_t   file;           /*!< file where to read or write */
    const char* name;           /*!< file name or path */
    ibool       io_already_done;/*!< used only in simulated aio:
                                TRUE if the physical i/o already made
                                and only the slot message needs to be
                                passed to the caller of
                                os_aio_simulated_handle */
    fil_node_t* message1;       /*!< message which is given by the */
    void*       message2;       /*!< the requester of an aio operation
                                and which can be used to identify which
                                pending aio operation was completed */
#ifdef WIN_ASYNC_IO
    HANDLE      handle;         /*!< handle object we need in the
                                OVERLAPPED struct */
    OVERLAPPED  control;        /*!< Windows control block for the
                                aio request */
#elif defined(LINUX_NATIVE_AIO)
    struct iocb control;        /* Linux control block for aio */
    int         n_bytes;        /* bytes written/read. */
    int         ret;            /* AIO return code */
#endif
};

/** The asynchronous i/o array structure */
typedef struct os_aio_array_struct  os_aio_array_t;

/** The asynchronous i/o array structure */
struct os_aio_array_struct{
    os_mutex_t  mutex;          /*!< the mutex protecting the aio array */
    os_event_t  not_full;       /*!< The event which is set to the
                                signaled state when there is space in
                                the aio outside the ibuf segment */
    os_event_t  is_empty;       /*!< The event which is set to the
                                signaled state when there are no
                                pending i/os in this array */
    ulint       n_slots;        /*!< Total number of slots in the aio
                                array. This must be divisible by
                                n_threads.
                                (i.e. number of threads * slots allowed
                                per thread) */
    ulint       n_segments;     /*!< Number of segments in the aio
                                array of pending aio requests. A thread
                                can wait separately for any one of the
                                segments.
                                (i.e. the number of I/O threads; by
                                default 4 each for reads and writes) */
    ulint       cur_seg;        /*!< We reserve IO requests in round
                                robin fashion to different segments.
                                This points to the segment that is to
                                be used to service next IO request. */
    ulint       n_reserved;     /*!< Number of reserved slots in the
                                aio array outside the ibuf segment */
    os_aio_slot_t* slots;       /*!< Pointer to the slots in the array */
#ifdef __WIN__
    HANDLE*     handles;        /*!< Pointer to an array of OS native
                                event handles where we copied the
                                handles from slots, in the same order.
                                This can be used in
                                WaitForMultipleObjects; used only in
                                Windows */
#endif
#if defined(LINUX_NATIVE_AIO)
    io_context_t*    aio_ctx;   /* completion queue for IO. There is
                                one such queue per segment. Each thread
                                will work on one ctx exclusively. */
    struct io_event* aio_events;/* The array to collect completed IOs.
                                There is one such event for each
                                possible pending IO. The size of the
                                array is equal to n_slots. */
#endif
};

/** The aio arrays for non-ibuf i/o and ibuf i/o, as well as sync aio. These
are NULL when the module has not yet been initialized. @{ */
static os_aio_array_t*  os_aio_read_array  = NULL; /*!< Reads */
static os_aio_array_t*  os_aio_write_array = NULL; /*!< Writes
                                (for doublewrite, the backup copy is written
                                synchronously and the data files
                                asynchronously; in effect one thread issues
                                the writes while four threads wait for the
                                AIO to complete) */
static os_aio_array_t*  os_aio_ibuf_array  = NULL; /*!< Insert buffer */
static os_aio_array_t*  os_aio_log_array   = NULL; /*!< Redo log
                                (used at log checkpoint time) */
static os_aio_array_t*  os_aio_sync_array  = NULL; /*!< Synchronous I/O
                                (the log written at transaction commit) */
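To make the n_slots / n_segments relationship concrete, here is a small standalone sketch (not InnoDB code; the type and defaults are assumptions) of how the slots of one array are partitioned among its segments, one segment per io_handler_thread:

#include <stddef.h>

/* Hypothetical, simplified view of an aio array: n_slots must be divisible
   by n_segments, so each handler thread owns an equal contiguous range of
   slots. Defaults assumed: 4 read and 4 write segments. */
typedef struct {
    size_t n_slots;      /* total slots in the array           */
    size_t n_segments;   /* segments = handler threads waiting */
} aio_array_sketch;

/* Which slot indexes does local segment `seg` own? Returns [first, last). */
static void
segment_slot_range(const aio_array_sketch* arr, size_t seg,
                   size_t* first, size_t* last)
{
    size_t per_segment = arr->n_slots / arr->n_segments;

    *first = seg * per_segment;
    *last  = *first + per_segment;
}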
The waiting threads
io_handler_thread    the main function of these waiting threads; each thread serves one queue (segment), and by default reads and writes each get 4 threads for their array
fil_aio_wait         mainly waits for an I/O to complete
buf_page_io_complete processes the content of the page that was read and releases the lock held on the page
The process of reading a page
buf_read_page_low Sets the io_fix flag and sets an exclusive lock on the buffer frame. The flag is cleared and the x-lock released by an i/o-handler thread.
fil_io
os_aio
os_aio_array_reserve_slot finds a free slot in the aio array and calls io_prep_pread or io_prep_pwrite
os_aio_linux_dispatch calls io_submit; it is invoked once per page rather than batching several pages into one call
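To make the last two steps concrete, here is a minimal standalone sketch (not the InnoDB code) of submitting a single 16 KB page read through Linux native AIO, exactly one iocb per io_submit call; the file name and offset are made up, error handling is trimmed, and it links with -laio:

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE (16 * 1024)

int main(void)
{
    io_context_t    ctx = 0;
    struct iocb     cb;
    struct iocb*    cbs[1] = { &cb };
    struct io_event ev;
    void*           buf;

    if (io_setup(8, &ctx) != 0) return 1;               /* small request queue */
    if (posix_memalign(&buf, 512, PAGE_SIZE)) return 1; /* O_DIRECT alignment  */

    int fd = open("ibdata1", O_RDONLY | O_DIRECT);      /* hypothetical file   */
    if (fd < 0) return 1;

    io_prep_pread(&cb, fd, buf, PAGE_SIZE, 0);          /* one page, offset 0  */
    if (io_submit(ctx, 1, cbs) != 1) return 1;          /* one page per call   */

    io_getevents(ctx, 1, 1, &ev, NULL);                 /* handler-thread side */
    printf("read completed, res = %ld\n", (long) ev.res);

    io_destroy(ctx);
    return 0;
}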
read ahead
http://dev.mysql.com/doc/refman/5.5/en/innodb-performance-read_ahead.html
/* A read-ahead request is an I/O request to prefetch multiple pages in the buffer pool asynchronously, in anticipation that these pages will be needed soon. InnoDB uses or has used two read-ahead algorithms to improve I/O performance:

Linear read-ahead is based on the access pattern of the pages in the buffer pool, not just their number. You can control when InnoDB performs a read-ahead operation by adjusting the number of sequential page accesses required to trigger an asynchronous read request, using the configuration parameter innodb_read_ahead_threshold. Before this parameter was added, InnoDB would only calculate whether to issue an asynchronous prefetch request for the entire next extent when it read in the last page of the current extent.

Random read-ahead is a former technique that has now been removed as of MySQL 5.5. If a certain number of pages from the same extent (64 consecutive pages) were found in the buffer pool, InnoDB asynchronously issued a request to prefetch the remaining pages of the extent. Random read-ahead added unnecessary complexity to the InnoDB code and often resulted in performance degradation rather than improvement. This feature is no longer part of InnoDB, and users should generally see equivalent or improved performance.

If the number of pages read from an extent of 64 pages is greater or equal to innodb_read_ahead_threshold, InnoDB initiates an asynchronous read-ahead operation of the entire following extent. Thus, this parameter controls how sensitive InnoDB is to the pattern of page accesses within an extent in deciding whether to read the following extent asynchronously. The higher the value, the more strict the access pattern check. For example, if you set the value to 48, InnoDB triggers a linear read-ahead request only when 48 pages in the current extent have been accessed sequentially. If the value is 8, InnoDB would trigger an asynchronous read-ahead even if as few as 8 pages in the extent were accessed sequentially. The new configuration parameter innodb_read_ahead_threshold can be set to any value from 0-64. The default value is 56, meaning that an asynchronous read-ahead is performed only when 56 of the 64 pages in the extent are accessed sequentially. You can set the value of this parameter in the MySQL option file (my.cnf or my.ini), or change it dynamically with the SET GLOBAL command, which requires the SUPER privilege. */

/********************************************************************//**
Applies linear read-ahead if in the buf_pool the page is a border page of a linear read-ahead area and all the pages in the area have been accessed. Does not read any page if the read-ahead mechanism is not activated. Note that the algorithm looks at the 'natural' adjacent successor and predecessor of the page, which on the leaf level of a B-tree are the next and previous page in the chain of leaves. To know these, the page specified in (space, offset) must already be present in the buf_pool. Thus, the natural way to use this function is to call it when a page in the buf_pool is accessed the first time, calling this function just after it has been bufferfixed. */

/** The size in pages of the area which the read-ahead algorithms read if invoked */
#define BUF_READ_AHEAD_AREA(b)  ut_min(64, ut_2_power_up((b)->curr_size / 32))
/*
        IMPLEMENTATION OF THE BUFFER POOL
        =================================

Performance improvement:
------------------------
Thread scheduling in NT may be so slow that the OS wait mechanism should not be used even in waiting for disk reads to complete. Rather, we should put waiting query threads to the queue of waiting jobs, and let the OS thread do something useful while the i/o is processed. In this way we could remove most OS thread switches in an i/o-intensive benchmark like TPC-C.

A possibility is to put a user space thread library between the database and NT. User space thread libraries might be very fast.

SQL Server 7.0 can be configured to use 'fibers' which are lightweight threads in NT. These should be studied.

        Buffer frames and blocks
        ------------------------
Following the terminology of Gray and Reuter, we call the memory blocks where file pages are loaded buffer frames. For each buffer frame there is a control block, or shortly, a block, in the buffer control array. The control info which does not need to be stored in the file along with the file page, resides in the control block.

        Buffer pool struct
        ------------------
The buffer buf_pool contains a single mutex which protects all the control data structures of the buf_pool. The content of a buffer frame is protected by a separate read-write lock in its control block, though. These locks can be locked and unlocked without owning the buf_pool->mutex. The OS events in the buf_pool struct can be waited for without owning the buf_pool->mutex.

The buf_pool->mutex is a hot-spot in main memory, causing a lot of memory bus traffic on multiprocessor systems when processors alternately access the mutex. On our Pentium, the mutex is accessed maybe every 10 microseconds. We gave up the solution to have mutexes for each control block, for instance, because it seemed to be complicated.

A solution to reduce mutex contention of the buf_pool->mutex is to create a separate mutex for the page hash table. On Pentium, accessing the hash table takes 2 microseconds, about half of the total buf_pool->mutex hold time.

        Control blocks
        --------------
The control block contains, for instance, the bufferfix count which is incremented when a thread wants a file page to be fixed in a buffer frame. The bufferfix operation does not lock the contents of the frame, however. For this purpose, the control block contains a read-write lock.

The buffer frames have to be aligned so that the start memory address of a frame is divisible by the universal page size, which is a power of two.

We intend to make the buffer buf_pool size on-line reconfigurable, that is, the buf_pool size can be changed without closing the database. Then the database administrator may adjust it to be bigger at night, for example. The control block array must contain enough control blocks for the maximum buffer buf_pool size which is used in the particular database. If the buf_pool size is cut, we exploit the virtual memory mechanism of the OS, and just refrain from using frames at high addresses. Then the OS can swap them to disk.

The control blocks containing file pages are put to a hash table according to the file address of the page. We could speed up the access to an individual page by using "pointer swizzling": we could replace the page references on non-leaf index pages by direct pointers to the page, if it exists in the buf_pool.
We could make a separate hash table where we could chain all the page references in non-leaf pages residing in the buf_pool, using the page reference as the hash key, and at the time of reading of a page update the pointers accordingly. Drawbacks of this solution are added complexity and, possibly, extra space required on non-leaf pages for memory pointers. A simpler solution is just to speed up the hash table mechanism in the database, using tables whose size is a power of 2.

        Lists of blocks
        ---------------
There are several lists of control blocks.

The free list (buf_pool->free) contains blocks which are currently not used.

The common LRU list contains all the blocks holding a file page except those for which the bufferfix count is non-zero. The pages are in the LRU list roughly in the order of the last access to the page, so that the oldest pages are at the end of the list. We also keep a pointer to near the end of the LRU list, which we can use when we want to artificially age a page in the buf_pool. This is used if we know that some page is not needed again for some time: we insert the block right after the pointer, causing it to be replaced sooner than would normally be the case. Currently this aging mechanism is used for read-ahead mechanism of pages, and it can also be used when there is a scan of a full table which cannot fit in the memory. Putting the pages near the end of the LRU list, we make sure that most of the buf_pool stays in the main memory, undisturbed.

The unzip_LRU list contains a subset of the common LRU list. The blocks on the unzip_LRU list hold a compressed file page and the corresponding uncompressed page frame. A block is in unzip_LRU if and only if the predicate buf_page_belongs_to_unzip_LRU(&block->page) holds. The blocks in unzip_LRU will be in same order as they are in the common LRU list. That is, each manipulation of the common LRU list will result in the same manipulation of the unzip_LRU list.

The chain of modified blocks (buf_pool->flush_list) contains the blocks holding file pages that have been modified in the memory but not written to disk yet. The block with the oldest modification which has not yet been written to disk is at the end of the chain. The access to this list is protected by buf_pool->flush_list_mutex.

The chain of unmodified compressed blocks (buf_pool->zip_clean) contains the control blocks (buf_page_t) of those compressed pages that are not in buf_pool->flush_list and for which no uncompressed page has been allocated in the buffer pool. The control blocks for uncompressed pages are accessible via buf_block_t objects that are reachable via buf_pool->chunks[].

The chains of free memory blocks (buf_pool->zip_free[]) are used by the buddy allocator (buf0buddy.c) to keep track of currently unused memory blocks of size sizeof(buf_page_t)..UNIV_PAGE_SIZE / 2. These blocks are inside the UNIV_PAGE_SIZE-sized memory blocks of type BUF_BLOCK_MEMORY that the buddy allocator requests from the buffer pool. The buddy allocator is solely used for allocating control blocks for compressed pages (buf_page_t) and compressed page frames.

        Loading a file page
        -------------------
First, a victim block for replacement has to be found in the buf_pool. It is taken from the free list or searched for from the end of the LRU-list. An exclusive lock is reserved for the frame, the io_fix field is set in the block fixing the block in buf_pool, and the io-operation for loading the page is queued.
The io-handler thread releases the X-lock on the frame and resets the io_fix field when the io operation completes.

A thread may request the above operation using the function buf_page_get(). It may then continue to request a lock on the frame. The lock is granted when the io-handler releases the x-lock.

        Read-ahead
        ----------
The read-ahead mechanism is intended to be intelligent and isolated from the semantically higher levels of the database index management. From the higher level we only need the information if a file page has a natural successor or predecessor page. On the leaf level of a B-tree index, these are the next and previous pages in the natural order of the pages.

Let us first explain the read-ahead mechanism when the leafs of a B-tree are scanned in an ascending or descending order. When a read page is the first time referenced in the buf_pool, the buffer manager checks if it is at the border of a so-called linear read-ahead area. The tablespace is divided into these areas of size 64 blocks, for example. So if the page is at the border of such an area, the read-ahead mechanism checks if all the other blocks in the area have been accessed in an ascending or descending order. If this is the case, the system looks at the natural successor or predecessor of the page, checks if that is at the border of another area, and in this case issues read-requests for all the pages in that area. Maybe we could relax the condition that all the pages in the area have to be accessed: if data is deleted from a table, there may appear holes of unused pages in the area.

A different read-ahead mechanism is used when there appears to be a random access pattern to a file. If a new page is referenced in the buf_pool, and several pages of its random access area (for instance, 32 consecutive pages in a tablespace) have recently been referenced, we may predict that the whole area may be needed in the near future, and issue the read requests for the whole area.
*/
It appears MySQL has dropped random read-ahead (reading in the pages adjacent to the page being read). For linear read-ahead to run, three preconditions must hold:
1. The page currently being read is a border page of its area and is being read into memory for the first time; the check is made after the wait for this page's read to complete.
2. Within the current area, at least a threshold number of pages have already been read in; the default threshold is 56, so at least 56 adjacent pages must have been accessed before the next area is read ahead. The read-ahead itself loops 64 times, reading in one page per iteration.
3. The area containing this page and the area to be read next are physically adjacent, and the caller does not wait for the reads of that next area to return.
For example:
In the areas 0-63 and 64-127, when page 63 is read, the 64 pages of the next area are read in (see the sketch below).
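A minimal standalone sketch of the border-page/threshold check described above (simplified; not the actual buf_read_ahead_linear() code, which also handles descending scans, buffer-fix state and access order):

#include <stdbool.h>

#define READ_AHEAD_AREA 64              /* pages per linear read-ahead area */

/* page_no:           the page just requested
   accessed_in_area:  pages of its 64-page area accessed in order so far
   threshold:         innodb_read_ahead_threshold, default 56              */
static bool
should_linear_read_ahead(unsigned page_no,
                         unsigned accessed_in_area,
                         unsigned threshold)
{
    unsigned low  = page_no - page_no % READ_AHEAD_AREA; /* first page of area */
    unsigned high = low + READ_AHEAD_AREA - 1;           /* last page of area  */

    if (page_no != low && page_no != high)  /* 1) must be a border page */
        return false;

    if (accessed_in_area < threshold)       /* 2) threshold of pages accessed */
        return false;

    /* 3) the caller now issues asynchronous reads for the 64 pages of the
          physically adjacent next area, one page at a time, without waiting. */
    return true;
}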
Differences between buffered and direct (unbuffered) I/O on Windows: direct I/O corresponds to the parameter value async_unbuffered, buffered I/O to the value normal. The analysis below focuses on the differences in open and write.
1. Direct (unbuffered) I/O on Windows
The MySQL manual states that on Windows only async_unbuffered can be used and that it cannot be changed; this setting uses asynchronous I/O that bypasses the OS cache. Both the data files and the log go through Windows native AIO, except that after a log write the thread waits for the asynchronous write to return.
Open:
CreateFile((LPCTSTR) name,
GENERIC_READ | GENERIC_WRITE,
share_mode,
NULL,
create_flag,
FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
NULL);
Write: the same for data files and log files
WriteFile(file, buf, (DWORD)n, &len, &(slot->control));
After the asynchronous log write has been issued, a WaitForSingleObject call waits for the log write operation to return.
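A self-contained sketch of that pattern (not InnoDB code; the file name is hypothetical, and with FILE_FLAG_NO_BUFFERING the buffer, length and offset must be sector-aligned): open with FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, issue WriteFile with an OVERLAPPED block, then wait for the result the way the log path does.

#include <windows.h>

static BOOL write_page_unbuffered(const void* buf, DWORD len, ULONGLONG offset)
{
    HANDLE file = CreateFileA("ib_logfile0",            /* hypothetical name */
                              GENERIC_READ | GENERIC_WRITE,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
                              NULL);
    if (file == INVALID_HANDLE_VALUE) return FALSE;

    OVERLAPPED ov = {0};
    ov.Offset     = (DWORD)(offset & 0xFFFFFFFF);
    ov.OffsetHigh = (DWORD)(offset >> 32);
    ov.hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);

    DWORD written = 0;
    BOOL ok = WriteFile(file, buf, len, &written, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        WaitForSingleObject(ov.hEvent, INFINITE);        /* the extra wait */
        ok = GetOverlappedResult(file, &ov, &written, FALSE);
    }

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return ok && written == len;
}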
2. Buffered I/O on Windows
Inspecting the source shows that Windows also supports the value normal. When this value is set, buffered I/O is used: none of the data structures or threads of simulated AIO or Windows native AIO are involved, and reads and writes of data files and log files simply call the system functions directly.
Open: the same for data files and log files
CreateFile((LPCTSTR) name,
GENERIC_READ | GENERIC_WRITE,
share_mode,
NULL,
create_flag,
0,
NULL);
Write: the same for data files and log files
WriteFile(file, buf, (DWORD) n, &len, NULL);
The following analyzes the differences between buffered and direct I/O on Linux: direct I/O corresponds to the parameter value O_DIRECT, buffered I/O to the values O_DSYNC and fdatasync.
1. Direct I/O on Linux
This setting uses asynchronous I/O that bypasses the OS cache for the data files. Data files go through Linux native AIO; log files go through buffered I/O.
Open:
For data files: open(name, O_RDWR | O_DIRECT, os_innodb_umask);
For log files: open(name, O_RDWR, os_innodb_umask);
Write:
For data files: goes through Linux native AIO
For log files: pwrite(file, buf, (ssize_t)n, offs);
2. Buffered I/O on Linux with O_DSYNC
Both data files and log files go through buffered I/O; the log file is opened with one extra flag.
Open:
For data files: open(name, O_RDWR, os_innodb_umask);
For log files: open(name, O_RDWR | O_SYNC, os_innodb_umask);
Write: the same for data files and log files
pwrite(file, buf, (ssize_t)n, offs);
3. Buffered I/O on Linux with fdatasync
Both data files and log files go through buffered I/O.
Open: the same for data files and log files
open(name, O_RDWR, os_innodb_umask);
Write: the same for data files and log files
pwrite(file, buf, (ssize_t)n, offs);
From the above we can see that, for direct I/O, the log is implemented differently on Windows and Linux: on Windows the log uses direct asynchronous I/O plus a wait step, while on Linux the log uses buffered I/O.
Implementation of buffered versus direct I/O:
On Windows, direct I/O goes through Windows native AIO; buffered I/O uses none of the threads or data structures of simulated AIO or Windows native AIO and calls the read/write functions directly.
On Linux, direct I/O goes through Linux native AIO, while buffered I/O goes through ordinary buffered reads and writes.
After the write, fdatasync is also called to make sure the file data is flushed to disk; once fdatasync returns successfully, the data can be considered written to disk. Other flush functions of this kind are fsync and sync. sync merely queues the file's data in the OS cache for writing and does not confirm whether it actually reached the disk, so sync is not reliable. fsync acts only on the single file named by the file descriptor filedes and waits for the disk write to finish before returning; it is suitable for applications such as databases that must make sure modified blocks are written to disk immediately. fdatasync is similar to fsync but affects only the data portion of the file, whereas fsync also synchronously updates the file's attributes. Note in particular that the current glibc implementation of fdatasync is identical to fsync.
Ignoring the file-open step, "writing a file" is usually said to have two phases: the call to write, which we call the data-writing phase (in fact influenced by the flags passed to open), and the call to fsync (or fdatasync), which we call the flush phase. For the log file, flushing is mainly governed by the system parameter below.
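A minimal self-contained sketch of those two phases (the path, buffer and offset are whatever the caller supplies; not InnoDB code): pwrite does the write, fdatasync forces the data to disk.

#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Write `len` bytes at `offset`, then make sure the data reaches the disk:
   pwrite() is the write phase, fdatasync() the flush phase. */
static int write_and_flush(const char* path, const void* buf, size_t len,
                           off_t offset)
{
    int fd = open(path, O_RDWR);          /* buffered open, as with fdatasync */
    if (fd < 0) return -1;

    if (pwrite(fd, buf, len, offset) != (ssize_t) len) { close(fd); return -1; }
    if (fdatasync(fd) != 0)                            { close(fd); return -1; }

    close(fd);
    return 0;
}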
On Linux the parameter innodb_flush_method can be set to fdatasync, O_DSYNC or O_DIRECT. The following shows how these three values affect MySQL's operations on the log and data files (a code sketch of this mapping follows the table):
fdatasync:
    Open log:       O_RDWR
    Flush log:      fsync()
    Open datafile:  O_RDWR
    Flush data:     fsync()
O_DSYNC:
    Open log:       O_RDWR | O_SYNC
    Flush log:      (none)
    Open datafile:  O_RDWR
    Flush data:     fsync()
O_DIRECT:
    Open log:       O_RDWR
    Flush log:      fsync()
    Open datafile:  O_RDWR | O_DIRECT
    Flush data:     fsync()
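A sketch of how the Linux half of this table could be expressed in code (an illustration only, not the actual os_file_create() logic; the enum names are made up):

#define _GNU_SOURCE
#include <fcntl.h>

enum flush_method { FLUSH_FDATASYNC, FLUSH_O_DSYNC, FLUSH_O_DIRECT };
enum file_kind    { FILE_LOG, FILE_DATA };

/* Pick the open() flags the table above implies for a given flush method and
   file kind; anything not opened with O_SYNC is flushed later with fsync(). */
static int open_flags(enum flush_method m, enum file_kind k)
{
    int flags = O_RDWR;

    if (m == FLUSH_O_DSYNC && k == FILE_LOG)
        flags |= O_SYNC;      /* log written through the cache, synced by the OS */

    if (m == FLUSH_O_DIRECT && k == FILE_DATA)
        flags |= O_DIRECT;    /* data files bypass the OS cache */

    return flags;
}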
On Windows the parameter innodb_flush_method can be set to normal or async_unbuffered. The following shows how these two values affect MySQL's operations on the log and data files:
normal:
    Open log:       (default flags)
    Flush log:      FlushFileBuffers()
    Open datafile:  (default flags)
    Flush data:     FlushFileBuffers()
async_unbuffered:
    Open log:       FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING
    Flush log:      FlushFileBuffers()
    Open datafile:  FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING
    Flush data:     FlushFileBuffers()
For the MySQL redo log, the flush timing is further determined by the parameter innodb_flush_log_at_trx_commit (a sketch of the commit-time behaviour follows the list):
1 (default): at transaction commit, write the log to disk and then flush it.
0: every second, write the log to disk and then flush it; at transaction commit, do nothing.
2: at transaction commit, write the log to disk but do not flush it; a flush is performed every second.
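A sketch of the commit-time decision these three settings imply (simplified; log_write_redo() and log_flush_redo() are hypothetical stand-ins for the actual write/fsync of the log, and the once-per-second cases run from the master thread):

#include <stdio.h>

/* Hypothetical stand-ins for the real write and flush of the redo log. */
static void log_write_redo(void) { puts("write() redo log"); }
static void log_flush_redo(void) { puts("fsync()/fdatasync() redo log"); }

static void on_trx_commit(int flush_log_at_trx_commit)
{
    switch (flush_log_at_trx_commit) {
    case 1:                     /* default: write and flush at every commit   */
        log_write_redo();
        log_flush_redo();
        break;
    case 2:                     /* write now, leave flushing to the 1 s task  */
        log_write_redo();
        break;
    case 0:                     /* nothing here; the 1 s task writes + flushes */
    default:
        break;
    }
}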
The default value 1 guarantees the database's ACID properties; the other values can improve database performance at the cost of that guarantee.
Beyond what the documentation describes, inspecting the source shows that for the log, after a simulated AIO or native AIO write finishes, the write-waiter thread also calls flush each time a write operation completes; see the code below.
	if (srv_unix_file_flush_method != SRV_UNIX_O_DSYNC
	    && srv_unix_file_flush_method != SRV_UNIX_NOSYNC
	    && srv_flush_log_at_trx_commit != 2) {

		fil_flush(group->space_id);
	}
On Linux, synchronous versus asynchronous I/O has no direct relationship to whether the file is accessed through the OS cache; an open file supports both buffered and direct reads and writes.
For the log files, writes at transaction start and commit use buffered synchronous I/O; at checkpoint time, buffered asynchronous I/O is used.
For the data files, the doublewrite buffer is written with buffered synchronous I/O, while ordinary table writes use unbuffered asynchronous I/O.
On Windows, the same file cannot be opened for both buffered and unbuffered access, so with the default configuration (which depends on the OS version) both data files and log files use asynchronous, unbuffered I/O; after a log write there is a wait step that blocks until the write operation returns.
Regarding flush:
On Linux, the log file uses buffered synchronous writes, so a flush happens at transaction commit and after each write operation completes. For the data files, the flush is done in the master thread every 10 seconds when dirty pages are written to disk, with the doublewrite buffer written first.
On Windows, asynchronous unbuffered writes are used, but a flush (FlushFileBuffers) is still performed.