2009-04 InnoDB Internals: InnoDB File Formats and Source Code Structure (MySQL Conference 2009)

1. Transactional Storage for MySQL FAST. RELIABLE. PROVEN. InnoDB Internals: InnoDB File Formats and Source Code Structure MySQL Conference, April 2009 Heikki Tuuri CEO Innobase Oy Vice President, Development Oracle Corporation Calvin Sun Principal Engineer Oracle Corporation
2. Today’s Topics • • • • • • • Goals of InnoDB Key Functional Characteristics InnoDB Design Considerations InnoDB Architecture InnoDB On Disk Format Source Code Structure Q&A
3. Goals of InnoDB • • • • OLTP oriented Performance, Reliability, Scalability Data Protection Portability
4. InnoDB Key Functional Characteristics • • • • • Full transaction support Row-level locking MVCC Crash recovery Efficient IO
5. Design Considerations • Modeled on Gray & Reuter’s “Transactions Processing: Concepts & Techniques” • Also emulated the Oracle architecture • Added unique subsystems • Doublewrite • Insert buffering • Adaptive hash index • Designed to evolve with changing hardware & requirements
6. InnoDB Architecture Server Applications Handler API Embedded InnoDB API Transaction Cursor / Row Minitransaction B-tree Page Buffer File Space Manager IO Lock
7. InnoDB On Disk Format • • • • • • • InnoDB Database Files InnoDB Tablespaces InnoDB Pages / Extents InnoDB Rows InnoDB Indexes InnoDB Logs File Format Design Considerations
8. InnoDB Database Files System tablespace MySQL Data Directory InnoDB tables internal data dictionary insert buffer undo logs ibdata files .frm files OR innodb_file_per_table .ibd files
9. InnoDB Tablespaces • A tablespace consists of multiple files and/or raw disk partitions. file_name:file_size[:autoextend[:max:max_file_size]] • A file/partition is a collection of segments. • A segment consists of fixed-length pages. • The page size is always 16KB in uncompressed tablespaces, and 1KB-16KB in compressed tablespaces (for both data and index).
10. System Tablespace • • • • • Internal Data Dictionary Undo Insert Buffer Doublewrite Buffer MySQL Replication Info
11. InnoDB Tablespaces Tablespace Segment Extent Extent Extent Extent Leaf node segment Non-leaf node segment Extent Rollback segment Page Row Row Trx id Row Row Row Row Roll pointer Field pointers Field 1 Field 2 Row Row Field n an extent = 64 pages
12. InnoDB Pages InnoDB Page Types Symbol Value FIL_PAGE_INODE 3 FIL_PAGE_INDEX 17855 FIL_PAGE_TYPE_BLOB 10 Uncompressed BLOB page FIL_PAGE_TYPE_ZBLOB 11 1st compressed BLOB page FIL_PAGE_TYPE_ZBLOB2 12 Subsequent compressed BLOB page FIL_PAGE_TYPE_SYS 6 System page FIL_PAGE_TYPE_TRX_SYS 7 Transaction system page others Notes File segment inode B-tree node i-buf bitmap, I-buf free list, file space header, extent desp page, new allocated page
13. InnoDB Pages A page consists of: a page header, a page trailer, and a page body (rows or other contents). Page header Row Row Row Row Row Row Row Row Row Row Row row offset array Page trailer
14. Page Declares typedef struct { ulint pageno; ulint boffset; } fil_addr_t; typedef struct { ulint checksum; ulint page_offset; fil_addr_t previous; fil_addr_t next; dulint page_lsn; /* a space address */ /* page number within the file */ /* byte offset within the page */ /* /* /* /* /* checksum of the page (since 4.0.14) */ page offset inside space */ offset or fil_addr_t */ offset or fil_addr_t */ lsn of the end of the newest modification log record to the page */ PAGE_TYPE page type; /* file page type */ dulint file_flush_lsn;/* the file has been flushed to disk at least up to this lsn */ int space_id; /* space id of the page */ char data[]; /* will grow */ ulint page_lsn; /* the last 4 bytes of page_lsn */ ulint checksum; /* page checksum, or checksum magic, or 0 */ } PAGE, *PAGE;
15. InnoDB Compressed Pages Page header • InnoDB keeps a “modification log” in each page • Updates & inserts of small compressed data records are written to the log w/o page reconstruction; deletes don’t even require uncompression modification log empty space BLOB pointers page directory Page trailer • Log also tells InnoDB if the page will compress to fit page size • When log space runs out, InnoDB uncompresses the page, applies the changes and recompresses the page
16. InnoDB Rows … prefix(768B) … COMACT format overflow page 20 bytes … … DYNAMIC format overflow page Record hdr Trx ID Roll ptr Fld ptrs overflow-page ptr .. Field values
17. InnoDB Indexes - Primary … 001 – 500 001 275 ●Data PK values 001 - nnn … 500 – 800 276 – 500 501 630 631 768 Key values 501501-630 + data for corresponding rows 801 – nnn 769 800 801 949 950 xxx xxx nnn clustered (primary key) index Primary Index rows are stored in the B-tree leaf nodes of a clustered index ● B-tree is organized by primary key or non-null unique key of table, if defined; else, an internal column with 6-byte ROW_ID is added.
18. InnoDB Indexes - Secondary ● Secondary index Btree leaf nodes contain, for each key value, the primary keys of the corresponding rows, used to access clustering index to obtain the data clustered clustered (primary key) PK values (primary key) index 001 index - nnn B-tree leaf nodes, containing data key values A Z Secondary Index B-tree leaf nodes, containing PKs Secondary index Secondary index
19. InnoDB Logging Rollback segments Log Buffer Buffer Pool log thread write thread Log File #1 redo log Log File #2 DATA log files rollback ibdata files
20. InnoDB Redo Log end of log start of log last checkpoint min LSN Redo log structure: Space id PageNo OpCode Data
21. File Format Management • Builtin InnoDB format: “Antelope” • New “Barracuda” format enables compression,ROW_FORMAT=DYNAMIC .ibd data files (file per table) • Fast index creation, other features do not require Barracuda file format • Builtin InnoDB can access “Antelope” databases, but not “Barracuda” databases • Check file format tag in system tablespace on startup • Enable a file format with new dynamic parameter innodb_file_format • Preserves ability to downgrade easily
22. InnoDB File Format Design Considerations • Durability • Logging, doublewrite, checksum; • Performance • Insert buffering, table compression • Efficiency • Dynamic row format, table compression • Compatibility • File format management
23. Source Code Structure • 31 subdirectories • Relevant InnoDB source files on file formats • Tablespace: fsp0fsp {.c, .ic, .h} • Page: page0page, page0zip {.c, .ic, .h} • Log: log0log {.c, .ic, .h}
24. Source Code Subdirectories • • • • • • • • • • • buf data db dict dyn eval fil fsp fut ha handler • • • • • • • • • • ibuf include lock log math mem mtr os page pars • • • • • • • • • • que read rem row srv sync thr trx usr ut
25. Summary: Durability, Performance, Compatibility & Efficiency • InnoDB is the leading transactional storage engine for MySQL • InnoDB’s architecture is well-suited to modern, online transactional applications; as well as embedded applications. • InnoDB’s file format is designed for high durability, better performance, and easy to manage
26. For More Information … 2009 MySQL User Conference InnoDB Birds of a Feather Wed 7:30pm Ballroom C • Heikki Tuuri: Concurrency Control: How it Really Works, Thurs, 2:50pm Please visit www.innodb.com, blogs.innodb.com and forums.innodb.com
28. an company Plugin Hot Backup Embedded
29. InnoDB Size Limits • • • • Max # of tables: 4 G Max size of a table:'>table: 32TB Columns per table:'>table: 1000 Max row size:'>size: n*4 GB • 8 kB if stored on the same page • n*4 GB with n BLOBs • Max key length: 3500 • Maximum tablespace size:'>size: 64 TB • Max # of concurrent trxs: 1023