PLMCE2014-Innodb-Architecture-And-Performance-Optimization
PLMCE2014-Innodb-Architecture-And-Performance-Optimization
PLMCE2014-Innodb-Architecture-And-Performance-Optimization
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
InnoDB <strong>Architecture</strong>, and <br />
<strong>Performance</strong> Op7miza7on <br />
Peter Zaitsev, Ovais Tariq <br />
Percona Live, Santa Clara,CA <br />
April 1,2014 <br />
1 <br />
www.percona.com
<strong>Architecture</strong> and <strong>Performance</strong> <br />
• Advanced <strong>Performance</strong> Op?miza?on requires <br />
transparency/X-‐ray Vision <br />
• Impossible without understanding system <br />
architecture <br />
• Focus on Conceptual Aspects <br />
– Exact Checksum algorithm InnoDB uses is not <br />
important <br />
– What maSers <br />
• How fast is that algorithm ? <br />
• How checksums are checked/updated <br />
2 <br />
www.percona.com
Aspects of <strong>Architecture</strong> <br />
• General <strong>Architecture</strong> <br />
• Storage Files Layout <br />
• Threads <strong>Architecture</strong> <br />
• Indexes <br />
• Buffer Pool <br />
• Page Cleaner Thread <br />
• Adap?ve Hash Index <br />
• Change Buffer <br />
3 <br />
www.percona.com
Aspects of <strong>Architecture</strong> 2 <br />
• Doublewrite Buffer <br />
• Redo Logging <br />
• Recovery <br />
• Mul? Versioning <br />
• Locking and Latching <br />
• BLOB Storage <br />
• Compression <br />
• Miscellaneous <br />
4 <br />
www.percona.com
InnoDB Versions <br />
• Below MySQL 5.5 <br />
– Ancient History. Upgrade Recommended <br />
• MySQL 5.5 <br />
– Scalability further improved <br />
• MySQL 5.6 (GA) <br />
– Further improvements in Scalability <br />
– Full Text Search, fast checksums etc. <br />
– Focus of today <br />
• MySQL 5.7 <br />
– Future <br />
5 <br />
www.percona.com
XtraDB <br />
• Follows MySQL/InnoDB Versions <br />
• Included in Percona Server and MariaDB <br />
– No more available as separate plugin <br />
• Includes all InnoDB features and <br />
improvements plus many more <br />
– hSp://goo.gl/bnq5sI <br />
• Percona Server 5.6 is latest GA version <br />
6 <br />
www.percona.com
Overview of General InnoDB <strong>Architecture</strong> <br />
GENERAL ARCHITECTURE <br />
7 <br />
www.percona.com
General <strong>Architecture</strong> <br />
• Tradi?onal OLTP Engine <br />
– “Emulates Oracle <strong>Architecture</strong>” <br />
• Implemented using MySQL Storage engine API <br />
• Row Based Storage. Row Locking. MVCC <br />
• Data Stored in Tablespaces <br />
• Log of changes stored in circular log files <br />
• Data, Indexes Pages cached in “Buffer Pool” <br />
8 <br />
www.percona.com
Physical Structure of InnoDB Tablespace and Logs <br />
STORAGE FILES LAYOUT <br />
9 <br />
www.percona.com
InnoDB Tablespaces <br />
• All data stored in Tablespaces <br />
– Changes to these databases stored in Circular Logs <br />
– Changes has to be reflected in tablespace before log <br />
record is overwriSen <br />
• Single tablespace or mul?ple tablespaces <br />
– innodb_file_per_table=1 <br />
– Mul?ple tablespace support is enabled by default <br />
• System informa?on is always in main tablespace <br />
– Main tablespace can consist of many files <br />
• They are concatenated <br />
10 <br />
www.percona.com
Tablespace Format <br />
• Collec?on of Segments <br />
– Segment is like a “file” <br />
• Segment is number of extents <br />
– Typically 64 of 16K page sizes <br />
– Smaller extents for very small objects <br />
• First Tablespace page contains header <br />
– Tablespace size <br />
– Tablespace id <br />
11 <br />
www.percona.com
Types of Segments <br />
• Each table is Set of Indexes <br />
– InnoDB has “index organized tables” <br />
• Each index has <br />
– Leaf node segment <br />
– Non Leaf node segment <br />
• Special Segments <br />
– Rollback Segment(s) <br />
– Insert buffer, etc. <br />
12 <br />
www.percona.com
InnoDB On-‐Disk Layout <br />
13 <br />
www.percona.com
InnoDB Space Alloca7on <br />
• Small Segments (less than 32 pages) – by page <br />
• Large Segments <br />
– Extent at the ?me (to avoid fragmenta?on) <br />
• Free pages recycled within same segment <br />
• All pages in extent must be free before it is used <br />
in different segment of same tablespace <br />
– innodb_file_per_table=1 -‐ free space can be used by <br />
same table only <br />
• InnoDB never shrinks its tablespaces <br />
14 <br />
www.percona.com
InnoDB Page Structure <br />
15 <br />
www.percona.com
Storage Tuning Parameters <br />
• innodb_file_per_table <br />
– Store each table in its own file/tablespace <br />
• innodb_autoextend_increment <br />
– Extend all tablespaces in this increment <br />
– Extension of tablespaces can happen concurrently <br />
• innodb_page_size <br />
– May be set to 4K, 8K or 16K <br />
– Must be set before the database is created <br />
16 <br />
www.percona.com
Using File per Table <br />
• Typically more convenient <br />
• Reclaim space from dropped table <br />
• OPTIMIZE TABLE mytable; <br />
– reduce file size aler data was deleted <br />
• Store different tables/databases on different drives <br />
• Backup/Restore tables one by one <br />
• Support for compression <br />
• Will use more space with many tables <br />
• Longer unclean restart ?me with many tables <br />
17 <br />
www.percona.com
<strong>Performance</strong> and InnoDB File Per <br />
Table <br />
• <strong>Performance</strong> is Similar in majority of cases <br />
• Very large number of tables is a problem <br />
– Can use a mix of tables in their own tablespaces <br />
and tables in shared tablespace <br />
• Can help with i-‐node level locking on some <br />
filesystems <br />
– EXT3 <br />
18 <br />
www.percona.com
Drop Table with <br />
innodb_file_per_table <br />
• Dropping tablespace can get expensive <br />
– Even empty one! <br />
• DROP TABLE; TRUNCATE TABLE; some ALTER <br />
• Nega?ve Scalability <br />
– More expensive with larger buffer pool size <br />
• Significant improvements in MySQL 5.6 <br />
• More work in MySQL 5.7 <br />
– Especially for temporary tables <br />
19 <br />
www.percona.com
Dealing with Runaway tablespace <br />
• Main Tablespace does not shrink <br />
– Consider sepng max size <br />
–<br />
innodb_data_file_path=ibdata1:10M:autoextend:<br />
max:10G <br />
• Dump and Restore <br />
• Use Transportable Tablespaces feature <br />
20 <br />
www.percona.com
Transportable Tablespaces <br />
• Copy tables from one loca?on to another <br />
• FLUSH TABLES FOR EXPORT <br />
• Copy .frm, .ibd, .cfg file to a backup loca?on <br />
• UNLOCK TABLES <br />
• Import the tablespace at a different loca?on <br />
– But make sure that the table defini?on is the same on <br />
the second loca?on <br />
• Or use Percona XtraBackup hSp://goo.gl/MVGfIS <br />
21 <br />
www.percona.com
Separate Undo Tablespace <br />
• Store undo tablespace in separate set of files <br />
– innodb_undo_directory <br />
• Path where external undo tablespace are stored <br />
• Should be SSD backed for best performance <br />
– innodb_undo_tablespaces <br />
• How many undo tablespaces to create <br />
• Must be set at database crea?on ?me and cannot be <br />
changed <br />
• The tablespace size grows and never shrinks <br />
• Beware of running long running transac?ons <br />
22 <br />
www.percona.com
Separate Undo Tablespace (cont.) <br />
• Store undo tablespace in separate set of files <br />
– innodb_undo_logs <br />
• earlier innodb_rollback_segments <br />
23 <br />
www.percona.com
Page Checksums <br />
• Protec?on from corrupted data <br />
– Bad hardware, OS Bugs, InnoDB Bugs <br />
– Are not completely replaced by filesystem checksums <br />
• Checked when page is Read to Buffer Pool <br />
• Updated when page is flushed to disk <br />
• Can be significant overhead <br />
– Especially for very fast storage <br />
• Can be disabled by innodb_checksums=0 <br />
24 <br />
www.percona.com
Page Checksums (cont.) <br />
• For Fast Storage you might use faster <br />
checksums <br />
– innodb_checksum_algorithm=crc32 <br />
25 <br />
www.percona.com
What threads are there and what they do <br />
THREADS ARCHITECTURE <br />
26 <br />
www.percona.com
General Thread <strong>Architecture</strong> <br />
• Using MySQL Threads for execu?on <br />
– Normally thread per connec?on <br />
• Transac?on executed mainly by such thread <br />
– LiSle benefit from Mul?-‐Core for single query <br />
• innodb_thread_concurrency can be used to limit <br />
number of execu?ng threads <br />
– This limit is number of threads in kernel <br />
• Including threads doing Disk IO or storing data in TMP Table. <br />
– Limits concurrency but reduces conten?on <br />
27 <br />
www.percona.com
General Thread <strong>Architecture</strong> (cont.) <br />
• innodb_thread_sleep_delay can be used to <br />
control thread queuing <br />
– This limit controls for how long threads sleep <br />
before joining the queue of wai?ng threads <br />
– Only used if innodb_thread_concurrency > 0 <br />
– innodb_adap7ve_max_sleep_delay <br />
• Makes thread sleep delay more adap?ve <br />
• InnoDB can adap?vely tune <br />
innodb_thread_sleep_delay, based on this <br />
28 <br />
www.percona.com
Helper Threads <br />
• Main Thread <br />
– Schedules background ac?vi?es <br />
• Change buffer merges, possible purge, etc. <br />
• IO Threads <br />
– Read -‐ mul?ple threads used for read ahead <br />
• innodb_read_io_threads <br />
– Write -‐ mul?ple threads used for background writes <br />
• innodb_write_io_threads <br />
– Insert Buffer thread used for change buffer merges <br />
29 <br />
www.percona.com
Helper Threads (cont.) <br />
• Page Cleaner Thread <br />
• Purge thread(s) <br />
• Deadlock detec?on thread <br />
• <strong>And</strong> others <br />
30 <br />
www.percona.com
ThreadPool <br />
• Helpful for handling very large number of <br />
connec?ons <br />
– 10K+ connec?ons are possible <br />
• Available in MySQL Enterprise, Percona Server, <br />
MariaDB <br />
• Implementa?ons are a bit different <br />
31 <br />
www.percona.com
How Indexes are Implemented in InnoDB <br />
INDEXES <br />
32 <br />
www.percona.com
Everything is the Index <br />
• InnoDB tables are “Index Organized” <br />
– PRIMARY KEY contains data instead of data pointer <br />
• Hidden PRIMARY KEY is used if not defined (6b) <br />
– Hidden PRIMARY KEY is a global conten?on point <br />
• Data is “Clustered” by PRIMARY KEY <br />
– Data with close PK value is stored close to each other <br />
– Clustering is within page ONLY <br />
• InnoDB system tables INNODB_SYS_TABLES and <br />
INNODB_SYS_INDEXES expose informa?on <br />
33 <br />
www.percona.com
INNODB_SYS_TABLES Example <br />
• Gives good overview informa?on <br />
mysql (information_schema) > select * from INNODB_SYS_TABLES limit 10;<br />
+----------+----------------------------+------+--------+-------+-------------+------------+---------------+<br />
| TABLE_ID | NAME | FLAG | N_COLS | SPACE | FILE_FORMAT | ROW_FORMAT | ZIP_PAGE_SIZE |<br />
+----------+----------------------------+------+--------+-------+-------------+------------+---------------+<br />
| 14 | SYS_DATAFILES | 0 | 5 | 0 | Antelope | Redundant | 0 |<br />
| 11 | SYS_FOREIGN | 0 | 7 | 0 | Antelope | Redundant | 0 |<br />
| 12 | SYS_FOREIGN_COLS | 0 | 7 | 0 | Antelope | Redundant | 0 |<br />
| 13 | SYS_TABLESPACES | 0 | 6 | 0 | Antelope | Redundant | 0 |<br />
| 16 | mysql/innodb_index_stats | 1 | 11 | 2 | Antelope | Compact | 0 |<br />
| 15 | mysql/innodb_table_stats | 1 | 9 | 1 | Antelope | Compact | 0 |<br />
| 18 | mysql/slave_master_info | 1 | 26 | 4 | Antelope | Compact | 0 |<br />
| 17 | mysql/slave_relay_log_info | 1 | 11 | 3 | Antelope | Compact | 0 |<br />
| 19 | mysql/slave_worker_info | 1 | 15 | 5 | Antelope | Compact | 0 |<br />
+----------+----------------------------+------+--------+-------+-------------+------------+---------------+<br />
9 rows in set (0.00 sec)<br />
34 <br />
www.percona.com
INNODB_SYS_INDEXES Example <br />
• Gives good overview informa?on <br />
mysql (information_schema) > select * from INNODB_SYS_INDEXES where table_id in (15,<br />
16);<br />
+----------+---------+----------+------+----------+---------+-------+<br />
| INDEX_ID | NAME | TABLE_ID | TYPE | N_FIELDS | PAGE_NO | SPACE |<br />
+----------+---------+----------+------+----------+---------+-------+<br />
| 17 | PRIMARY | 15 | 3 | 2 | 3 | 1 |<br />
| 18 | PRIMARY | 16 | 3 | 4 | 3 | 2 |<br />
+----------+---------+----------+------+----------+---------+-------+<br />
2 rows in set (0.00 sec)<br />
35 <br />
www.percona.com
Index Structure <br />
• Secondary Indexes refer to rows by Primary Key <br />
– No update when row is moved to different page <br />
• Long Primary Keys are expensive <br />
– Increase size of all indexes <br />
• Random Primary Key Inserts are expensive <br />
– Cause page splits; Fragmenta?on <br />
– Make page space u?liza?on low <br />
• Auto-‐Increment keys are olen beSer than <br />
ar?ficial keys, UUIDs, SHA1 etc. <br />
36 <br />
www.percona.com
More on Clustered Index <br />
• PRIMARY KEY lookups are the most efficient <br />
– Secondary key lookup is essen?ally 2 key lookups <br />
• Op?mized with Adap?ve Hash Index <br />
• PRIMARY KEY ranges are very efficient <br />
– Build Schema keeping it in mind <br />
– (user_id,message_id) may be beSer than (message_id) <br />
• Changing PRIMARY KEY is expensive <br />
– Effec?vely removing row and adding new one <br />
– Effects all secondary indexes <br />
• Sequen?al Inserts give compact, least fragmented <br />
storage <br />
37 <br />
www.percona.com
More on Indexes <br />
• There is no Prefix Index compressions <br />
– Index can be 10x larger than for MyISAM table <br />
– InnoDB has page compression. Not the same thing. <br />
• Indexes contain transac?on informa?on = fat <br />
– Allow to see row visibility = index covering queries <br />
• Fast Index Crea?on <br />
– Secondary indexes built by sor?ng <br />
– Good page fill factor and no fragmenta?on <br />
– innodb_sort_buffer_size <br />
38 <br />
www.percona.com
Fragmenta7on <br />
• Inter-‐row fragmenta?on <br />
– The row itself is fragmented <br />
– Happens in MyISAM but NOT in InnoDB <br />
• Intra-‐row fragmenta?on <br />
– Sequen?al scan of rows is not sequen?al <br />
– Happens in InnoDB, outside of page boundary <br />
• Empty Space Fragmenta?on <br />
– A lot of empty space can be lel between rows <br />
39 <br />
www.percona.com
Fragmenta7on (cont.) <br />
• Defragmenta?on can be done by OPTIMIZE TABLE <br />
– OPTIMIZE TABLE mytable; <br />
• Cannot be done online in MySQL <br />
• Can be done online using pt-‐online-‐schema-‐change <br />
hSp://goo.gl/pKBV2R <br />
pt-online-schema-change --alter<br />
"ENGINE=InnoDB" D=sakila,t=actor<br />
40 <br />
www.percona.com
Structure and working of InnoDB in-‐memory storage area <br />
BUFFER POOL <br />
41 <br />
www.percona.com
What is the Buffer Pool? <br />
• In-‐memory storage area that caches data and <br />
indexes <br />
• The unit of data storage is pages <br />
• All data access requests first go to the buffer pool <br />
• Data not found is loaded from disk and copied to <br />
it <br />
• In-‐memory data access is very fast as compared <br />
to disk IO <br />
• Data modifica?ons are also cached and then <br />
synced to disk in the background <br />
42 <br />
www.percona.com
Important Structures <br />
• LRU List <br />
– List of pages in the buffer pool ordered by last access <br />
?me <br />
– Uses a modified LRU algorithm <br />
– Actually maintains two lists <br />
• List of young block <br />
• List of old blocks <br />
– Addi?onal list when dealing with compressed tables <br />
• unzip_LRU list -‐ maintains a list of uncompressed pages for <br />
compressed tables <br />
43 <br />
www.percona.com
Important Structures (cont.) <br />
• Flush List <br />
– List of dirty pages in the buffer pool <br />
– Dirty pages are pages that contain modified rows <br />
and that need to be synced to data files on disk <br />
– Pages are moved from flush list to LRU list when <br />
changes have been synced to disk <br />
– Flush list size is capped by the redo log size (among <br />
other limits) <br />
44 <br />
www.percona.com
Configuring the Buffer Pool <br />
• innodb_buffer_pool_size is most important <br />
op?on <br />
– Use all your memory nor commiSed to anything else <br />
– Keep overhead into account (~5%) <br />
– Never let Buffer Pool Swapping to happen <br />
– Up to 80-‐90% of memory on InnoDB only Systems <br />
– innodb_buffer_pool_instances=N <br />
• Set to 4-‐16 depending number of cores to get beSer <br />
scalability <br />
• Default is 8 instances <br />
• Percona Server does well with few instances <br />
45 <br />
www.percona.com
Buffer Pool Page Replacement <br />
• InnoDB uses LRU for page replacement <br />
– With Midpoint Inser?on <br />
• innodb_old_blocks_pct, innodb_old_blocks_7me <br />
– Offers Scan resistance from large full table scans <br />
– innodb_old_blocks_7me defaults to 1000ms now <br />
meaning pages read in by scans age out faster <br />
• Page Cleaner thread maintains clean pages in free <br />
list <br />
• Evic?on is done from the list of old blocks <br />
46 <br />
www.percona.com
Buffer Pool Page Replacement (cont.) <br />
• Compressed / uncompressed pages evic?on <br />
– If size of unzip_LRU list
Buffer Pool Page Flushing <br />
• Different flushing flavors <br />
– Background flushing <br />
• Adap?ve, async, innodb_max_dirty_pages_pct, LRU <br />
and idle flushing <br />
– Foreground Sync flushing <br />
• Done by query thread that detects the condi?on <br />
• All other query threads are halted <br />
48 <br />
www.percona.com
Neighbor Page Flushing <br />
• InnoDB looks for next and previous dirty pages <br />
and flushes them as well to keep IO more <br />
bulky <br />
• Controlled by innodb_flush_neighbors <br />
• This is a good op?miza?on for HDD <br />
• Such behavior may be a poor choice <br />
– Especially for SSD which does not have random IO <br />
penalty <br />
– Disable it with innodb_flush_neighbors=0 <br />
49 <br />
www.percona.com
Buffer Pool Dump and Restore <br />
• Ability to dump Buffer Pool contents allowing <br />
for warm cache on restarts <br />
• Automa?cally dump and reload on shutdown <br />
and restart <br />
– innodb_buffer_pool_dump_at_shutdown <br />
– innodb_buffer_pool_load_at_startup <br />
• On-‐demand dump and reload <br />
– innodb_buffer_pool_dump_now <br />
– innodb_buffer_pool_load_now <br />
50 <br />
www.percona.com
What’s in the Buffer Pool? <br />
• Informa?on about Buffer Pool pages and <br />
sta?s?cs is available in I_S tables <br />
– INNODB_BUFFER_PAGE_LRU <br />
– INNODB_BUFFER_PAGE <br />
– INNODB_BUFFER_POOL_STATS <br />
mysql > select POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, OLD_DATABASE_PAGES, MODIFIED_DATABASE_PAGES from<br />
INNODB_BUFFER_POOL_STATS;<br />
+---------+-----------+--------------+----------------+--------------------+-------------------------+<br />
| POOL_ID | POOL_SIZE | FREE_BUFFERS | DATABASE_PAGES | OLD_DATABASE_PAGES | MODIFIED_DATABASE_PAGES |<br />
+---------+-----------+--------------+----------------+--------------------+-------------------------+<br />
| 0 | 8191 | 8043 | 148 | 0 | 0 |<br />
+---------+-----------+--------------+----------------+--------------------+-------------------------+<br />
1 row in set (0.00 sec)<br />
51 <br />
www.percona.com
What’s in the Buffer Pool? (cont.) <br />
• InnoDB Status also gives high level picture of <br />
the buffer pool <br />
– Percona Server exposes more informa?on <br />
– SHOW ENGINE INNODB STATUS\G <br />
----------------------<br />
BUFFER POOL AND MEMORY<br />
----------------------<br />
Total memory allocated 137363456; in additional pool allocated 0<br />
Dictionary memory allocated 43128<br />
Buffer pool size 8191<br />
Free buffers 8043<br />
Database pages 148<br />
Old database pages 0<br />
Modified db pages 0<br />
Pending reads 0<br />
Pending writes: LRU 0, flush list 0, single page 0<br />
Pages made young 0, not young 0<br />
0.00 youngs/s, 0.00 non-youngs/s<br />
Pages read 148, created 0, written 1<br />
0.00 reads/s, 0.00 creates/s, 0.00 writes/s<br />
No buffer pool page gets since the last printout<br />
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s<br />
LRU len: 148, unzip_LRU len: 0<br />
I/O sum[0]:cur[0], unzip sum[0]:cur[0]<br />
52 <br />
www.percona.com
Data Dic7onary <br />
• Holds informa?on about InnoDB Tables <br />
– Sta?s?cs; Auto-‐increment value, System informa?on <br />
– Can be 4KB to 10KB per table <br />
• Can consume a lot of memory with huge number <br />
of tables <br />
– Think hundreds of thousands <br />
• Data dic?onary LRU <br />
– Controlled automa?cally by table_defini7on_cache <br />
• If you have many tables increasing it is important <br />
53 <br />
www.percona.com
Buffer Pool Improvements in Percona <br />
Server <br />
• Lot of work in Percona Server to improve <br />
buffer pool scalability <br />
• Different mutex for different jobs reduce <br />
global conten?on <br />
– LRU list mutex <br />
– Flush list mutex <br />
– Free list mutex <br />
– and others .. hSp://goo.gl/Ln1Z9a <br />
54 <br />
www.percona.com
How does it func?on and what work does it do? <br />
PAGE CLEANER THREAD <br />
55 <br />
www.percona.com
Page Cleaner Thread <br />
• Handles all types of background flushing <br />
– Flushes pages from end of LRU list <br />
– Flushes pages from flush list <br />
• Wakes up once per second <br />
– Skips sleeping if server is idle <br />
• Server will do merges and flushes faster when it is <br />
idle <br />
• In some cases flushing can be done from user <br />
threads <br />
– Percona Server allows you to disable this <br />
56 <br />
www.percona.com
LRU List Flushing <br />
• Can happen in workloads with data sets larger <br />
than memory <br />
• Flushes dirty pages from the end of the LRU <br />
list <br />
• Moves clean pages at the tail of the LRU to the <br />
free list <br />
• Page cleaner does LRU flushing in all cases <br />
with some excep?ons <br />
57 <br />
www.percona.com
LRU List Flushing (cont.) <br />
• innodb_lru_scan_depth=N <br />
– How deeply to examine tail for dirty pages <br />
– User thread may scan up to this depth as well if no <br />
page available in free list <br />
– Important to tune to prevent synchronous flushes <br />
58 <br />
www.percona.com
Synchronous LRU List Flushing <br />
• Synchronous flush from user thread if it cannot <br />
find a clean page in the free list <br />
• Flushes a single page from user thread in case <br />
of synchronous flush <br />
• Typically when page cleaner thread is not able <br />
to keep up with page dirtying rate <br />
59 <br />
www.percona.com
Flushing Overview <br />
• Hard Problem to Balance <br />
– Flush too much now <br />
• Compete with IO which must happen <br />
• Loose possibility of op?miza?on <br />
– Delay too much <br />
• Might have to do too much IO in the future <br />
• Poten?ally cause “stalls” or performance dips <br />
• Lots of con?nuing work <br />
• Too Many tuning op?ons to play with <br />
60 <br />
www.percona.com
Flush List Flushing / Adap7ve Flushing <br />
• Flushing to advance “earliest modify LSN” <br />
– To free log space so it can be reduced <br />
• Mostly done from Page Cleaner from the end <br />
of the flush list <br />
– Approximately done once per second <br />
• Most of writes typically happen this way <br />
• Number of pages to flush per cycle depends on <br />
the load <br />
61 <br />
www.percona.com
Some Variables <br />
• innodb_io_capacity <br />
• innodb_io_capacity_max <br />
• innodb_max_dirty_pages_pct <br />
• innodb_max_dirty_pages_pct_lwm <br />
• innodb_flushing_avg_loops <br />
62 <br />
www.percona.com
Synchronous Flush List Flushing <br />
• Synchronous flushing outside of Page Cleaner <br />
in cases when sync point is reached <br />
• Synchronous flushing is done from user thread <br />
which detects the condi?on <br />
• Every other user thread is blocked to prevent <br />
increase in checkpoint age <br />
• Causes system-‐wide stalls <br />
63 <br />
www.percona.com
Percona Server Improvements <br />
• Focus on BeSer default behavior with no <br />
tuning needed <br />
– Increase Priority of Page Cleaner Thread <br />
– Reduce flushing from user level threads <br />
– Make algorithm more adap?ve to changes in <br />
workload <br />
– BeSer balancing on processing mul?ple buffer <br />
pools <br />
64 <br />
www.percona.com
Some <strong>Performance</strong> Gains <br />
65 <br />
www.percona.com
What does this structure do? <br />
ADAPTIVE HASH INDEX <br />
66 <br />
www.percona.com
What is Adap7ve Hash Index? <br />
• On-‐demand hash index built automa?cally by <br />
InnoDB <br />
• Built on top of exis?ng B+TREE Indexes to <br />
speed up lookups <br />
– Both PRIMARY and Secondary indexes <br />
• Can be built for full index and prefixes <br />
• Par?al Index <br />
– Only built for index values which are accessed <br />
olen <br />
67 <br />
www.percona.com
What is Adap7ve Hash Index? (cont.) <br />
• Adap?ve Hash Index lookup <br />
Search the<br />
adaptive hash<br />
extension_number<br />
1<br />
4<br />
2 6<br />
8<br />
Find a pointer directly to<br />
12 the leaf node of the<br />
clustered index.<br />
10 14<br />
3 5 7 9 11 13 15<br />
4 0xACDC<br />
12 0xACDC<br />
12 0xACDC<br />
2 0xACDC<br />
6 0xACDC 10 0xACDC 14 0xACDC<br />
1 ..<br />
3 .. 5 .. 7 .. 9 ..<br />
11 ..<br />
13 ..<br />
15 ..<br />
68 <br />
www.percona.com
Tuning Adap7ve Hash Index <br />
• No tuning op?on available <br />
– Can either be enabled or disabled <br />
– Enabled by default <br />
• Can be disabled for performance reasons <br />
– innodb_adap7ve_hash_index <br />
– Improves concurrency but reduces performance <br />
• Can be Par??oned in Percona Server <br />
– innodb_adap7ve_hash_index_par77ons=8 <br />
69 <br />
www.percona.com
What is change buffering in InnoDB? <br />
CHANGE BUFFER <br />
70 <br />
www.percona.com
What is Change Buffer? <br />
• Buffer INSERT, UPDATE and PURGE opera?ons <br />
– DELETE is covered as a special UPDATE <br />
• Designed to speed up modifica?ons into <br />
secondary Indexes when index page not in buffer <br />
pool <br />
– Secondary index updates are expensive <br />
– Reported up to 15 ?mes IO reduc?on for some cases <br />
• Works for Non-‐Unique Secondary Indexes only <br />
• If leaf index page is not in buffer pool <br />
– Store a note the page should be updated in memory <br />
71 <br />
www.percona.com
What is Change Buffer ? (cont.) <br />
• If page containing buffered entries is read <br />
from disk they are merged transparently <br />
• InnoDB performs gradual change buffer <br />
merges in background <br />
– Change buffer merges may con?nue across <br />
restarts <br />
– InnoDB slow shutdown forces complete change <br />
buffer merge on shutdown <br />
• innodb_fast_shutdown=0 <br />
72 <br />
www.percona.com
Tuning Change Buffer <br />
• Can take up to ¼ the size of the buffer pool <br />
– Size of the change buffer can be controlled via <br />
innodb_change_buffer_max_size <br />
• Background merge speed may not be enough <br />
– Tune innodb_io_capacity <br />
• You can disable change buffering <br />
– innodb_change_buffering=none <br />
– Can be good for SSDs <br />
– Can also be good if dataset fits in memory <br />
73 <br />
www.percona.com
Change Buffering Overhead <br />
• Full change buffer is useless and wastes <br />
memory <br />
• Delayed change buffering may cause <br />
slowdown <br />
– Too many merges need to happen on page read <br />
• Aler restart merge speed may slow down <br />
– Finding index entries to merge requires random IO <br />
74 <br />
www.percona.com
How is this structure used in InnoDB <br />
DOUBLEWRITE BUFFER <br />
75 <br />
www.percona.com
What is Doublewrite Buffer? <br />
• InnoDB redo logs requires consistent pages for <br />
recovery <br />
• Page write may only be par?ally completed <br />
– Upda?ng part of 16K and leaving the rest <br />
• Doublewrite Buffer is short term page level log <br />
• All page writes go to doublewrite buffer first: <br />
– Write pages to doublewrite buffer; Sync <br />
– Write Pages to their original loca?ons; Sync <br />
– Pages contain tablespace_id+page_id <br />
76 <br />
www.percona.com
How DoubleWrite helps recovery <br />
• On crash recovery pages in this buffer are <br />
compared to their original loca?on <br />
– At least one of them should be conistent <br />
77 <br />
www.percona.com
What is Doublewrite Buffer? (cont.) <br />
• Doublewrite Buffer usage: <br />
Sync to double<br />
write area.<br />
Buffer<br />
Buffer Pool<br />
Sync individual<br />
pages to correct<br />
destinations.<br />
Tablespace<br />
78 <br />
www.percona.com
Tuning Doublewrite Buffer <br />
• No par?cular tuning op?ons available <br />
– Either enable it or disable it <br />
– innodb_doublewrite <br />
• Not much overhead on tradi?onal disks because <br />
writes are sequen?al <br />
• Much more overhead on SSDs <br />
– Also reduces their lifespan <br />
• Makes sense to disable it on atomic writes <br />
guarantee <br />
79 <br />
www.percona.com
Tuning Doublewrite Buffer (cont.) <br />
• Atomic writes guarantees <br />
– ZFS filesystem <br />
– DirectFS filesystem on Fusion-‐io devices <br />
• Percona Server has support <br />
• hSp://goo.gl/ylpDK1 <br />
80 <br />
www.percona.com
What is it and how does it work? <br />
REDO LOGGING <br />
81 <br />
www.percona.com
InnoDB Log Files <br />
• Set of log files (ib_logfile?) <br />
– 2 log files by default. Effec?vely concatenated <br />
• Log Header <br />
– Stores informa?on about last checkpoint <br />
• Log is NOT organized in pages, but records <br />
– Records aligned 512 bytes, matching disk sector <br />
• Log record format “physiological” <br />
– Stores Page# and opera?on to do on it <br />
• Only REDO opera?ons are stored in logs. <br />
82 <br />
www.percona.com
More on Log Files <br />
• The log files are used to perform recovery in case <br />
of an unclean shutdown <br />
– Larger log files -‐ Longer recovery <br />
• There is virtually no limit on the size of the log <br />
files <br />
– Well the combined size of the log files cannot exceed <br />
~512GB <br />
• Percona Server allows different log file block size <br />
– innodb_log_block_size <br />
• Default value is 512 bytes and good in most general cases <br />
• 4096 can be beSer for SSDs <br />
83 <br />
www.percona.com
More on Log Files (cont.) <br />
• Percona Server adds <br />
– innodb_log_checksum_algorithm=crc32 <br />
• Reduce CPU load for heavy writes to log file <br />
• If you're using compressed pages full pages <br />
can be logged to log file. <br />
84 <br />
www.percona.com
More on Log Files (cont.) <br />
• InnoDB Logs usage <br />
01010<br />
Log Files<br />
UPDATE City SET<br />
name = 'Morgansville'<br />
WHERE name = 'Brisbane'<br />
AND CountryCode='AUS'<br />
Buffer Pool<br />
Tablespace<br />
85 <br />
www.percona.com
InnoDB Log IO <br />
• Log are opened in buffered mode <br />
– Even with innodb_flush_method=O_DIRECT <br />
– Percona Server use O_DIRECT for logs as well <br />
• innodb_flush_method=ALL_O_DIRECT <br />
• innodb_log_block_size may need to be adjusted for EXT4 <br />
filesystem for ALL_O_DIRECT to work <br />
• Flushed by fsync() -‐ default or O_SYNC <br />
• Logs which fit in cache may improve performance <br />
– Small transac?ons and <br />
innodb_flush_log_at_trx_commit=1 or 2 <br />
86 <br />
www.percona.com
Resizing Log Files <br />
• Log files are automa?cally resized <br />
– If innodb_log_file_size is changed between <br />
restarts <br />
– Log files are automa?cally resized to match the <br />
new desired size <br />
• Automa?c resizing does not work with <br />
versions < 5.6 <br />
87 <br />
www.percona.com
Tuning Parameters <br />
• Total log file size is calculated as <br />
– innodb_log_file_size * innodb_log_files_in_group <br />
• innodb_log_file_size <br />
– A general principle is to have these big enough to <br />
hold one hour worth of changes <br />
– Extremely important to tune this for write <br />
performance <br />
• innodb_log_files_in_group <br />
– No good reason to change <br />
88 <br />
www.percona.com
Log Buffer <br />
• Redo log records are first wriSen to a log <br />
buffer in memory <br />
• innodb_flush_log_at_7meout=N <br />
– Log buffer is synced to disk every N seconds <br />
irrespec?vely anyhow <br />
– Once per second by default <br />
89 <br />
www.percona.com
Log Buffer (cont.) <br />
• innodb_log_buffer_size <br />
– Values 32-‐256MB typically make sense <br />
– Larger values are good to reduce conten?on <br />
– May need to be larger if using large BLOBs <br />
– See amount of data wriSen to the logs <br />
– Log buffer covering 10sec is good enough <br />
90 <br />
www.percona.com
Log Buffer (cont.) <br />
• innodb_flush_log_at_trx_commit <br />
Log Files<br />
Buffer Pool<br />
Log Buffer OS Cache RAID Cache<br />
Physical Storage<br />
1 fsync every commit<br />
2<br />
fsync once per innodb_flush_log_at_7meout<br />
second<br />
0<br />
fsync once per innodb_flush_log_at_7meout<br />
second<br />
91 <br />
www.percona.com
How InnoDB recovers from crash? <br />
RECOVERY <br />
92 <br />
www.percona.com
Recovery Stages <br />
• Physical Recovery <br />
– Recover par?ally wriSen pages from doublewrite <br />
buffer <br />
• Redo Recovery <br />
– Redo all the changes stored in transac?onal logs <br />
• Undo Recovery <br />
– Rollback uncommiSed transac?ons <br />
93 <br />
www.percona.com
Redo Recovery Stage <br />
• Foreground <br />
– Server is not started un?l it is complete <br />
• Larger Logs = Longer recovery ?me <br />
– Though row sizes, database size, workload also maSer <br />
• Scan Log files <br />
– Buffer modifica?ons on per page basics <br />
– Apply modifica?ons to data file <br />
• LSN stored in the page tells if change needs to be <br />
applied <br />
94 <br />
www.percona.com
Tuning Redo Recovery <br />
• Redo log size maSers <br />
– large logs longer recovery <br />
– tune innodb_log_file_size accordingly <br />
• innodb_max_dirty_pages_pct <br />
– Fewer the dirty pages, faster the recovery <br />
• innodb_buffer_pool_size <br />
– Larger buffer pool means faster IO recovery <br />
95 <br />
www.percona.com
Undo Recovery Stage <br />
• Is done in the background <br />
– Performed aler MySQL is started <br />
• Speed depends on transac?on length <br />
– Very large UPDATE, INSERT... SELECT is problem. <br />
• Is NOT a problem with ALTER TABLE <br />
– Commits every 10000 rows to avoid this problem <br />
– Unless it is Par??oned table <br />
• Faster with larger innodb_log_file_size <br />
• Be careful killing MySQL with runaway update <br />
queries. <br />
96 <br />
www.percona.com
Implementa?on of Mul? Versioning <br />
MULTI VERSIONING <br />
97 <br />
www.percona.com
Mul7 Versioning Overview <br />
• InnoDB maintains old row versions of modified <br />
rows <br />
• Read Transac?on can read old version of row, <br />
while it is modified <br />
– Prevents the need to do locking reads <br />
• Locking reads can s?ll be performed with <br />
SELECT FOR UPDATE and LOCK IN SHARE <br />
MODE Modifiers <br />
98 <br />
www.percona.com
Updates and Locking Reads <br />
• Updates bypass Mul? Versioning <br />
– You can only modify row which currently exists <br />
• Locking Read bypass mul?-‐versioning <br />
– Result from SELECT vs. SELECT .. LOCK IN SHARE <br />
MODE will be different <br />
• Locking Reads are slower <br />
– Because they have to set locks <br />
– Can be 2x+ slower! <br />
– SELECT FOR UPDATE has larger overhead <br />
99 <br />
www.percona.com
Mul7 Versioning Implementa7on <br />
• The most recent row version is stored in the page <br />
– Even before it is commiSed <br />
• Previous row versions stored in undo space <br />
• The number of versions stored is not limited <br />
– Can cause system/undo tablespace size to explode <br />
• Access to old versions require going through <br />
linked list <br />
– Long transac?ons with many concurrent updates can <br />
impact performance. <br />
100 <br />
www.percona.com
Mul7 Versioning Internals <br />
• Each row in the database has <br />
– DB_TRX_ID (6b) – Transac?on inserted/updated row <br />
– DB_ROLL_PTR (7b) -‐ Pointer to previous version <br />
• Dele?on handled as Special Update <br />
• DB_TRX_ID + list of currently running transac?ons <br />
is used to check which version is visible <br />
• Insert and Update Undo Segments <br />
– Inserts history can be discarded when transac?on <br />
commits. <br />
– Update history is used for MVCC implementa?on <br />
101 <br />
www.percona.com
Mul7 Versioning <strong>Performance</strong> <br />
• Only changed columns stored in the undo <br />
segment <br />
• Short rows are faster to update <br />
– Separate table to store counters olen make sense <br />
• Beware of long transac?ons <br />
– Especially containing many updates <br />
• “Rows Read” can be misleading <br />
– Single row may correspond to scanning thousand <br />
of versions/index entries <br />
102 <br />
www.percona.com
Mul7 Versioning Indexes <br />
• Indexes contain pointers to all versions <br />
– Index key 5 will point to all rows which were 5 in <br />
the past <br />
• Indexes contain TRX_ID <br />
– Easy to check entry is visible <br />
– Can use “Covering Indexes” <br />
• Many old versions is performance problem <br />
– Slow down accesses <br />
– Will leave many “holes” in pages when purged <br />
103 <br />
www.percona.com
Dealing with Runaway Transac7ons <br />
• Monitor SHOW ENGINE INNODB STATUS for <br />
transac?ons ACTIVE for long ?me <br />
– Looking for large InnoDB History List Length is also <br />
good idea <br />
• <strong>Innodb</strong>_history_list_length status variable <br />
• Percona Server has feature to kill idle transac?ons <br />
– Idle transac?ons are ones that are open but inac?ve <br />
– innodb_kill_idle_transac7on hSp://goo.gl/poBUVi <br />
104 <br />
www.percona.com
Transac7on Isola7on Modes <br />
• SERIALIZABLE <br />
– Locking reads. Bypass mul? versioning <br />
• REPEATABLE-‐READ (default) <br />
– Read commiSed data at it was on start of transac?on <br />
• READ-‐COMMITTED <br />
– Read commiSed data as it was at start of statement <br />
• READ-‐UNCOMMITTED <br />
– Read uncommiSed data as it is changing live <br />
105 <br />
www.percona.com
Purging the Garbage <br />
• Old row and index entries need to be removed <br />
– When they are not needed for any ac?ve transac?on <br />
• REPEATABLE READ <br />
– Need to be able to read everything at transac?on start <br />
• READ-‐COMMITTED <br />
– Need to read everything at statement start <br />
• Purging can be done by the main InnoDB thread <br />
or by dedicated thread(s) <br />
106 <br />
www.percona.com
Purging the Garbage (cont.) <br />
• Purging is done by default by a single thread <br />
– Can have between 0 -‐ 32 purge threads <br />
– A minimum of 1 purge thread is recommended <br />
– 4 is probably good for write intensive workloads <br />
– More than 1 thread may decrease throughput for IO <br />
bound workload <br />
• innodb_purge_batch_size <br />
– Default is 300 which is good in most cases <br />
– Keep this to a reasonably high value because purging <br />
many rollback segments is expensive <br />
107 <br />
www.percona.com
Purging the Garbage (cont.) <br />
• Purge Thread(s) may be unable to keep up <br />
with intensive updates <br />
– InnoDB “History List Length” will grow high <br />
– innodb_max_purge_lag slows updates down <br />
• A “History List Length” threshold that when crossed, <br />
triggers DML(s) to be throSled <br />
– innodb_max_purge_lag_delay <br />
• Controls maximum delay in milliseconds which may be <br />
added to a DML <br />
108 <br />
www.percona.com
How row-‐level locking and internal locks work? <br />
LOCKING AND LATCHING <br />
109 <br />
www.percona.com
InnoDB Locking Basics <br />
• Row level locking <br />
– No lock escala?on <br />
• Pessimis?c Locking Strategy <br />
• Row Level lock wait ?meout <br />
– innodb_lock_wait_7meout <br />
• Tradi?onal “S” and “X” locks <br />
• Inten?on locks on tables “IS” “IX” <br />
– Restric?ng table opera?ons <br />
110 <br />
www.percona.com
InnoDB Locking Basics (cont.) <br />
• Locks on Rows AND Index Records <br />
• Automa?c deadlock detec?on <br />
– Non-‐recursive <br />
– innodb_print_all_deadlocks <br />
• Prints all deadlocks to the error log <br />
– For some deadlocks you may get <br />
innodb_lock_wait_7meout error <br />
111 <br />
www.percona.com
Gap Locking <br />
• InnoDB not only locks rows but also gap between them <br />
• Needed for Consistent Reads <br />
– Also used by update statements <br />
– READ-‐COMMITTED + RBR can reduce those needs <br />
• InnoDB has no phantom rows even in Consistent Reads <br />
• Gap locking olen cause complex deadlock situa?ons <br />
• “Infinum”, “Supremum” records define bounds of data <br />
stored on the page <br />
• Only record lock is needed for PK Update <br />
112 <br />
www.percona.com
Types of Locks in InnoDB <br />
• Next-‐Key-‐Lock <br />
– Lock Key and gap before the key <br />
• Gap-‐Lock <br />
– Lock just the gap before the key <br />
• Record-‐Only-‐Lock <br />
– Lock record only <br />
• Insert inten?on gap locks <br />
– Held when wai?ng to insert into the gap <br />
113 <br />
www.percona.com
Advanced Gap Locking Stuff <br />
• Gaps can change on row dele?on <br />
– Actually when Purge thread removes record <br />
• Leaving conflic?ng Gap locks held <br />
• Gap Locks are “purely inhibi?ve” <br />
– Only block inser?on. <br />
– Holding lock does not allow inser?on. Must also wait <br />
for conflic?ng locks to be released <br />
• “supremum” record can have lock, “infinum” <br />
can't <br />
• You rarely need it in prac?ce <br />
114 <br />
www.percona.com
InnoDB Lock Storage <br />
• InnoDB locks storage is preSy compact <br />
– This is why there is no lock escala?on! <br />
• Lock space needed depends on lock loca?on <br />
– Locking sparse rows is more expensive <br />
• Each Page having locks gets bitmap allocated <br />
for it <br />
– Bitmap holds lock informa?on for all records on <br />
the page <br />
• Locks typically take 3-‐8 bits per locked row <br />
115 <br />
www.percona.com
Auto Increment Locks <br />
• Tradi?onal vs. Consecu?ve Lock Mode <br />
• Tradi?onal lock mode means taking table lock <br />
• Consecu?ve lock mode means a lightweight <br />
mutex is used instead <br />
• innodb_autoinc_lock_mode – set lock <br />
behavior <br />
– “1” -‐ Does not hold table lock for simple Inserts <br />
– “2” -‐ Does not hold lock in any case. <br />
• Only works with Row level replica?on <br />
116 <br />
www.percona.com
InnoDB Latching <br />
• InnoDB implements its own mutex and RW-‐Locks <br />
– For transparency not only <strong>Performance</strong> <br />
• Latching stats shown in SHOW INNODB STATUS <br />
-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ <br />
SEMAPHORES <br />
-‐-‐-‐-‐-‐-‐-‐-‐-‐-‐ <br />
OS WAIT ARRAY INFO: reserva?on count 13569 <br />
OS WAIT ARRAY INFO: signal count 11421 <br />
-‐-‐Thread 1152170336 has waited at ./../include/buf0buf.ic line 630 for 0.00 seconds the semaphore: <br />
Mutex at 0x2a957858b8 created file buf0buf.c line 517, lock var 0 <br />
-‐-‐Thread 1147709792 has waited at ./../include/buf0buf.ic line 630 for 0.00 seconds the semaphore: <br />
Mutex at 0x2a957858b8 created file buf0buf.c line 517, lock var 0 <br />
Mutex spin waits 5018, rounds 70800, OS waits 2033 <br />
RW-‐shared spins 2736, rounds 84289, OS waits 2791 <br />
RW-‐excl spins 904, rounds 111941, OS waits 3607 <br />
Spin rounds per wait: 14.11 mutex, 30.81 RW-‐shared, 123.83 RW-‐excl <br />
117 <br />
www.percona.com
Latching <strong>Performance</strong> <br />
• Has been improving over the years <br />
– Great improvements in MySQL 5.6 & XtraDB <br />
– S?ll hot spots remain <br />
• innodb_thread_concurrency <br />
– Limi?ng concurrency can reduce conten?on <br />
– Introduces conten?on on its own <br />
• innodb_sync_spin_loops <br />
– Trade Spinning for context switching <br />
– Typically limited produc?on impact <br />
118 <br />
www.percona.com
Current Hot Spots <br />
• log_mutex <br />
– Wri?ng data to the log buffer <br />
• Index-‐>lock <br />
– Lock held for dura?on of low level index modifica?on <br />
– Can be serious hot spot for heavy write workloads <br />
– Improvements in MySQL 5.7 (not a full solu?on) <br />
• Adap?ve Hash Latch <br />
– Global latch to protect access to Adap?ve Hash Index <br />
119 <br />
www.percona.com
Current Hot Spots (cont.) <br />
• Adap?ve Hash Latch <br />
– Introduces conten?on in heavy read-‐write <br />
workload <br />
– innodb_adap7ve_hash_index=0 <br />
• Can slow things down but also reduces conten?on <br />
– Percona Server tackles this by par??oning the <br />
global hash index <br />
• innodb_adap7ve_hash_index_par77ons=N <br />
• Helps when workload is spread across mul?ple tables <br />
120 <br />
www.percona.com
Where is my conten7on <br />
• <strong>Performance</strong> Schema is best way to look <br />
– Check out ps_helper <br />
+---------------------------------------------+---------+--------+<br />
| event_name |nsecs_per| seconds|<br />
+---------------------------------------------+---------+--------+<br />
| wait/synch/rwlock/innodb/index_tree_rw_lock | 19543.3 | 3456.1 |<br />
| wait/synch/mutex/innodb/log_sys_mutex | 2071.3 | 385.8 |<br />
| wait/synch/rwlock/innodb/hash_table_locks | 165.7 | 184.5 |<br />
| wait/synch/mutex/innodb/fil_system_mutex | 328.3 | 113.6 |<br />
| wait/synch/mutex/innodb/redo_rseg_mutex | 1766.4 | 84.9 |<br />
| wait/synch/rwlock/sql/MDL_lock::rwlock | 430.2 | 73.9 |<br />
| wait/synch/mutex/innodb/buf_pool_mutex | 264.5 | 72.7 |<br />
| wait/synch/rwlock/innodb/fil_space_latch | 27216.1 | 53.8 |<br />
| wait/synch/mutex/sql/THD::LOCK_query_plan | 167.0 | 50.7 |<br />
| wait/synch/mutex/innodb/trx_sys_mutex | 394.9 | 41.5 |<br />
+---------------------------------------------+---------+--------+<br />
121 <br />
www.percona.com
How are BLOBs stored in InnoDB <br />
BLOB STORAGE <br />
122 <br />
www.percona.com
Handling Blobs <br />
• Blobs are handled by InnoDB in a special way <br />
– <strong>And</strong> differently by different file formats <br />
• Small blobs <br />
– Whole row fits in ~8000 bytes stored on the page <br />
• Large Blobs <br />
– Can be stored full on external pages (Barracuda) <br />
– Can be stored par?ally on external page <br />
• First 768 bytes are stored on the page (Antelope) <br />
• InnoDB will NOT read external blobs unless they <br />
are needed by the query <br />
123 <br />
www.percona.com
Blobs in a Separate Table <br />
• It Depends :) <br />
• No need to store large blobs in separate table as <br />
they are already stored outside of the row <br />
• Storing medium size blobs (which fit on the page) <br />
in the separate table makes sense <br />
• Only split blobs to separate table if they are <br />
accessed infrequently <br />
• TODO: Would love to be able to specify BLOBs to <br />
be stored off the page <br />
124 <br />
www.percona.com
InnoDB BLOB != MySQL BLOB <br />
• MySQL Has limit of 65535 bytes per row <br />
excluding BLOB and TEXT column <br />
– This limit applies to VARCHAR columns <br />
• InnoDB limit is only 8000 (half a page) <br />
– So long VARCHAR fields may be stored as a BLOB <br />
inside InnoDB <br />
mysql> create table ai(c varchar(40000), d varchar(40000));<br />
ERROR 1118 (42000): Row size too large.<br />
The maximum row size for the used table type, not counting BLOBs, is 65535.<br />
You have to change some columns to TEXT or BLOBs<br />
125 <br />
www.percona.com
BLOB Page Alloca7on <br />
• Each BLOB Stored in separate segment <br />
– Normal alloca?on rules apply. By page when by extent <br />
– One large BLOB is faster than several medium ones <br />
– Many BLOBs can cause extreme waste <br />
• 500 byte blobs will require full 16K page if it does not fit with <br />
complete row <br />
• External BLOB is NOT updated in place <br />
– InnoDB always creates a new version <br />
• Large VARCHAR/TEXT are handled similarly to <br />
BLOB <br />
126 <br />
www.percona.com
How is compression implemented in InnoDB? <br />
COMPRESSION <br />
127 <br />
www.percona.com
Compression Basics <br />
• Requires “Barracuda” and innodb_file_per_table=1 <br />
• Per Page compression (mostly) <br />
• Uses zlib for compression <br />
– innodb_compression_level sets the zlib compression level <br />
• Default is 6 and max is 9 <br />
• ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8; <br />
– Es?mate how well the data will compress <br />
– Compressed page size of 8KB works well in most cases <br />
– Choosing too small a value can cause more page splits <br />
leading to subop?mal performance <br />
128 <br />
www.percona.com
Compression Basics (cont.) <br />
• Per page update log to avoid recompression <br />
• Both Compressed and Uncompressed page can be <br />
stored in Buffer Pool <br />
– Uncompressed pages would be evicted from the <br />
Buffer Pool first <br />
• Compression failure can be expensive <br />
– Causes InnoDB to split the page and compress each <br />
one separately <br />
– InnoDB can automa?cally add padding to prevent <br />
compression failure <br />
129 <br />
www.percona.com
Compression Considera7ons <br />
• Controlling padding to avoid compression failure <br />
– innodb_compression_failure_threshold_pct <br />
• Determines when to start adding padding to avoid <br />
compression failure <br />
– innodb_compression_pad_pct_max <br />
• Max percentage of space that can be reserved in page for <br />
whitespace padding <br />
• KEY_BLOCK_SIZE=16 <br />
– Only compress externally stored BLOBs <br />
– Can reduce size without overhead <br />
130 <br />
www.percona.com
Compression Considera7ons (cont.) <br />
• May need to provision more memory for the <br />
buffer pool to avoid frequent uncompression <br />
• May not be too good for a workload that is <br />
already CPU bound <br />
• Page size is too small for good compression <br />
• Have to “Guess” how data will compress to set <br />
the compressed block size <br />
131 <br />
www.percona.com
Compression Considera7ons (cont.) <br />
• Compression sepng is Per table <br />
– Though some indexes compress beSer than others <br />
• It may be more suitable to compress within <br />
the applica?on before wri?ng to the database <br />
– Can avoid overhead where some part of data <br />
compresses well and some doesn’t <br />
– Can help distribute the CPU consump?on across <br />
applica?on servers <br />
132 <br />
www.percona.com
Compression Sta7s7cs <br />
• Per-‐index compression sta?s?cs <br />
– innodb_cmp_per_index_enabled <br />
• Enables I_S table INNODB_CMP_PER_INDEX <br />
• Can get expensive <br />
• Compression info exposed via I_S tables <br />
– Exposed via INNODB_CMP% tables <br />
– INNODB_CMP, INNODB_CMP_PER_INDEX, <br />
INNODB_CMPMEM, etc. <br />
133 <br />
www.percona.com
Alterna7ves to <strong>Innodb</strong> Compression <br />
• Compress on Applica?on <br />
• Use Filesystem/Storage which support <br />
compression <br />
• FusionIO – Hardware assisted compression <br />
– Inform Flash of “holes” which do not need to be <br />
kept <br />
• TokuDB Storage Engine <br />
– Offers much higher compression rate <br />
134 <br />
www.percona.com
Miscellaneous features and topics <br />
MISCELLANEOUS <br />
135 <br />
www.percona.com
Foreign Keys <br />
• Implemented on InnoDB level <br />
• Requires indexes on both tables <br />
– Can be very expensive some?mes <br />
• Checks happen when row is modified <br />
– No delayed checks ?ll transac?on commit <br />
• Foreign Keys introduce addi?onal locking <br />
overhead <br />
– Many tricky deadlock situa?ons are foreign key <br />
related <br />
136 <br />
www.percona.com
Direct IO Opera7on <br />
• Default IO mode for InnoDB data is Buffered <br />
• Good <br />
– Faster flushing when no write cache <br />
– Faster warm up on restart <br />
– Reduce problems with inode locking on EXT3 <br />
• Bad <br />
– Lost of effec?ve cache memory due to double buffering <br />
– Increased tendency to swap due to IO pressure <br />
• innodb_flush_method=O_DIRECT <br />
– O_DIRECT_NO_FSYNC on some filesystems <br />
– ALL_O_DIRECT on Percona Server <br />
137 <br />
www.percona.com
Direct IO Opera7on (cont.) <br />
• innodb_flush_method=O_DIRECT <br />
01010 0101 <br />
Buffered <br />
Log Files<br />
OS <br />
Cache <br />
Direct DMA write <br />
Buffer Pool<br />
Tablespace<br />
138 <br />
www.percona.com
Direct IO Opera7on (cont.) <br />
• innodb_flush_method=ALL_O_DIRECT <br />
01010 0101 <br />
Direct DMA write <br />
Log Files<br />
OS <br />
Cache <br />
Direct DMA write <br />
Buffer Pool<br />
Tablespace<br />
139 <br />
www.percona.com
READ-‐ONLY Mode <br />
• innodb_read_only=1 <br />
• Make all InnoDB tables and opera?ons read-‐only <br />
– Allows InnoDB to be operated from read-‐only media <br />
– Allows more than one MySQL instances to share the <br />
same dataset <br />
• Following must be done for this to work <br />
– InnoDB slow shutdown <br />
– InnoDB started with change buffering disabled <br />
140 <br />
www.percona.com
Persistent Op7mizer Stats <br />
• Persistent stats are enabled by default <br />
• Enable query execu?on plans to be more stable <br />
• Stats recalculated on-‐demand <br />
• Collect stats via ANALYZE/OPTIMIZE TABLE <br />
• Make sure you collect stats aler ALTER TABLE <br />
• Stats can be copied from master to slave <br />
– Stats tables: mysql.innodb_index_stats, <br />
mysql.innodb_table_stats <br />
141 <br />
www.percona.com
Unlocking Features in XtraBackup <br />
• Some important XtraBackup features are only <br />
available when used with XtraDB <br />
• Fast Incremental Backups <br />
– A bitmap based fast incremental backup strategy <br />
– Relies on “Changed Page Tracking” feature in <br />
XtraDB -‐ hSp://goo.gl/xhjLRc <br />
• Archive Log Backups <br />
– Keep history of modifica?ons in redo log archives <br />
– Use with XtraBackup -‐ hSp://goo.gl/W3bm8F <br />
142 <br />
www.percona.com
<strong>Innodb</strong> Full Text Search <br />
• Available since MySQL 5.6 <br />
• Is NOT 100% same as with MyISAM <br />
• Small/Medium scale implementa?ons <br />
– Larger ones use Solr, Sphinx, Elas?cSearch <br />
• Implemented as several “hidden” <strong>Innodb</strong> tables <br />
– Would require separate tutorial to cover in details <br />
• Loading a lot of data: <br />
– Load the data and when add FTS index <br />
143 <br />
www.percona.com
The Future <br />
SOME HIGHLIGHTS ON MYSQL 5.7 <br />
144 <br />
www.percona.com
More Online <br />
• Online Index Rename <br />
• Online VARCHAR length increase <br />
• Online OPTIMIZE TABLE <br />
145 <br />
www.percona.com
Temporary Tables <br />
• Faster Create/Drop of Temporary Tables <br />
• Can create temporary tables tablespace in <br />
separate path <br />
– <strong>Innodb</strong>_temp_data_path <br />
• Undo logs op?mized to be used for rollback <br />
only <br />
– No redo logging for temporary tables <br />
146 <br />
www.percona.com
General <strong>Performance</strong> <br />
• Automa?c detec?on of read only transac?ons <br />
• Internal Btree handling op?mized for lock <br />
conten?on <br />
• Mul?ple Cleaner Threads supported <br />
147 <br />
www.percona.com
Thank you ! <br />
pz@percona.com <br />
148 <br />
www.percona.com