Wednesday, July 15, 2015

MySQL - Speed of SELECT Queries ?



In general, when you want to make a slow SELECT ... WHERE faster, the first thing to check is whether or not you can add an index.

All references between different tables should usually be done with indexes.
You can use the EXPLAIN command to determine which indexes are used for a SELECT.

Some general tips:

To help MySQL optimize queries better, run myisamchk --analyze on a table after it has been loaded with relevant data. This updates a value for each index part that indicates the average number of rows that have the same value. (For unique indexes, this is always 1, of course.). MySQL will use this to decide which index to choose when you connect two tables with 'a non-constant expression'. You can check the result from the analyze run by doing SHOW INDEX FROM table_name and examining the Cardinality column.
To sort an index and data according to an index, use myisamchk --sort-index --sort-records=1 (if you want to sort on index 1). If you have a unique index from which you want to read all records in order according to that index, this is a good way to make that faster. Note, however, that this sorting isn't written optimally and will take a long time for a large table!

How MySQL Optimizes WHERE Clauses ?
The WHERE optimizations are put in the SELECT part here because they are mostly used with SELECT, but the same optimizations apply for WHERE in DELETE and UPDATE statements.

Also note that this section is incomplete. MySQL does many optimizations, and we have not had time to document them all.

Some of the optimizations performed by MySQL are listed below:

Removal of unnecessary parentheses:
((a AND b) AND c OR (((a AND b) AND (c AND d))))
-> (a AND b AND c) OR (a AND b AND c AND d)

Constant folding:
(a
-> b>5 AND b=c AND a=5

Constant condition removal (needed because of constant folding):
(B>=5 AND B=5) OR (B=6 AND 5=50) OR (B=7 AND 5=6)
-> B=5 OR B=6

Constant expressions used by indexes are evaluated only once.
COUNT(*) on a single table without a WHERE is retrieved directly from the table information. This is also done for any NOT NULL expression when used with only one table.
Early detection of invalid constant expressions. MySQL quickly detects that some SELECT statements are impossible and returns no rows.
HAVING is merged with WHERE if you don't use GROUP BY or group functions (COUNT(), MIN()...).
For each sub-join, a simpler WHERE is constructed to get a fast WHERE evaluation for each sub-join and also to skip records as soon as possible.
All constant tables are read first, before any other tables in the query. A constant table is:
An empty table or a table with 1 row.
A table that is used with a WHERE clause on a UNIQUE index, or a PRIMARY KEY, where all index parts are used with constant expressions and the index parts are defined as NOT NULL.
All the following tables are used as constant tables:
mysql> SELECT * FROM t WHERE primary_key=1;
mysql> SELECT * FROM t1,t2
WHERE t1.primary_key=1 AND t2.primary_key=t1.id;

The best join combination to join the tables is found by trying all possibilities. If all columns in ORDER BY and in GROUP BY come from the same table, then this table is preferred first when joining.
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
If you use SQL_SMALL_RESULT, MySQL will use an in-memory temporary table.
Each table index is queried, and the best index that spans fewer than 30% of the rows is used. If no such index can be found, a quick table scan is used.
In some cases, MySQL can read rows from the index without even consulting the data file. If all columns used from the index are numeric, then only the index tree is used to resolve the query.
Before each record is output, those that do not match the HAVING clause are skipped.
Some examples of queries that are very fast:

mysql> SELECT COUNT(*) FROM tbl_name;
mysql> SELECT MIN(key_part1),MAX(key_part1) FROM tbl_name;
mysql> SELECT MAX(key_part2) FROM tbl_name
WHERE key_part_1=constant;
mysql> SELECT ... FROM tbl_name
ORDER BY key_part1,key_part2,... LIMIT 10;
mysql> SELECT ... FROM tbl_name
ORDER BY key_part1 DESC,key_part2 DESC,... LIMIT 10;

The following queries are resolved using only the index tree (assuming the indexed columns are numeric):

mysql> SELECT key_part1,key_part2 FROM tbl_name WHERE key_part1=val;
mysql> SELECT COUNT(*) FROM tbl_name
WHERE key_part1=val1 AND key_part2=val2;
mysql> SELECT key_part2 FROM tbl_name GROUP BY key_part1;

The following queries use indexing to retrieve the rows in sorted order without a separate sorting pass:

mysql> SELECT ... FROM tbl_name ORDER BY key_part1,key_part2,..
. mysql> SELECT ... FROM tbl_name ORDER BY key_part1 DESC,key_part2 DESC,...

How MySQL Optimizes DISTINCT ?
DISTINCT is converted to a GROUP BY on all columns, DISTINCT combined with ORDER BY will in many cases also need a temporary table.

When combining LIMIT # with DISTINCT, MySQL will stop as soon as it finds # unique rows.

If you don't use columns from all used tables, MySQL will stop the scanning of the not used tables as soon as it has found the first match.

SELECT DISTINCT t1.a FROM t1,t2 where t1.a=t2.a;

In the case, assuming t1 is used before t2 (check with EXPLAIN), then MySQL will stop reading from t2 (for that particular row in t1) when the first row in t2 is found.

How MySQL Optimizes LEFT JOIN and RIGHT JOIN ?
A LEFT JOIN B in MySQL is implemented as follows:

The table B is set to be dependent on table A and all tables that A is dependent on.
The table A is set to be dependent on all tables (except B) that are used in the LEFT JOIN condition.
All LEFT JOIN conditions are moved to the WHERE clause.
All standard join optimizations are done, with the exception that a table is always read after all tables it is dependent on. If there is a circular dependence then MySQL will issue an error.
All standard WHERE optimizations are done.
If there is a row in A that matches the WHERE clause, but there wasn't any row in B that matched the LEFT JOIN condition, then an extra B row is generated with all columns set to NULL.
If you use LEFT JOIN to find rows that don't exist in some table and you have the following test: column_name IS NULL in the WHERE part, where column_name is a column that is declared as NOT NULL, then MySQL will stop searching after more rows (for a particular key combination) after it has found one row that matches the LEFT JOIN condition.
RIGHT JOIN is implemented analogously as LEFT JOIN.

The table read order forced by LEFT JOIN and STRAIGHT JOIN will help the join optimizer (which calculates in which order tables should be joined) to do its work much more quickly, as there are fewer table permutations to check.

Note that the above means that if you do a query of type:

SELECT * FROM a,b LEFT JOIN c ON (c.key=a.key) LEFT JOIN d (d.key=a.key) WHERE b.key=d.key

MySQL will do a full scan on b as the LEFT JOIN will force it to be read before d.

The fix in this case is to change the query to:

SELECT * FROM b,a LEFT JOIN c ON (c.key=a.key) LEFT JOIN d (d.key=a.key) WHERE b.key=d.key

How MySQL Optimizes LIMIT ?
In some cases MySQL will handle the query differently when you are using LIMIT # and not using HAVING:

If you are selecting only a few rows with LIMIT, MySQL will use indexes in some cases when it normally would prefer to do a full table scan.
If you use LIMIT # with ORDER BY, MySQL will end the sorting as soon as it has found the first # lines instead of sorting the whole table.
When combining LIMIT # with DISTINCT, MySQL will stop as soon as it finds # unique rows.
In some cases a GROUP BY can be resolved by reading the key in order (or do a sort on the key) and then calculate summaries until the key value changes. In this case LIMIT # will not calculate any unnecessary GROUP BY's.
As soon as MySQL has sent the first # rows to the client, it will abort the query.
LIMIT 0 will always quickly return an empty set. This is useful to check the query and to get the column types of the result columns.
The size of temporary tables uses the LIMIT # to calculate how much space is needed to resolve the query.


No comments: