MySQL: expected time for a simple GROUP BY on a single table, 2M records, single user?

Related to: MySQL tuning (my.cnf) for very large tables and ad hoc queries

I have a really small table with 2M records on a 32GB machine running Windows 10 Pro, with nothing serious running except MySQL 5.7. I expected a simple GROUP BY on any un-indexed column to take seconds, since MySQL would cache the entire table. Instead, it takes about 3 minutes. Am I missing some important tuning tip, or are these the expected times?

the query

mysql> select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----------------+----------+
| specialty_code | count(*) |
+----------------+----------+
| FM             |   199604 |
| GYN            |    14263 |
| OBG            |    78878 |
+----------------+----------+
3 rows in set (2 min 56.27 sec)

the table

mysql> describe physicians;
+-----------------------+---------+------+-----+---------+-------+
| Field                 | Type    | Null | Key | Default | Extra |
+-----------------------+---------+------+-----+---------+-------+
| physician_id          | int(11) | YES  |     | NULL    |       |
| practitioner_id       | int(11) | YES  | MUL | NULL    |       |
| state                 | text    | YES  |     | NULL    |       |
| specialty_code        | text    | YES  |     | NULL    |       |
| specialty_description | text    | YES  |     | NULL    |       |
| CBSA                  | text    | YES  |     | NULL    |       |
+-----------------------+---------+------+-----+---------+-------+
6 rows in set (0.10 sec)    

mysql> show table status like '%physicians%';
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| Name       | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation       | Checksum | Create_options | Comment |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| physicians | InnoDB |      10 | Dynamic    | 1963005 |             66 |   130711552 |               0 |            0 |   2097152 |           NULL | 2016-01-04 08:41:30 | NULL        | NULL       | utf8_general_ci |     NULL |                |         |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+

sizing:

mysql> select count(*) from physicians;
+----------+
| count(*) |
+----------+
|  1957997 |
+----------+
1 row in set (9.91 sec) 

explain:

mysql> explain select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                           |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
|  1 | SIMPLE      | physicians | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1963005 |   100.00 | Using temporary; Using filesort |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+    

params I tuned:

join_buffer_size=2G 
tmp_table_size=5G
innodb_buffer_pool_size=20G
sort_buffer_size=2G

The table uses the default InnoDB engine.

NOTE: this is just a simplified symptom of my bigger problem: I have ~5-6 tables with a total volume of 10 GB. I expected MySQL to swoosh all of them into memory and answer any ad-hoc query within a minute or so… Should I index every column before filtering/grouping/joining on it?

Thanks, and sorry if this sounds like ranting. I’m just used to distributed databases crunching data in milliseconds for me 🙂

How to solve:

Method 1

tmp_table_size=5G

No. Go back to the default. That limit applies per temporary table, so if multiple queries need temp tables at the same time, you could run out of RAM.
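
A minimal sketch of reverting it at runtime, assuming the stock MySQL 5.7 default of 16M (mirror the change in my.cnf so it survives a restart); max_heap_table_size is included because the effective in-memory temp-table limit is the smaller of the two:

-- Put the temp-table limits back to the MySQL 5.7 defaults (16M is the
-- stock default; adjust if your baseline differs). SET GLOBAL does not
-- persist across a restart, so also remove or lower the setting in my.cnf.
SET GLOBAL tmp_table_size      = 16 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 16 * 1024 * 1024;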

Change the TEXT fields to reasonable-length VARCHARs.
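
A sketch of that conversion; the lengths below are assumptions for illustration, not values from the original table, so check your actual data (e.g. SELECT MAX(LENGTH(col))) before picking them:

-- Example lengths only; verify the real maximum lengths first, since
-- values longer than the new VARCHAR size would be rejected or
-- truncated depending on sql_mode.
ALTER TABLE physicians
  MODIFY state                 VARCHAR(2),
  MODIFY specialty_code        VARCHAR(10),
  MODIFY specialty_description VARCHAR(255),
  MODIFY CBSA                  VARCHAR(20);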

The first table scan will do I/O to bring the table into RAM (if possible); subsequent references to the table will see the data cached. Did you run the GROUP BY twice to see this effect?
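
One quick way to check that effect, using InnoDB status counters available in stock MySQL 5.7 (a rough check, not a full diagnosis): if Innodb_buffer_pool_reads stops growing between two runs of the GROUP BY, the second run was served entirely from the buffer pool.

-- Innodb_buffer_pool_reads = reads that had to hit disk;
-- Innodb_buffer_pool_read_requests = logical reads from the pool.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
SELECT specialty_code, COUNT(*) FROM physicians GROUP BY specialty_code;
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';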

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
