Why are data types important in SQL

Frequently in performance talks I hear speakers talk about the importance of using the correct datatypes for storing values in columns. i.e. representing a number with an or , storing IP addresses as , and instead of .

This advice is correct, but today I thought I would try my best to go into a bit more detail 🙂

The Reasons

I can think of three reasons why this optimization is true:

  1. Using numeric data types as strings incurs some added CPU overhead performing character-set and collation work. i.e. it’s not free to make , but MySQL behaves this way by default.
  2. Using correct data types will save space. By ‘space’ usually memory-fit is more important than disk fit, as it can improve cache efficiency. Disk fit can also be important with smaller SSDs.

  3. Some wire protocol and client library buffers are not variable in length. This is a little out of scope of my knowledge area, but you should expect more memory consumption with larger values – so going to just a little bit of effort judging expected length can help.

Memory Fit

I would rate reason (2) above as the most likely reason for why this optimization is correct. In fact, I would say that a large majority of optimizations made by DBAs are to try and stretch out memory fit for as long as possible. i.e. indexes prevent table scans,
and allow you to focus on just the data you need.

Lets take a look at memory fit improvements by concentrating on data types:

  • Short primary keys. There is a lot of space to be saved by keeping your as short as possible. InnoDB uses a clustered index, with the value of the also included in each secondary index as an internal row pointer.
  • Smaller values for indexed columns. Indexed values are essentially duplicated because they appear in both the index and the table row, so there is an amplification created by any inefficiency.

    How much of a memory fit improvement you will gain depends on how the index is accessed. For example:

    • requires all of the index in memory to be efficient.
    • Ranged access such as may also require more memory fit if the ranges are wide. However, this can also be true for non ranged access (i.e. ) if indexed_col is not unique and matches a number of rows.
    • If the index has good cardinality (i.e. a high distribution of values), and the queries executed focus on a smaller percentage of values, then memory fit may be less important.
  • Smaller values for un-indexed columns. It is also important to make sure that any un-indexed columns are small, but to a lesser degree. The case where I expect it to matter the most is if your workload includes table scans, or if there is the situation where ‘hot’ rows are highly scattered amongst a large number of pages.

    Small side-note: MySQL 5.6 allows you to change the InnoDB page size, which may help improve memory fit in this case too.

That’s all I can think of. What am I missing?

Published by

morgo

I joined MySQL AB in 2006, left, and am now back at Oracle working on the MySQL team. I’ve also worked at Percona and InPowered.

View all posts by morgo