When you send a query to MySQL, you usually don’t need to think explicitly about the types of the expressions in your query. If you compare this to a lot of programming languages, you’ll find that it’s not always the way things work. In strongly typed languages like Java, for example, typing is very strict.
In this respect, MySQL is much more like a dynamically typed language such as PHP or Perl — a DWIM (do what I mean) typing system. Yet, internally, every expression in MySQL has a type, and it does conversions amongst them as needed.
Sometimes, you might wonder how does this query work and exactly what’s happening to the variables in these expressions? Importantly, does it always work right?
Examples of MySQL Type Conversion
Let’s look at MySQL in action and see if we can figure out how it’s handling expression types. I’ll start with a simple example: it’s actually possible to add a character string to a number. Does the following expression produce the number 7, or the string
It produces the number 7. This ability to automatically convert between strings and numbers is very handy in some cases. But what if you add a string that’s not numeric?
SELECT "hello world" + 2;
What is the result of that expression, and what is its type? It is the number 2. Why? Looking at the warnings from that statement can help:
mysql> show warnings; +---------+------+-------------------------------------------------+ | Level | Code | Message | +---------+------+-------------------------------------------------+ | Warning | 1292 | Truncated incorrect DOUBLE value: 'hello world' | +---------+------+-------------------------------------------------+
That’s interesting — it shows that a few things happened:
- Apparently string conversion to numbers works by throwing away non-numeric suffixes (more on this later).
- The value
hello worldwas truncated to the empty string.
- The empty string was turned into the floating-point value 0.0.
- Presumably the value 2 was also treated as a float, hence it became 2.0.
Looks like a lot is going on behind the scenes. Let’s keep looking at some more examples and see what else we can find.
Example 2: Date Math
Date and time expressions in MySQL often look like strings, but you can do math on these types too. What is the underlying type of a date… is it a string, a number, some composite type? To try to find out, let’s fill a small table with some dates, using the
date column type:
mysql> select * from t; +------------+ | c | +------------+ | 2014-01-01 | | 2014-01-28 | | 2014-12-31 | +------------+
Now let’s execute a query that has an expression involving a date and a number on one side, and a string on the other. We’ll look for rows where the date is before January 21st. This should be fun. What’ll happen?
mysql> select c from t where c+10 < '2014-02-01'; Empty set, 1 warning (0.00 sec)
Whoops, nothing happened except for this warning:
Truncated incorrect DOUBLE value: '2014-02-01'. There are certainly values in the table that meet the criterion we tried to express. But it looks like the type of the entire expression is DOUBLE. That is, we have an expression whose LHS is composite (date plus number) and evaluates to DOUBLE, and the RHS is a string that is apparently coerced to a type as required by what’s on the left. Hmmm, some type precedence rules might be emerging here. Let’s try something else:
mysql> select c from t where c+10 < '2014-02-01'+interval 2 day; +------------+ | c | +------------+ | 2014-01-01 | | 2014-01-28 | +------------+
Looks good! Apparently the expression on the right was treated as a
DATE type, and the expression on the left was coerced to that too… or was it? Let’s try a variation, looking across a year boundary:
mysql> select c from t where c+10 < '2015-01-01'+interval 2 day; +------------+ | c | +------------+ | 2014-01-01 | | 2014-01-28 | | 2014-12-31 | +------------+ 3 rows in set (0.00 sec)
Something’s wrong. That last row shouldn’t be included in the results. Let’s see what’s really happening in this query, by selecting each subexpression separately:
mysql> select c, c+10, '2015-01-01'+interval 2 day from t where c+10 < '2015-01-01'+interval 2 day; +------------+----------+-----------------------------+ | c | c+10 | '2015-01-01'+interval 2 day | +------------+----------+-----------------------------+ | 2014-01-01 | 20140111 | 2015-01-03 | | 2014-01-28 | 20140138 | 2015-01-03 | | 2014-12-31 | 20141241 | 2015-01-03 | +------------+----------+-----------------------------+
So apparently what happens is the left-hand-side treats a
DATE as a number and adds a number to it. The result is 20141241, which isn’t a valid date. Everything’s being converted to
DATE, in the expression in the WHERE clause. I remember the first time I saw this. I was shocked because I’d learned by copying another programmer’s casual use of addition for
DATE types. I wonder how much similar code out there in the wild is breaking every time it approaches year or month boundaries!
I happen to know that a
DATE value is handled internally as a number with the year, month, and day encoded into the decimal positions you see above. Clearly, the
DATE type has characteristics that result directly from this underlying type.
Example 3: Out-Of-Place Parens
Someone asked me the following on Twitter:
> SELECT RIGHT(title,LENGTH(title+1)) FROM news > > returns the last character. If it were title-1 the last two chars returned. why?
Look carefully at the expression. The query is presumably trying to get the last character from a string of text, or some variation of that. What do you suppose is happening?
The problem is he’s trying to add and subtract a value (1 or -1) from the length of the title, but that’s not what he’s doing at all. He’s adding and subtracting 1 from the title itself. What do you suppose is in that title? Probably values like “To Kill A Mockingbird.” What happens when you add 1 to that or subtract 1 from that? Even more fun, what’s going to happen when the query processes the row containing “Catch-22” or “5 Ways To Be A Better Coder”? Pretty meta, isn’t it?
This was on Twitter, so who knows, but my guess is MySQL was showing him a warning and he wasn’t seeing it or was ignoring it. The fix, of course, was to move the math outside the parens. (Someone suggested that the fix was to use a stricter database, which isn’t a bad answer either depending on the use case).
What Are The Type Conversion Rules?
Most of the above examples seem to display some kind of logic, some kind of rules for type conversion. But what are they?
I cut my teeth on SQL Server, although it’s been a long time. I remember it being really strict about typing, requiring types to be converted to compatible types for operations. You certainly could not just add a number to a string; it would not even execute such a query. I also remember the documentation was very clear on what types were compatible, what kinds of precedence each type had relative to other types, and the result of various operations on two compatible but different types.
I found the SQL server type documentation and it was pretty much as I remembered. PostgreSQL’s behavior is also clearly defined. The relevant docs are here. MySQL’s documentation is less formal in my opinion. You can find it here.
The effect is that there’s more room for ambiguity in MySQL. An expression’s type appears to be context-dependent, not dependent merely on the types of the operators. I’m not 100% sure this is true, but it seems to be true in some cases.
Many functions or operators will also accept inputs in several types, so it’s not clear what type the argument is. For example, there is no dedicated syntax for a
DATE literal. You use a string literal and put it into a
DATE context. This is quite different from PostgreSQL, for example, where such types are clearly identified by syntax.
Maybe I haven’t read the docs closely enough (though I think I tried pretty hard), but in MySQL I’m still not always sure what type the result of a function will be. To return to date/time examples (useful because of their ambiguity in syntax), the
DATE_ADD() function will accept a string, date, or number as input, and will return either a date or datetime depending on the input. Or at least, I think it will:
mysql> select date_add('2014-01-01', interval 2 day), date_add(20140101, interval 2 day), date_add('2014-01-01 12:34:56', interval 2 day)\G *************************** 1. row *************************** date_add('2014-01-01', interval 2 day): 2014-01-03 date_add(20140101, interval 2 day): 2014-01-03 date_add('2014-01-01 12:34:56', interval 2 day): 2014-01-03 12:34:56
It’s unclear exactly what types those column values are. Are the first two
DATETIME with a truncated fractional part, or are they
DATE? This type of ambiguity usually doesn’t bother me (the server does what I mean and there’s low cognitive load for me), but when I really want to know what’s going on, it’s sometimes hard for me to feel sure that I do.
When Queries Are Obfuscated
One of the things that makes this even a little bit worse for me is that I usually use prepared statements, so I don’t get to see how the sausage is made. When the query gets to MySQL, is my question-mark placeholder replaced with a literal numeric parameter, or is it quoted and interpreted as a string even though I sent a numeric type into the query? This can even be driver-specific; the Perl dbd::mysql driver, for example, uses emulated client-side prepared statements by default, so all the parameters are sent to MySQL as quoted strings.
By the way, in VividCortex (unlike the MySQL slow query log) we explicitly differentiate between prepared and non-prepared statements. You can see statement preparation, execution, and close events separately. You’ll see this in the Action column in Top Queries.
If you’re using an ORM or another system to hide the SQL from you, it’s even harder to check for correctness, because it can be hard to even find out what SQL is really running.
The foregoing has mostly been about correctness, implicit behavior, and silently wrong answers and the (in)ability to know about them. But this goes beyond silent bugs, and can sometimes impact performance too. (At this, my bored audience emits an audible gasp; silent bugs are OK, but slow queries are not! Give me the wrong answer, but make it fast!)
These performance problems are usually counter-intuitive and unexpected, too. The canonical example from the MySQL documentation is an indexed string column that’s compared to a literal number:
SELECT * FROM tbl_name WHERE str_col=1;
As the manual says,
The reason for this is that there are many different strings that may convert to the value 1, such as ‘1’, ‘ 1’, or ‘1a’.
The translation, if you’re not sure what that means, is that if you’re using literal numbers as parameters in queries and you forget to quote them, the situation above will result in an instant table scan. Oops! I’ve seen this happen many times in my consulting career, as well as my own use of MySQL.
MySQL isn’t very explicit or strict about the types of expressions in queries, and as far as I know, it doesn’t have specific syntax to remove the ambiguity. It’s usually fine — it usually does just what you mean — but not always, and it can be hard to figure out why not.
Some solutions to this kind of problem are to:
- Always check for warnings and errors. This can be really hard, and needs to be done in production. Shameless plug: VividCortex makes it easy for developers to self-service and get quick access to this information, in production as well as staging/dev. See this blog post for more on that.
- Learn to read code carefully, and get SQL code reviews. You can write bugs in SQL just like you can in other languages.
Did I miss details, overlook documentation, or get the facts wrong? Please let me know in the comments.
Finally, many other databases disallow ambiguous expressions or implicit type conversions. This is sometimes tedious or annoying, but usually not a big deal, and I much prefer explicitness and safety to convenience. I wonder if it’s realistic to hope for MySQL to become more strict in the future, or if that would be too much of a breaking change?