The other day we were doing some peer review on a schema change at VividCortex and the topic of character set, column length, and the potential downsides of using utf8 came up. I knew from experience that there are some hidden gotchas with this, and usually I’ll just say a couple words and send a link with further reading. But Google turned up nothing, so I am writing this post to try to help fix that.
TL;DR version is that when MySQL can’t use an index for operations such as ORDER BY, it may allocate fixed-length memory buffers large enough to hold the worst-case values, and the same might apply to temporary tables on disk. This can be seen in EXPLAIN as “Using filesort; using temporary” and similar. And when this happens, you might end up making MySQL do gigabytes worth of work to finish a query on a table that’s only a fraction of that size.
Let’s see why this can happen.
The Theory of UTF8
In theory, you can “just use utf8 for everything,” and there is lots of advice on the Internet suggesting you should. There are good reasons for this. If you use a 1-byte character set, such as the default latin1, and put multibyte data into it (which is very easy to do), you can end up with quite a mess that’s pretty hard to undo. Lots of funny-looking characters can result. You know, those little diamonds with a question mark in the middle, or squares with four little digits inside instead of a character.
In theory, there’s no downside to using utf8 for ASCII data. The “normal” characters all fit in one byte in utf8 anyway, so the bytes end up being the same, regardless of the character set. You’d only get multi-byte characters when you go outside the ASCII range. (I’m being a bit vague with terms like ASCII to avoid the verbosity of being precise. Please correct me, or ask questions, if I’m taking too much of a shortcut.)
And in theory, you can use VARCHAR for everything, too. A VARCHAR begins with a little header that says how long the value is. For VARCHAR(255), for example, there will be 2 bytes that say how long the value is, followed by the data itself. If you use, say, VARCHAR(255) for a column, even if it’s not really going to store values that long, you theoretically pay no extra cost.
So in theory, this is a fine table definition:
CREATE TABLE `t1` ( `a` varchar(255) CHARACTER SET utf8 NOT NULL, `b` varchar(255) CHARACTER SET utf8 NOT NULL, KEY `a` (`a`) )
Now let’s talk about the reality.
Temporary Tables and Filesorts
Suppose you write the following query against the table shown above:
SELECT a FROM t1 WHERE a > 'abc' ORDER BY a LIMIT 20;
MySQL can execute this query by looking into the index on the
a column, finding the first value that’s greater than “abc”, and reading the next 20 values. The index is already sorted, so the ORDER BY is automatically satisfied. This is excellent.
The difficulty comes when, for example, you ORDER BY a different column.
SELECT a FROM t1 WHERE a > 'abc' ORDER BY b LIMIT 20;
In this case, MySQL will have to use a so-called “filesort.” Filesort doesn’t really mean files are sorted. It should be called “sort, which may use a file if it overflows the buffer.”
MySQL has a couple of sort algorithms. These are covered in some detail in High Performance MySQL, especially in Chapter 6 under Sort Optimizations. The manual also discusses them, as does Sergey Petrunia’s blog post. In brief, though, MySQL does a sort by putting values into a buffer in memory. The size of this buffer is specified by the
sort_buffer_size server variable. If all the values fit into the buffer, MySQL sorts in-memory; if not, then it writes to temp files on disk and does a merge-sort.
This is where the gotchas start to appear and theory diverges from reality in two important ways:
- VARCHAR isn’t variable-length anymore. The table’s on-disk storage may be variable, but the values are stored in fixed-length in memory, at their full possible width. Ditto for the temporary files.
- utf8 isn’t single-byte anymore. The fixed-length values are created large enough to accomodate rows that are all 3-byte characters.
So your “hello, world” row that consumes 2+12 bytes on disk suddenly consumes 2+(3*255)=767 bytes in memory for the sort. Or on disk, if there are too many rows to fit into the sort buffer.
It can potentially get worse than this, too. The dreaded “Using filesort” is bad enough, but it could be “Using temporary; Using filesort” which should strike fear into your heart. This means MySQL is creating a temporary table for part of your query. It might use an in-memory temporary table with the MEMORY storage engine, which pads rows to full-length worst-case; it might also use on-disk MyISAM storage for the temp table too. There’s more about this in High Performance MySQL as well.
This is a fairly involved topic, with lots of good reading in the manual, the Internet, and The Book Whose Name I Shall Not Keep Repeating.
Details aside, the point is it’s fairly easy (and not uncommon) to create a situation where your temp partition fills up with 10GB of data from a single innocent-looking query against a table that’s a fraction of that size on disk. All because of utf8 and/or long VARCHAR lengths!
My advice is generally to consider things like character sets and VARCHAR length limits in three ways:
- Constraints on what data can/should be stored.
- Accomodation for the data you want to store; make sure your database can hold what you want to put into it.
- Hints and worst-case bounds for the query execution process, which will sometimes be pessimistic/greedy and do the work the schema indicates might be needed, rather than the work that’s actually needed for the data that’s stored.
There’s some balance between overly tight schema constraints, which might force you to do an ALTER in the future (yuck, especially on large tables) versus overly loose constraints, which might surprise you in a bad way. It comes a bit from experience and unless you have a crystal ball, even with experience you’ll get bitten sometimes!
A relevant tweet I saw today:
Happy schema optimizing!