I wrote recently about database cardinality, and there’s a closely related topic that’s equally confusing and deserves an explanation too: index selectivity.
Index selectivity is how tightly a database index helps narrow the search for specific values in a table. To illustrate this, imagine there’s a table full of all the people living in the United States: first name, last name, city, and so on. You’d like to query for a specific person. Here’s the general form of the query:
SELECT * FROM PEOPLE WHERE FIRST=? AND LAST=?
The question marks are placeholders for real values. Now, if there’s an index on (FIRST, LAST), how will this query work? We don’t know until we fill in the placeholders.
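To make the setup concrete, here’s a minimal sketch using SQLite from Python (the table layout and data are invented for illustration; a real PEOPLE table would be far bigger):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_name ON people (first, last)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    ("Baron", "Schwartz", "Charlottesville"),
    ("John", "Smith", "Springfield"),
    ("John", "Smith", "Portland"),
])

# The ? placeholders are bound at execution time, so the database
# can't know how many rows will match until the values arrive.
rows = conn.execute(
    "SELECT * FROM people WHERE first=? AND last=?", ("John", "Smith")
).fetchall()
print(len(rows))  # 2
```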
If we fill in (“Baron”, “Schwartz”), it’ll find very few rows. I think there might be a couple more people with my name in the USA, but there aren’t very many. (Fun fact: I’ve only met one other person named Baron in my life: Baron Wormser, formerly the poet laureate of Maine.)
But if we execute the query with (“John”, “Smith”) it’s going to match a lot of rows. According to howmanyofme.com, it’ll match about 47,000 rows! If we’re looking for a specific John Smith, that’s a lot of results to sift through to find the one we want.
It could be worse, though: we could be searching for James, in an index that only includes the first name, leaving the last name un-indexed. There were 4.8 million people named James in the 1990 US Census records, out of a total of 249 million people—about 2% of the overall population and 3.3% of the male population. That’s a lot of rows to filter to find the one we want.
And that’s what databases do: if an index doesn’t narrow down the matching rows enough, the database uses the index to find a set of candidate rows, then scans and filters them to discard the rows it doesn’t want. One of the most important purposes of an index is to narrow that initial candidate set as much as possible, reducing this work.
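That index-then-filter pattern can be sketched with a toy model (not any real engine’s code): here a dict stands in for an index on the first name alone, and the last name is filtered by scanning the candidates.

```python
# Toy "table" and a toy single-column "index" on the first name.
rows = [
    ("James", "Smith"), ("James", "Brown"), ("James", "Lee"),
    ("Baron", "Schwartz"),
]
index = {}  # first name -> list of row ids
for rid, (first, _last) in enumerate(rows):
    index.setdefault(first, []).append(rid)

def lookup(first, last):
    candidates = index.get(first, [])   # the index narrows the search...
    return [rid for rid in candidates   # ...and a scan filters the rest
            if rows[rid][1] == last]

print(lookup("James", "Smith"))  # [0]
```

The more candidates the index leaves behind, the more filtering work remains, which is exactly what selectivity measures.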
That’s where selectivity comes in. We’ve already been discussing selectivity, without naming it. Selectivity is a measure of how much the index narrows the search for values. There are two ways to express it:
- Average selectivity is derived from the cardinality: the number of distinct values divided by the total number of rows. So if there are, say, 10,000 distinct first names in the United States (a number I’m having a hard time finding data to support), the average selectivity of an index on the first name is 10,000/249,000,000 = 0.00004.
- The selectivity of a specific value is the number of rows with that value, divided by the total number of rows. The selectivity of “Baron Schwartz” is much better than “John Smith.”
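Both measures are easy to compute once you have the counts. A small sketch using the article’s figures (the 10,000 distinct names is the same assumed number as above):

```python
total_rows = 249_000_000        # 1990 US Census population, per the text
distinct_first_names = 10_000   # assumed cardinality of first names

# Average selectivity: cardinality divided by total rows.
avg_selectivity = distinct_first_names / total_rows
print(f"{avg_selectivity:.5f}")  # 0.00004

# Per-value selectivity: rows matching a value divided by total rows.
john_smiths = 47_000            # howmanyofme.com figure, per the text
print(f"{john_smiths / total_rows:.6f}")  # 0.000189
```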
A lower per-value selectivity is better: it means fewer rows to scan and filter. Oddly, though, we say an index is “highly” selective when its selectivity value is low. ¯\_(ツ)_/¯
When a database plans a query, it looks at its index statistics and estimates how many rows the query will match. Some databases keep simple statistics, tracking just an estimate of cardinality; others keep so-called “histogram” statistics that track the frequency of ranges of values. In the first case, the database computes an average selectivity to help plan the query; in the second, it may be able to get a more specific estimate. (Some databases also randomly sample the index during planning, to estimate the cardinality of specific values.)
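The difference between the two kinds of statistics can be sketched with toy numbers (nothing here models a real optimizer):

```python
from collections import Counter

rows = ["John"] * 47 + ["Baron"] + ["Mary"] * 2  # toy column of first names

total = len(rows)
cardinality = len(set(rows))

# Simple statistics: assume every value is equally common,
# so the estimate for any value is total / cardinality.
est_simple = total / cardinality

# "Histogram" statistics: track per-value (or per-range) frequencies,
# so skewed values like "John" get an accurate estimate.
histogram = Counter(rows)
est_histogram = histogram["John"]

print(round(est_simple, 1), est_histogram)  # 16.7 47
```

With skewed data like this, the cardinality-only estimate badly undercounts the common value, which is exactly the case histograms exist to handle.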
Depending on how selective the index is for the given values, and how sophisticated the database is, it might use the selectivity to choose between different query plans. For example, in some cases it might just scan the whole table rather than using the index! Accessing a table through an index adds some overhead, and if the index doesn’t narrow the search enough to repay that cost, a full scan can be quicker.
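You can watch a planner make this choice with SQLite’s EXPLAIN QUERY PLAN (a sketch with an invented table; other databases expose similar tools, such as EXPLAIN in MySQL and PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_name ON people (first, last)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(f"First{i}", f"Last{i}", "Springfield") for i in range(100)]
    + [("Baron", "Schwartz", "Charlottesville")],
)
conn.execute("ANALYZE")  # gather index statistics for the planner

# Equality on the leading index columns: the planner searches the index.
indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE first=? AND last=?",
    ("Baron", "Schwartz"),
).fetchall()
print(indexed[0][3])  # SEARCH ... USING INDEX idx_name ...

# A predicate the index can't help with: the planner scans the table.
scanned = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE city=?", ("Portland",)
).fetchall()
print(scanned[0][3])  # SCAN ...
```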
So now you know what index selectivity means in a database! If I can clarify anything about this topic, please mention me on Twitter at @xaprb and I’ll do my best to improve the article.