Ever wondered how you should arrive at the indexing strategy to be used in your warehouse? What are the parameters to be considered while deciding on the fields to be indexed. Well,one parameter which I feel is important to consider is what I call as the 'Selectivity Factor'. The following is the definition of Selectivity factor( this is just my way of definition so no way of corroborating this)
Selectivity factor for a field - Percentage of rows selected from the table after applying the filter on the field.
You have to keep in mind that this factor will be different for different values of the same field. But again, you will have an estimate of the number of rows in the table for each of these values, so you can take an average of the Selectivity factor.The Average selectivity factor for a field is always the same irrespective of the SQL in which it is used.
Now if you have many SQLs having the same kind of filter ( a date field for example), then it makes sense to have an index on that field, right? Not always! What if the filter qualifies 90% of the records most of the time , then there will be no point in having that filter.In fact it will be an overhead, as the Database will first scan the index list then hit the actual row ids. If suppose you use a B*Tree index, then you might end up doing a very high number of logical I/Os, which will translate to reading the same set of blocks multiple times. But if you were reading only 10% of the table, then the number of logical I/Os will come down drastically. For example, suppose you have a query like this:
Select * from list_of_politicians where sex = 'M'
In this case the field "sex" has a cardinality of 2 -'M','F' (assuming there is no ambiguity about this data!). Assume that the table has 1,00,000 records out of which 90000 are with 'M' and 10000 with 'F'. Suppose the block size is 8kB and the row size is 80 bytes(which means 100rows/block ) then approximately 1000 block are used for the table.In this case, the query is returning 90% of the table. If the B*Tree index is used in this case then, a logical I/O is done for each of the 90000 records, which means on an average each block gets read 90 times!!! this will be a huge performance hit.Whereas if it had done a Full table scan, the execution would have been much faster.
That is where I feel , "Selectivity factor" makes much more sense as a quantifiable parameter to use for taking that decision. In the above scenario, the selectivity factor is 90% in case of 'M' and 10% in case of 'F' which means the Average Selectivity factor is 50%. This is a high number and thus not suited for B*Tree indexes. On the other hand ,a Bitmap index will be very useful in this scenario. In a bitmap index, each index entry will store references to many rows.The bitmap structure is something like this:
Row 1 2 3 4 5 6 7 8
M 0 1 1 1 1 1 1 1
F 1 0 0 0 0 0 0 0
As you can see there are only two entries here in the index.The first entry shows that the value 'M' is appearing in rows 2,3,4,5,6,7,8 and the value 'F' in row 1. So if I run the query,
Select * from list_of_politicians where sex = 'M'
the db can easily get the rowids of all the rows which have values 'M' for sex by reading only a few index entries( there will be more than one entry per value of a field based on the number of rows and the storage size). This will perform much much faster than in case of a B*Tree index.
So to sum it up,
When the selectivity factor(or the average selectivity factor) for the field is low ( say less than 10%) and a small portion of the table is selected, then it is helpful to have a B*Tree index on the field.
When the selectivity factor(or the average selectivity factor) for the field is high (say greater than 25 %) and you are selecting a small to medium portion of the table then it useful to have a Bitmap index on this field.
On the other hand , if the selectivity factor is high and you are reading a large portion of the table then it is better to leave the field alone.
Now all this applies to a Rule based optimizer. If you are using a Cost based optimizer, then just analyzing the table before running the query will tell the optimizer whether to use the index or not. So in the above case, where you are selecting 90% of the records, if the table is analyzed before running the query, then the CBO will decide to do a Full table Scan instead of using the B*tree index.
So next time you are planning your indexes, make sure to study the data, know the cardinality of the fields and also the Average Selectivity factor of the fields.