I am designing my first database, and I find myself frustrated by the choice between storing an integer or a string for each instance of a categorical variable.
My understanding is that if I have a table containing cities that I want to make a child of a table of countries, the most performant way to do that is to have the PK of the countries table as a FK in the table of cities. However, for ease of use and debugging, it’s nice to always have the string name associated with the country PK. Every solution I have considered either is not recommended or seems overly complex.
I’d like opinions on the merits of these approaches (or to hear about new ones), and also to understand whether it has to be this way or whether databases simply are this way because of tradition.
Use a string as a PK for countries. Then I will have a human-readable FK for it in any child tables. Obviously less performant than using integers, but I suspect it may be the least worst way to have the convenience I desire.
Create a view using application logic that joins the string name of each country to the states table.
- I don’t love this because if the application logic breaks, the tables become less readable. Also I would expect large join operations to have an even worse performance penalty than string PK/FKs.
Create a separate table to connect numeric IDs with the appropriate string ID. I’m not sure if it would be better to have a table coding each type of relation, or one big table with one big pool of IDs that covers all integer key-string value relations. I could then use application logic to look up the appropriate strings and fill the appropriate PK into the child table when its string name is given by a user.
- I feel like this might be pretty resource intensive too, as there would have to be a lookup every time a new row was added to the child. It also means that I would still have to create the views I want.
Use an enum data type. Instinctively, this would be my go-to approach, as it seems the ideal balance between natural and synthetic keys: use integer IDs and give the IDs a string label so that the string itself need not be repeated.
- Unfortunately my research has found that this is not recommended. One reason for that is that categories cannot be deleted easily. I’m not sure if that is dealbreaker for me, but I also wonder why DBMSs are designed this way. Aren’t categorical variables commonly used enough to add convenience features for them?
While ypercube makes a good and logical point for the specific example of Countries, I would otherwise avoid using string-based data types because of the potential unexpected implications that can arise from certain assumptions different database systems make about strings. For example, in Microsoft SQL Server, the optimizer generally assumes VARCHAR columns are half full and will generate execution plans that request memory based on that assumption. This could result in over (or even under) allocation of memory resources to serve a single query. I imagine there are other interesting assumptions that other database systems also make around string-based data types, for better or for worse.
But even more important than performance is data accuracy. The number one job of a table is to store data, and ideally store it accurately. The number one job of a primary key is to establish uniqueness, and ideally it should be immutable. Surrogate keys have the benefit of ensuring all of these things remain true even when the human-readable value can change. This also follows another good principle called one-field one-purpose, because the surrogate key’s meaning is completely decoupled from the business object’s.
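To make the surrogate-key point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names (countries, cities, country_id) and the sample rows are my own assumptions for illustration, not something from your schema. It shows that renaming a country touches exactly one row and leaves the child rows alone.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,   -- surrogate key, carries no meaning
        name       TEXT NOT NULL UNIQUE   -- human-readable label, free to change
    );
    CREATE TABLE cities (
        city_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(country_id)
    );
    INSERT INTO countries (country_id, name) VALUES (1, 'Swaziland');
    INSERT INTO cities (name, country_id) VALUES ('Mbabane', 1), ('Manzini', 1);
""")

# The rename is a single-row update in the parent table.
renamed = conn.execute(
    "UPDATE countries SET name = 'Eswatini' WHERE country_id = 1"
).rowcount

# The child rows still point at the same, unchanged key.
city_keys = [row[0] for row in conn.execute(
    "SELECT country_id FROM cities ORDER BY city_id"
)]
```

The UNIQUE constraint on the name keeps the natural value unique without making it the thing other tables depend on.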
Going back to your Countries example, it’s unusual for the name of a Country to change, but it’s not impossible either. A few countries have changed names in the last 50 years. Even using the ISO code is not a 100% guarantee that it’ll never change for a given Country, because there is some meaning in how those codes are generated (albeit more removed from the business object than the human-readable value of the business object itself).
So if the natural key value is used, and is liable to change, the day it does change, now you risk data accuracy because not only do you need to ensure the Countries table is properly updated, you must do the same for every table that references Countries in a foreign key.
There’s additional performance overhead with updating every record referencing the old value as well, of course, as opposed to just updating it in one place when a surrogate key is used as the primary key. But the bigger concern (going back to the primary goal of a table) is data accuracy, in my opinion.
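For contrast, here is a rough sketch of the same rename when the country name itself is the primary key, again with sqlite3 and made-up table names (assumptions, not your schema). With default foreign key behaviour the parent update is rejected outright, so every referencing row has to be rewritten as part of the change. Declaring the foreign key with ON UPDATE CASCADE would automate that rewrite, but the database still has to touch every child row.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE countries (name TEXT PRIMARY KEY);   -- natural key
    CREATE TABLE cities (
        city_id      INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        country_name TEXT NOT NULL REFERENCES countries(name)
    );
    INSERT INTO countries VALUES ('Swaziland');
    INSERT INTO cities (name, country_name)
        VALUES ('Mbabane', 'Swaziland'), ('Manzini', 'Swaziland');
""")

# A plain rename of the parent key fails while children still reference it.
try:
    conn.execute("UPDATE countries SET name = 'Eswatini' WHERE name = 'Swaziland'")
    rename_blocked = False
except sqlite3.IntegrityError:
    rename_blocked = True

# The manual fix rewrites every child row inside one transaction.
with conn:
    conn.execute("INSERT INTO countries VALUES ('Eswatini')")
    rewritten = conn.execute(
        "UPDATE cities SET country_name = 'Eswatini' WHERE country_name = 'Swaziland'"
    ).rowcount
    conn.execute("DELETE FROM countries WHERE name = 'Swaziland'")
```

Here only two child rows are rewritten, but the count grows with every table and every row that references the old value.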
Views are great tools for the job of unifying, transforming, and presenting the data to the application layer, and even help with data maintenance later on, in some cases, such as when your table structure needs to change. Since a view can act as a layer between the application and the database tables, there’s less risk for the application when changing the structure of those tables. There’s nothing inherently wrong with using them from a performance perspective, and JOIN performance (by the surrogate keys) should not be an issue with a properly architected and indexed database.
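As a small sketch of what such a view could look like, assuming the same hypothetical countries and cities tables (the view name cities_readable and the index name are also my inventions), defined in the database itself rather than in application logic:

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE cities (
        city_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(country_id)
    );
    -- Index the foreign key column so lookups by country stay cheap.
    CREATE INDEX idx_cities_country_id ON cities(country_id);

    -- The view gives readers the label without storing it in cities.
    CREATE VIEW cities_readable AS
        SELECT ci.city_id, ci.name AS city, co.name AS country
        FROM cities AS ci
        JOIN countries AS co ON co.country_id = ci.country_id;

    INSERT INTO countries VALUES (1, 'France'), (2, 'Japan');
    INSERT INTO cities (name, country_id) VALUES ('Lyon', 1), ('Osaka', 2);
""")

rows = conn.execute(
    "SELECT city, country FROM cities_readable ORDER BY city"
).fetchall()
```

Because the view lives in the database, it keeps working for ad hoc debugging queries even if the application is down.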
When is it better to do JOINs with a special lookup table vs have the human-readable columns in the parent table and have them inherit through the hierarchy?
It depends. For lack of a more articulate way to describe it, generally it makes sense to refactor the human-readable value into a separate table when it’s currently being repeated in the main table it exists in. This is so there’s a single place to define that value uniquely, where it can easily and accurately be maintained. When done properly, this loosely follows the principles of normalization.
If by "special lookup table" you mean a single table (e.g. the enums table your post mentions) for multiple kinds of objects, I wouldn’t recommend doing that. It may be easier to maintain than multiple separate object tables, but you lose some of the relational properties of a relational database system.
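One example of a relational property you give up, sketched with sqlite3 and invented names (enums, cities_shared, and so on are assumptions for illustration): a foreign key into a shared lookup table can only check that the ID exists, not that it is the right kind of thing, whereas a dedicated table rejects the bad reference.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- One shared table for every kind of label.
    CREATE TABLE enums (
        enum_id INTEGER PRIMARY KEY,
        kind    TEXT NOT NULL,
        label   TEXT NOT NULL
    );
    INSERT INTO enums VALUES (1, 'country', 'France'), (2, 'color', 'Red');
    CREATE TABLE cities_shared (
        city_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES enums(enum_id)
    );

    -- A dedicated table per kind of object.
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    INSERT INTO countries VALUES (1, 'France');
    CREATE TABLE cities (
        city_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(country_id)
    );
""")

# Shared table: ID 2 is a color, yet the FK accepts it as a "country".
conn.execute("INSERT INTO cities_shared (name, country_id) VALUES ('Lyon', 2)")
shared_rows = conn.execute("SELECT COUNT(*) FROM cities_shared").fetchone()[0]

# Dedicated table: ID 2 is not a country, so the FK rejects the insert.
try:
    conn.execute("INSERT INTO cities (name, country_id) VALUES ('Lyon', 2)")
    dedicated_rejects = False
except sqlite3.IntegrityError:
    dedicated_rejects = True
```

There are ways to work around this in a shared table, but they add complexity that separate tables avoid.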
why surrogate keys can’t have mutable string labels that propagate through all uses as an FK?
This goes back to data accuracy, primarily speaking. Nothing stops you from doing it; it’s just not best practice because it adds risk to data accuracy and makes data management harder and less performant when you need to update the value. You run the risk of lock escalation if it’s a common value in the foreign keyed table, causing potentially longer wait times and blocking for read queries against that table.
why not deal with a changing country name by creating a new row for the new state + a column for years of existence?
Some people implement this design but more so because their business rules and use cases depend on historical data tracking. But for a regular transactional database with standard use cases, it inflates your data and still doesn’t solve the aforementioned foreign key references where you’d have to update them with the changes too or inflate those tables as well. Even if I had the use cases to maintain historical data, I’d personally store the transactional history in a separate historical table from the active records.
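If you did need the history, a separate history table could look roughly like the sketch below. The table country_name_history, its columns, and the helper rename_country are hypothetical names of mine, and this is only one of several reasonable ways to do it (a trigger would be another). The active row keeps its surrogate key, so nothing that references it has to change.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Old labels live here, apart from the active records.
    CREATE TABLE country_name_history (
        country_id INTEGER NOT NULL REFERENCES countries(country_id),
        old_name   TEXT NOT NULL,
        retired_in INTEGER NOT NULL
    );
    INSERT INTO countries VALUES (1, 'Swaziland');
""")


def rename_country(conn, country_id, new_name, year):
    """Archive the current name, then update the active row, atomically."""
    with conn:  # commits on success, rolls back on error
        (old_name,) = conn.execute(
            "SELECT name FROM countries WHERE country_id = ?", (country_id,)
        ).fetchone()
        conn.execute(
            "INSERT INTO country_name_history VALUES (?, ?, ?)",
            (country_id, old_name, year),
        )
        conn.execute(
            "UPDATE countries SET name = ? WHERE country_id = ?",
            (new_name, country_id),
        )


rename_country(conn, 1, "Eswatini", 2018)
active = conn.execute("SELECT country_id, name FROM countries").fetchall()
history = conn.execute(
    "SELECT old_name, retired_in FROM country_name_history"
).fetchall()
```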