All we need is an easy explanation of the problem, so here it is.
I am working with the harmonized system and try to catalog this in a MySQL database. This works for example like this:
01 LIVE ANIMALS 0101 horses, asses, mules and hinnies, live 0102 bovine animals, live 0103 swine, live 010310 Purebred breeding animals 010391 Weighing less than 50 (11023 lb) each 010392 Weighing 50 (11023 lb) or more each 0104 sheep and goats, live 0105 chickens, ducks, geese, turkeys, and guineas, live 0106 animals, live, nesoi Nesoi - not elsewhere specified of indicated.
As you can see, there is a clear parent-child relationship. This relationship can always be traced back by the numbering that is used. From this thread I learned that I best use a text data field for the codes (01, 0101, etc). My question is: should I somewhere log the parent-child relationship, or would this not be necessary? After all, I can do a:
SELECT * FROM accounts WHERE code LIKE '0103%'
And get all the subaccounts just fine. And I feel that a parent-child table could make things more complicated. But I’d like to hear what other people feel make sense.
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
What you describe is a variant of “Materialized Path”, see for example https://communities.bmc.com/docs/DOC-9902 for an example. I say variant because you have no separator token, but are instead using a fixed number of positions for each different level.
In Materialized Path you usually don’t have an explicit parent/child relationship represented, instead it is encoded into the “path” string. To determine sub-categories, use like as in your example. For ancestor-categories you need some procedural part that splits the path-string. Direct parent is the first ancestor. Direct children can be located by using a combination of like and length() function. Root and leaf predicates are easy to define as well.
As a starting point, why would you like to add a parent-child relationship (if it is not for the operations described above)? If the answer is referential integrity it will be a bit backward since you already have the domain at hand (your existing table). Adding a separate relation (table) to represent the parent/child relationship would be something like:
create table ... ( category varchar(...) not null , constraint ... foreign key ... references ... , parent_category varchar(...) not null , constraint ... foreign key ... references ... )
You could have a trigger populate this table after insert of a new “category”
But it does not really protect you from anything, it would act more like a report of your encoded parent child relation.
The other option is to add a nullable attribute to your existing table, your “top-categories” would have null as a parent, and the others would have
substr(code, 1, length(code)-2) as a parent. This would be some kind of mix between “adjacency list” and “materialized path”. You would still have to protect against anomalies like (‘01020304’, NULL) and (‘010203′,’0202’). It is possible to add triggers that validates this.
If you want to keep your current table as is it is probably easiest to add validation code, either in triggers or in your application that makes sure that there exists a “super-category”.
If you are willing to split the code in parts you could have a look at “Nested Sets”. It is described in the link I gave above, or as an additional transitive closure relation, see for example http://karwin.blogspot.se/2009/04/sql-antipatterns-strike-back-slides.html. A slight variation of the same method is described at: http://dustbite.se/tree/ . Both “Nested Set” and “Transitive Closure” will add a significant overhead to what is probably the most common operation that you do
Personally I would stick with what you have and add validation triggers that checks that the codes are consistent during modifications of the “tree”
As Oracle XE is mentioned as an alternative DBMS in @Vérace post, I just point out that other commersial vendors is offering free versions as well. See Microsoft SQL Server 2014 Express (http://msdn.microsoft.com/en-us/evalcenter/dn434042.aspx) and DB2 Express-C ( http://www-01.ibm.com/software/data/db2/express-c/download.html). DB2 does not have a space restriction like Oracle, I’m not familiar with the restrictions of SQL-server.
There are many methods of representing trees in SQL. 4 are discussed by Bill Karwin here (and in his book which I recommend). This paper is also interesting (MySQL specific). See here (answer by Eldrith Conundrum) for a comparison of 4 different methods. Joe Celko recommends a further one, the directed graph method – I suggest you get your hands on a copy of his book “SQL for Smarties” or the subtome “Trees and Hierarchies in SQL”. You could also Google the work of Vadim Tropasko.
One of the key messages to take from much of this work is that the solution is very much dependent on your own usage patterns – some solutions are more suited to one pattern or another.
Basically, your question is very broad – some of the methods are better than others for the particular issue that interests you (subtrees), but others are better at obtaining the same level (think members of the same generation in a pedigree), some better for moving subtrees (think shifting a department with all of its staff from one line of business to another)… this list goes on. Unfortunately, there is no substitutes for trudging through the available work done and looking at how it could affect your needs.
p.s. just a thought. MySQL is probably the weakest server in this category – take a look at ltrees in PostgreSQL here (also see here) – it’s the closest I came to your issue. Also, if your database isn’t huge, you might be able to consider Oracle XE – up to 4GB of data.[EDIT in the light of @Lennart’s post]
SQL Server Express feature comparison.
IBM DB2 Express feature comparison.
Oracle XE feature comparison.
Seems like that IBM is the most generous. Interesting.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂