How to implement/architect a SQL database schema which can handle types/tables with many optional properties?


I am working on a trimmed-down database schema (in PostgreSQL) modeled on the old Freebase, just with less in it. There are about 100 tables so far, and that is before adding the dozens more I am considering for special situations, and it doesn’t feel quite right. Let me explain with a simplified example that reproduces the problem on a small scale. Just imagine it being a lot more complex, with a table having dozens of optional properties/relations/associations, and many different interconnected tables like it.

I am aware of NoSQL document databases like MongoDB and have used them extensively, and I have gotten my feet wet with graph databases like Neo4j. They are tools I would rather avoid because of the complexity of this side project, and because the tooling and deployment resources available for them are just not on par with something like PostgreSQL today.

So to illustrate the problem, imagine a "symbols" table, which has all the Unicode symbols, as well as tens of thousands of symbols outside the scope of Unicode (conscripts, Mayan script, other non-script symbols, etc.). The base table looks something like this:

table symbols {
  id
  unicode (optional)
  preview_image_url (optional)
  title
  description
}

Already we have a few optional properties, as some symbols don’t have a Unicode code point, and some don’t need preview images (all the Unicode ones can just render in the browser, etc.). But then let’s think about some other "types" of symbols we want to store structured information about.

First, we can think of "script" symbols, the ones used for writing systems. Cool, we can add an optional "script_name" property to our table, not too bad. But then, what kind of script symbol? There are right-to-left scripts, vertical scripts, logographic scripts, alphabet scripts, abjads and abugidas, etc. Some alphabetic scripts like the Latin script have symbols which have mirror images (like parentheses), or capital/lowercase pairs. Some scripts have combining characters with specific rules for which can combine with which. Some symbols are purely decorative, and some of those are geometric. So we try to account for all those optional features:

table symbols {
  id
  unicode (optional)
  preview_image_url (optional)
  title
  description
  is_logographic (optional)
  is_vertical (optional)
  is_rtl (optional)
  is_alphabet (optional)
  is_abjad (optional)
  is_abugida (optional)
  script_name (optional)
  mirror_image_symbol_id (optional)
  uppercase_symbol_id (optional)
  lowercase_symbol_id (optional)
  combining_class (optional)
}

Still, some might say that’s not too bad having all those optional properties, I don’t know.

Then you can continue and add more sub-sub-types:

  • triangle-like symbols (there are a few of these in unicode)
  • shaded-triangle-like symbols
  • empty-triangle-like symbols

Just imagine all the possible things you could try to search google for in relation to symbols.

  • symbols that look like "c"
    • © (copyright symbol)
    • 🄯 (copyleft symbol)
    • ℃ (symbol for degrees Celsius)
    • ¢ (symbol for cent in U.S. currency)
    • ₡ (symbol for the colón, currency of Costa Rica and El Salvador)
    • ₵ (symbol for cedi, currency of Ghana)
    • ₢ (symbol for cruzeiro, historical currency of Brazil)
    • ℄ (actually a "cl")
  • symbols with built-in combining marks like é.
  • 1-byte unicode glyphs
  • 2-byte unicode glyphs
  • 4-byte …

It starts becoming like this:

table symbols {
  id
  unicode (optional)
  preview_image_url (optional)
  title
  description
  is_logographic (optional)
  is_vertical (optional)
  is_rtl (optional)
  is_alphabet (optional)
  is_abjad (optional)
  is_abugida (optional)
  script_name (optional)
  mirror_image_symbol_id (optional)
  uppercase_symbol_id (optional)
  lowercase_symbol_id (optional)
  combining_class (optional)
  is_triangle_like (optional)
  is_shaded_triangle_like (optional)
  is_empty_triangle_like (optional)
  looks_like_c (optional)
  looks_like_d (optional)
  looks_like_l (optional)
  ...
  has_built_in_diacritic (optional)
  is_1_byte (optional)
  is_2_bytes (optional)
  ...
}

Soon we could end up with 50 or 100 optional fields. You can imagine this getting much more complex when you try to model "living organisms" and all their unique and various features! Thousands of optional features, and there is no clear OO class hierarchy to create subclasses from, it is more like a graph/web of interconnected combinations.

So my mind starts to go toward making things super abstract/generic, and creating a table called "facts", something like:

table facts {
  id
  object_type
  object_id
  property_name
  value_type
  value_id
}

That way you can create an object like "symbol a" and attach "facts" to it. For example, one fact on the symbol says that the property name is script_name and the value type is a strings table, which maps an ID to a string. Another fact might be:

// facts table
id: 123
object_type: 'symbol'
object_id: 12321
property_name: 'is_1_byte'
value_type: 'boolean'
value_id: 444

// boolean table
id: 444
value: true

// symbol table
id: 12321
unicode: 'a'

Going down this road, you end up with just a handful of tables (basically the "facts" table, and perhaps 1 or 2 other meta tables) instead of 100. But then things become a lot harder to think about and visualize, and queries get more complex.
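To make the query cost concrete, here is a minimal sketch of filtering on two properties in a "facts" layout. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and it simplifies the design above by storing the value inline as text instead of going through value_type/value_id; that simplification and the sample rows are assumptions for illustration only.

```python
import sqlite3

# SQLite (stdlib) stands in for PostgreSQL here. The value is stored inline
# as text, a simplification of the value_type/value_id indirection above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, unicode TEXT);
    CREATE TABLE facts (
        id INTEGER PRIMARY KEY,
        object_type TEXT NOT NULL,
        object_id INTEGER NOT NULL,
        property_name TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO symbols VALUES (?, ?)", [(12321, "a"), (455, "б")])
conn.executemany(
    "INSERT INTO facts (object_type, object_id, property_name, value) "
    "VALUES (?, ?, ?, ?)",
    [
        ("symbol", 12321, "is_1_byte", "true"),
        ("symbol", 12321, "is_alphabet_symbol", "true"),
        ("symbol", 455, "is_alphabet_symbol", "true"),
    ],
)

# Every property in the filter needs its own join against the facts table.
rows = conn.execute("""
    SELECT s.unicode
    FROM symbols s
    JOIN facts f1 ON f1.object_type = 'symbol' AND f1.object_id = s.id
                 AND f1.property_name = 'is_1_byte' AND f1.value = 'true'
    JOIN facts f2 ON f2.object_type = 'symbol' AND f2.object_id = s.id
                 AND f2.property_name = 'is_alphabet_symbol' AND f2.value = 'true'
    ORDER BY s.id
""").fetchall()
one_byte_alphabet_symbols = [row[0] for row in rows]
print(one_byte_alphabet_symbols)
```

With a flat table the same filter would be a two-condition WHERE clause; here each extra property adds another join, and the database cannot enforce a type for the stored value.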

But I can’t see a way out of this problem. What I am leaning toward is having the DB be this abstract sort of "facts" table, but then making it appear more object-oriented in the application layer, where, just like in JavaScript, an object either has a property or it does not. I would like to "harden" this up a little and give each combination/variation a different type name, but that doesn’t quite work out. For example:

{
  type: 'alphabet_symbol',
  value: 'e'
},
{
  type: 'geometric_symbol',
  value: '▲'
}

And then build a tree of types:

symbol
  alphabet_symbol
    mirror_image_alphabet_symbol
      mirror_image_alphabet_symbol_with_capital_lowercase
      capital_lowercase_alphabet_symbol
  abjad_symbol
  geometric_symbol
    triangle_geometric_symbol
      shaded_triangle_geometric_symbol

But that breaks down:

symbol
  alphabet_symbol
    (cyrillic б)
  look_like_6_symbol
    (cyrillic б)

So then it’s like, maybe just add tags to the central object.

б
  id: 455

tags
  - name: 'is_alphabet_symbol'
    symbol_id: 455
  - name: 'looks_like_6_symbol'
    symbol_id: 455

But at that point it boils back down to the generic/abstract "facts" idea I originally shared when you start to try and handle more cases.

// facts table
id: 124
object_type: 'symbol'
object_id: 455
property_name: 'is_alphabet_symbol'
value_type: 'boolean'
value_id: 444

// boolean table
id: 444
value: true

// symbol table
id: 455
unicode: 'б'

So I am wondering: what is the recommended approach to handling the dynamic nature and variation of the "types" outlined briefly here? How do you balance the desire to capture as much structured data as possible without making a big, flat table full of optional columns (which seems to break down after a few dozen optional columns, not to mention hundreds or thousands in the case of organism modeling)?


Method 1

It’s a pretty common idea to abstract the database layer, but in doing so you lose a lot of the relational aspect of the database system (which, surprise surprise, is not a good idea in a relational database management system). This is known as the EAV (entity-attribute-value) anti-pattern and should generally be avoided for a few reasons, including the increased query complexity you’ve already noticed and performance problems.

It’s not the end of the world to have nullable attributes on your data object, even if that results in many columns. It’s not uncommon to see this in the databases of ERP systems, which typically have thousands to hundreds of thousands of tables, some with a few hundred to thousands of columns per table. But it’s not the best design either.

As David stated, the better solution is to normalize your data objects. A good way to think about it is to keep only the core attributes that are common to every (or most) records the object represents, and to refactor everything else out. Group the other related attributes into their own tables. That will indeed likely lead to more tables, but as previously mentioned, there is nothing functionally wrong with many tables. Of course it’s more work to maintain, but that is the tradeoff for a properly designed relational schema, and in return you get better performance (which can otherwise take a lot of work to achieve), clearer relationships between entities, and better data management and accuracy.
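A minimal sketch of that refactoring, using Python's built-in sqlite3 as a stand-in for PostgreSQL: the core symbols table keeps only the common attributes, and one group of optional attributes moves to a side table with at most one row per symbol. The table name symbol_script_details and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# SQLite (stdlib) stands in for PostgreSQL. symbol_script_details is a
# hypothetical name for one group of related optional attributes.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE symbols (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        description TEXT NOT NULL
    );
    -- Only symbols that belong to a script get a row here, so the columns
    -- that matter for script symbols can be NOT NULL.
    CREATE TABLE symbol_script_details (
        symbol_id INTEGER PRIMARY KEY REFERENCES symbols(id),
        script_name TEXT NOT NULL,
        combining_class INTEGER
    );
""")
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?)",
    [
        (1, "Latin small letter a", "First letter of the Latin alphabet"),
        (2, "Black up-pointing triangle", "A geometric shape"),
    ],
)
conn.execute("INSERT INTO symbol_script_details VALUES (1, 'Latin', NULL)")

# A LEFT JOIN brings the optional group back; symbols without it yield NULL.
rows = conn.execute("""
    SELECT s.title, d.script_name
    FROM symbols s
    LEFT JOIN symbol_script_details d ON d.symbol_id = s.id
    ORDER BY s.id
""").fetchall()
print(rows)
```

The design choice here is that the absence of a row, not a nullable column, says "this group of attributes does not apply to this symbol".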

Method 2

In general, you should resist adding so many attributes to the model, and normalize the model instead. For each attribute, ask 1) is it really necessary to add it to your data model, and 2) does it really belong as an attribute of this entity, or should you introduce a new sub-type or referenced entity to the model?

E.g. is_rtl should probably be an attribute of a related Alphabet entity, looks_like_d should be modeled in a separate linking table, and is_shaded_triangle_like is probably not worth adding to the data model.

So the model evolves from a single table into a proper model, e.g.:

Symbol
  AlphabetSymbol
Alphabet
SymbolLooksLikeSymbol
Shape
SymbolLooksLikeShape
SymbolUpperCaseOfSymbol
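Part of that model could be sketched as follows, with Python's built-in sqlite3 standing in for PostgreSQL. The snake_case table names, column names, and sample rows are assumptions made for this example; only the entity names come from the list above.

```python
import sqlite3

# SQLite (stdlib) stands in for PostgreSQL, where is_rtl would be a BOOLEAN.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE alphabets (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        is_rtl INTEGER NOT NULL          -- stored once per alphabet, not per symbol
    );
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, unicode TEXT);
    CREATE TABLE alphabet_symbols (      -- sub-type: symbols that belong to an alphabet
        symbol_id INTEGER PRIMARY KEY REFERENCES symbols(id),
        alphabet_id INTEGER NOT NULL REFERENCES alphabets(id)
    );
    CREATE TABLE symbol_looks_like_symbol (   -- replaces the looks_like_* columns
        symbol_id INTEGER NOT NULL REFERENCES symbols(id),
        looks_like_symbol_id INTEGER NOT NULL REFERENCES symbols(id),
        PRIMARY KEY (symbol_id, looks_like_symbol_id)
    );
""")
conn.executemany("INSERT INTO alphabets VALUES (?, ?, ?)",
                 [(1, "Latin", 0), (2, "Hebrew", 1)])
conn.executemany("INSERT INTO symbols VALUES (?, ?)",
                 [(1, "c"), (2, "©"), (3, "א"), (4, "¢")])
conn.executemany("INSERT INTO alphabet_symbols VALUES (?, ?)", [(1, 1), (3, 2)])
conn.executemany("INSERT INTO symbol_looks_like_symbol VALUES (?, ?)",
                 [(2, 1), (4, 1)])

# Right-to-left symbols are found through the alphabet they belong to.
rtl_symbols = [row[0] for row in conn.execute("""
    SELECT s.unicode
    FROM symbols s
    JOIN alphabet_symbols m ON m.symbol_id = s.id
    JOIN alphabets a ON a.id = m.alphabet_id
    WHERE a.is_rtl = 1
    ORDER BY s.id
""")]

# "Looks like c" is a lookup in the linking table, not a dedicated column.
c_lookalikes = [row[0] for row in conn.execute("""
    SELECT s.unicode
    FROM symbol_looks_like_symbol l
    JOIN symbols s ON s.id = l.symbol_id
    JOIN symbols target ON target.id = l.looks_like_symbol_id
    WHERE target.unicode = 'c'
    ORDER BY s.id
""")]
print(rtl_symbols, c_lookalikes)
```

Adding a new look-alike target (say "d" or "6") then means inserting rows, not altering the schema.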


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
