Conflict resolution for hash-backed unique constraint of large text bodies

First, an explanation of the problem.

So we’ve just run into

ERROR: index row size 2736 exceeds maximum 2712 for index "foo__bar__un"
  Hint: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

(foo is the table, bar the column, and un stands for "unique")

Consider a function index of an MD5 hash of the value

Alright, let’s.
This is simple enough to do:

ALTER TABLE foo DROP CONSTRAINT foo__bar__un;
CREATE UNIQUE INDEX foo__bar__un ON foo(md5(bar));

So far, so good, this is the solution both proposed in the error hint and found online in numerous places.
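To see the behavior, here is a minimal sketch; the actual table definition isn't given in the question, so a single text column is assumed:

```sql
-- hypothetical minimal setup, for illustration only
CREATE TABLE foo (bar text);
CREATE UNIQUE INDEX foo__bar__un ON foo (md5(bar));

INSERT INTO foo (bar) VALUES ('first large text body');   -- succeeds
INSERT INTO foo (bar) VALUES ('first large text body');
-- ERROR: duplicate key value violates unique constraint "foo__bar__un"

-- note: only queries that compare md5(bar) can use this index
SELECT * FROM foo WHERE md5(bar) = md5('first large text body');
```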

What I haven’t found yet, though, is how to deal with collisions.

In my case, I have a database with results for biochemical experiments. I can’t just go to the scientists and say "can you please change your result a tiny bit so we can actually store it in our database? yes, I understand this may cost a couple of your customers’ lives and you personally a potentially prohibitive amount of money, but I can’t allow you to store it otherwise".

(Realistically, we’d just drop the constraint way before that, but you get my point.)

I do understand that collisions are almost negligibly rare, but I’d rather know I can deal with them if against all odds one should happen. And given that this isn’t a niche issue only we are having, I’m sure I’m not the only one to have ever wondered about this.

So what is the way to deal with collisions, in this case? I’d rather not have to replace the unique index with a DIY trigger-based workaround, if possible.

Can I somehow shoehorn in an additional "equals" check to be executed for entries with the same hash, for example, and only throw an exception if the values actually are equal?

How to solve:

Two approaches follow; the first is the recommended one.

Method 1

I don’t know why the results of a biochemical experiment would need to be unique in the first place, but you can use an exclusion constraint that tests for equality. It will automatically resolve hash collisions by doing a character-by-character comparison.

ALTER TABLE foo ADD CONSTRAINT foo__bar__un EXCLUDE USING hash (bar WITH =);

But note that before PostgreSQL 10, hash indexes were not WAL-logged (and therefore not crash-safe), so you should upgrade to something newer than 9.6.
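A minimal sketch of the exclusion constraint in action (the table definition is assumed, since the question doesn't give one):

```sql
-- hypothetical minimal setup, for illustration only
CREATE TABLE foo (bar text);
ALTER TABLE foo ADD CONSTRAINT foo__bar__un EXCLUDE USING hash (bar WITH =);

INSERT INTO foo (bar) VALUES ('experiment result');   -- succeeds
INSERT INTO foo (bar) VALUES ('experiment result');
-- ERROR: conflicting key value violates exclusion constraint "foo__bar__un"
```

Unlike the md5() unique index, the constraint compares the actual column values, so two different texts that happen to land in the same hash bucket do not conflict.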

Method 2

If you take the square approximation for the birthday problem with 2^128 days in a year (that is the number of different MD5 hashes), and you want to know from how many table rows on the collision probability exceeds 0.000000001, you end up with

SELECT sqrt(2 * 2 ^ 128 * 1E-9);
      sqrt       
═════════════════
 824963474247119
(1 row)

So even if you had almost a quadrillion rows, the risk of a collision is so small that I would be more worried about a plane colliding with the data center.
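The same approximation, n ≈ sqrt(2 · d · p), can be reused for other probabilities; for example, the number of rows you would need before a collision becomes more likely than not (p = 0.5):

```sql
SELECT sqrt(2 * 2 ^ 128 * 0.5);
-- ≈ 1.8 × 10^19 rows (that is, 2^64)
```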

Note: Method 1 is the tested, recommended solution.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
