All we need is an easy explanation of the problem, so here it is.
I want to create a database using any of the possible RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search of some other objects. So it’s a NNS in the 150-dimensional space.
I already tried to use some obvious methods like L1 or L2 distances but of course it takes a lot of time for tables with many rows. Also I tried to look at the KD-tree (note I did not test it) and PG-Strom but they are not a good solution for data with a many dimensions.
Can I somehow improve the speed of the described search using math methods (like KD-tree) or tech methods (like PG-Strom)?
I will try to use any RDBMS which allow to improve speed of the NNS. But MySQL and PostgreSQL are the most appropriate DBMS for me.
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
PostgreSQL 9.6 using
First install the cube extension
CREATE EXTENSION cube;
Now we will create some n-dimensional space with 100,000 points in 50 dimensions. In addition we’ll add a GIST index.
CREATE TEMP TABLE space_nd AS SELECT i, cube(array_agg(random()::float)) AS c FROM generate_series(1,1e5) AS i CROSS JOIN LATERAL generate_series(1,50) AS x GROUP BY i; CREATE INDEX ON space_nd USING gist ( c ); ANALYZE space_nd;
Now we will generate a single point and use the
<-> operater to find the nearest point using Eucledian distance.
WITH points AS ( SELECT cube(array_agg(random()::float)) AS c FROM generate_series(1,50) AS x ) SELECT i, pg_typeof(space_nd.c), pg_typeof(points.c), cube_distance(space_nd.c, points.c) FROM space_nd CROSS JOIN points ORDER BY space_nd.c <-> points.c LIMIT 5;
PostgreSQL 9.6+ supports other distance operators over
cube. All of which can use the GIST index we created. Namely,
a <-> b float8 Euclidean distance between a and b. a <#> b float8 Taxicab (L-1 metric) distance between a and b. a <=> b float8 Chebyshev (L-inf metric) distance between a and b.
That said there is one caveat,
To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.
You ask for 150 dimensions. That may present a minor complication.
Consider performing dimension reduction first (eg. Principle Component Analysis).
Then your are doing NN in small number of dimensions with higher performance.
You can use Pl/R to perform PCA inside postgres if needed.
Take a look at https://github.com/a-mma/AquilaDB it is a vector database to store Feature Vectors along with JSON Metadata. Keep it along with your RDBMS and do use metadata to maintain cross reference between data.
Have a look at FLANN and OpenCV.
Unfortunately I am not aware of an integration of that into a RDBMS system. But there is for example integration of chemical structure information with Posgres. So in principle this can be done.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂