Quick nearest neighbor search in 150-dimensional space

First, here is a short explanation of the problem.

I want to create a database using any suitable RDBMS. It will have a table with approximately 150 columns, and the objective is to perform nearest neighbor search (NNS) against other objects, i.e. NNS in 150-dimensional space.

I already tried obvious approaches such as L1 and L2 distances, but of course they take a long time on tables with many rows. I also looked at KD-trees (note that I did not test them) and PG-Strom, but neither is a good solution for data with many dimensions.

Can I somehow improve the speed of this search using mathematical methods (like a KD-tree) or technical methods (like PG-Strom)?

I am willing to use any RDBMS that allows the NNS to be sped up, but MySQL and PostgreSQL are the most suitable for me.

How to solve:

Below are several ways to approach this. We recommend starting with Method 1, since it is the approach we have actually tested.

Method 1

PostgreSQL 9.6 using cube

First, install the cube extension:

CREATE EXTENSION cube;

Now we will create an n-dimensional space with 100,000 points in 50 dimensions (50 rather than 150 because of the dimension limit discussed below), and add a GiST index on it.

CREATE TEMP TABLE space_nd
AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50)
    AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;

Now we will generate a single point and use the <-> operator to find the nearest points by Euclidean distance.

WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50)
    AS x
)
SELECT i,
  pg_typeof(space_nd.c),
  pg_typeof(points.c),
  cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <-> points.c
LIMIT 5;
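
If you want to confirm that the GiST index is actually being used for the ordering, EXPLAIN the lookup. Here is a minimal sketch, passing the random query point as a scalar subquery; if the plan shows a sequential scan instead of an index scan, try supplying the query point as a constant cube literal instead:

EXPLAIN ANALYZE
SELECT i
FROM space_nd
ORDER BY c <-> (SELECT cube(array_agg(random()::float))
                FROM generate_series(1,50))
LIMIT 5;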

PostgreSQL 9.6+ supports other distance operators over cube, all of which can use the GiST index we created. Namely,

a <-> b (returns float8): Euclidean distance between a and b.
a <#> b (returns float8): Taxicab (L-1 metric) distance between a and b.
a <=> b (returns float8): Chebyshev (L-inf metric) distance between a and b.
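
For example, the same lookup under the taxicab (L1) metric only needs a different operator in the ORDER BY. This is just a sketch against the space_nd table built above:

WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50) AS x
)
SELECT i,
  space_nd.c <#> points.c AS taxicab_distance
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <#> points.c
LIMIT 5;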

That said, there is one caveat:

To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.

You ask for 150 dimensions, which exceeds this limit, so you would need to raise CUBE_MAX_DIM in cubedata.h and rebuild the extension.
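
As a quick sanity check, on a stock build constructing a cube with more than 100 dimensions should simply fail with an error (the exact wording depends on the version):

-- Expected to fail on a default build, where cube is compiled with a
-- 100-dimension limit (CUBE_MAX_DIM in cubedata.h):
SELECT cube(array_agg(random()::float)) AS c150
FROM generate_series(1,150);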

Method 2

Consider performing dimensionality reduction first (e.g. Principal Component Analysis).

Then you are doing NN in a small number of dimensions, with much better performance.

You can use PL/R to perform PCA inside Postgres if needed.
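
If you want to keep that step inside the database, here is a minimal sketch of such a PL/R function. It assumes the plr extension is installed; the function name pca_reduce, the two-dimensional float8[] calling convention, and the number of components k are illustrative only.

CREATE EXTENSION IF NOT EXISTS plr;

-- Project a set of row vectors (one point per row of a 2-D array) onto
-- their first k principal components. PL/R hands the 2-D float8[] to R as
-- a matrix and converts the returned matrix back into a 2-D float8[].
CREATE OR REPLACE FUNCTION pca_reduce(points float8[], k integer)
RETURNS float8[] AS
$$
  p <- prcomp(points, center = TRUE, scale. = FALSE)  # fit PCA on the rows
  p$x[, 1:k, drop = FALSE]                            # first k component scores
$$ LANGUAGE plr;

The reduced vectors (50 dimensions or fewer) can then be stored in a cube column and indexed with GiST exactly as in Method 1.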

Method 3

Take a look at https://github.com/a-mma/AquilaDB. It is a vector database for storing feature vectors along with JSON metadata. Run it alongside your RDBMS and use the metadata to maintain cross references between the two datasets.

Method 4

Have a look at FLANN and OpenCV.

Unfortunately, I am not aware of an integration of these into an RDBMS. But there is, for example, an integration of chemical structure search with Postgres, so in principle this can be done.

Note: Use Method 1 if you can, because it is the method we tested on our system.
Thank you 🙂

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
