Find closest database values across several attributes without O(n) traversal

I have a use case where I have access to a large database with many documents (millions+). I want to build functionality such that, when a new document is added to that database, I first scan the database for potential duplicates. Documents may differ in file type and other information (e.g. one is a digital copy, while another is a hard copy that was scanned in) while still materially being the same document, so I unfortunately can’t rely on metadata or similar.

To determine whether a document is a duplicate, I plan to use some weighted combination of NLP-based document similarity (via something like spaCy), comparison of word distributions (both the simple word distribution and something like TF-IDF), and other relevant metrics.
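
Roughly, the scoring I have in mind looks something like this (just a sketch; it assumes spaCy’s en_core_web_md model and scikit-learn for the TF-IDF part, and the 50/50 weights are placeholders):

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumes the medium English model with word vectors is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def similarity_score(text_a, text_b, w_vectors=0.5, w_tfidf=0.5):
    """Weighted blend of spaCy vector similarity and TF-IDF cosine similarity."""
    vector_sim = nlp(text_a).similarity(nlp(text_b))
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return w_vectors * vector_sim + w_tfidf * tfidf_sim
```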

In order to actually find potential duplicates for a newly uploaded document, I can’t think of any way to avoid scanning every document in the database, comparing one-by-one, and tracking the one with metrics that matched most closely.

Thoughts I’ve had for optimizing this:

  • I know indexing can often be used to speed up search operations, but to my understanding that’s only good for searching for a specific value in a specific column. I don’t think it’s a good fit, as I’m trying to essentially take a weighted average of each metric and report which are closest. I guess I could index every column; could this potentially be worth it, or would the constant reindexing be a huge performance penalty?
  • I’ve been thinking of using a clustering (unsupervised machine learning) model to group similar documents together, then determine which cluster a new document would fall into and search only within that cluster, but I’m getting caught up in the details (see the sketch after this list). I’m sure this would be a practical approach for finding preexisting duplicates in the database, but is it practical to do every time a new document is added to the set (i.e. could it be used to speed up the actual search through the database)? I’m not too well-versed in machine learning, so I’d appreciate some input on this.
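
To make the clustering idea concrete, the kind of pipeline I’m picturing is below (a sketch with scikit-learn’s TfidfVectorizer and MiniBatchKMeans on a toy corpus; the real version would fit these offline once and persist them):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real corpus; in practice this is built once over the
# existing documents and the vectorizer/model are saved to disk.
corpus = [
    "invoice for consulting services rendered in march",
    "scanned invoice consulting services march",
    "employee handbook and onboarding policies",
    "onboarding policies handbook for new employees",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Cluster once offline; a new document is only *assigned* to a cluster,
# which is cheap, so the model does not need refitting on every upload.
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit(X)

new_doc = "invoice - consulting services (march, scanned copy)"
cluster = kmeans.predict(vectorizer.transform([new_doc]))[0]

# Candidate set: only documents in the same cluster get the expensive 1:1 comparison.
candidates = [doc for doc, label in zip(corpus, kmeans.labels_) if label == cluster]
print(cluster, candidates)
```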

So ultimately – is there a way for me to structure my database such that I don’t need a linear search in this scenario?

How to solve it

Method 1

You could define an index on each measure, query each to find the few documents that are closest, then build the intersection of these various results. That would turn a size-of-data comparison into a (number of measures) x (width of search in a measure). This is likely to be smaller than the number of rows for a sensible number of measures.

Depending on the DBMS and how good your SQL-fu is, you may even be able to convince the query processor to perform these index reads and then inner join the sub-results together to produce the intersection.
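
As a sketch of the per-measure-index-plus-intersection idea (hypothetical docs table with one indexed column per measure, SQLite syntax, tolerance picked arbitrarily):

```python
import sqlite3

# Hypothetical schema: docs(id INTEGER PRIMARY KEY, m1 REAL, m2 REAL, ..., m20 REAL),
# with an index on each measure column (CREATE INDEX idx_m1 ON docs(m1); ...).
conn = sqlite3.connect("documents.db")

def candidate_ids(new_measures, tolerance=0.05):
    """Query each per-measure index for nearby rows, then intersect the id sets."""
    result = None
    for column, value in new_measures.items():  # e.g. {"m1": 0.42, "m2": 0.17, ...}
        rows = conn.execute(
            f"SELECT id FROM docs WHERE {column} BETWEEN ? AND ?",
            (value - tolerance, value + tolerance),
        )
        ids = {row[0] for row in rows}
        result = ids if result is None else result & ids
        if not result:          # early exit: the intersection is already empty
            break
    # Alternatively, a single SQL statement could INNER JOIN the per-measure
    # subqueries and let the query processor build the intersection.
    return result or set()
```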

One big snag with this is outliers. Say we define 20 measures. An incoming document matches an existing document exactly on 19 of these 20 but is out-of-range on the last. The existing document will never make it into the result even though a human would likely say the two matched. To avoid that you’d have to define a degree of correlation in the index matches, and then you’re back to size-of-data operations again.

What’s wrong with doing the usual vector-space comparison?

Intel reckons a modern server chip can do a few hundred gigaflops or so. Let’s call it 10^11 floating point calculations per second. To compare one document to another on 20 measures using Euclidean distance would need about 80 computations. Let’s call it 100. That’s 10^9 comparisons per second, so comparing against 1M existing documents would take 10^-3 seconds, or 1 millisecond. Let’s knock off a few more zeros because data has to move around inside chips. How often will you be adding documents? How many scanners do you have, and how long does OCR take to process a document? Can you live with a duplication check taking a bit less than one second per document?
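
You can sanity-check that arithmetic with a brute-force NumPy pass, using random data as a stand-in for the real measures:

```python
import time
import numpy as np

# 1M existing documents, each reduced to 20 numeric measures.
rng = np.random.default_rng(0)
existing = rng.random((1_000_000, 20)).astype(np.float32)
new_doc = rng.random(20).astype(np.float32)

start = time.perf_counter()
dists = np.linalg.norm(existing - new_doc, axis=1)   # Euclidean distance to every row
nearest = np.argpartition(dists, 10)[:10]            # 10 closest candidates
elapsed = time.perf_counter() - start

print(f"{elapsed * 1000:.1f} ms", nearest)
```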

Taking the Euclidean approach further, if your DBMS supports a geometry data type you could choose the two or three most selective measures and define a spatial index on those. That should efficiently reduce the search of 1M+ existing documents down to a level where the full vector search is tractable.
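
To prototype that coarse-then-fine idea outside the DBMS, a k-d tree over the two or three chosen measures can stand in for the spatial index (sketch with SciPy; using the first three columns is purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
measures = rng.random((1_000_000, 20)).astype(np.float32)
new_doc = rng.random(20).astype(np.float32)

# Coarse index on only the three "most selective" measures (columns 0-2 here),
# playing the role of the spatial index in the database.
tree = cKDTree(measures[:, :3])

# Coarse pass: nearest 500 documents in the 3-D sub-space.
_, candidate_ids = tree.query(new_doc[:3], k=500)

# Fine pass: full 20-measure Euclidean distance on the candidates only.
dists = np.linalg.norm(measures[candidate_ids] - new_doc, axis=1)
best = candidate_ids[np.argsort(dists)[:10]]
print(best)
```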

Method 2

As a general hint: use the fastest and cheapest metrics to rule out most of the documents, until you reach a candidate set small enough to be compared one-to-one.

An idea would be the following: as you only want to find duplicates and not merely similar documents, something very simple like storing the first 100 and the last 100 characters of a document in two separate columns helps to pre-select potential candidates. (100 is an arbitrary number that you can adjust.)

Then you can run a string similarity check like Damerau-Levenshtein with an appropriate threshold (see for example Optimizing the Damerau-Levenshtein Algorithm in TSQL) on these two columns against the new document’s start and end strings. If the documents are the same, the strings should match for the most part.
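
An application-side sketch of that pre-filter, using Python’s difflib.SequenceMatcher as a stand-in for Damerau-Levenshtein (swap in a proper edit-distance implementation if you need its exact semantics; candidates is assumed to hold the two pre-stored columns):

```python
from difflib import SequenceMatcher

def likely_duplicates(new_text, candidates, threshold=0.9):
    """candidates: iterable of (doc_id, head_100, tail_100) rows from the database."""
    head, tail = new_text[:100], new_text[-100:]
    hits = []
    for doc_id, cand_head, cand_tail in candidates:
        # ratio() gives a 0..1 similarity score; it is not Damerau-Levenshtein,
        # but serves the same purpose of a cheap fuzzy match on short strings.
        head_score = SequenceMatcher(None, head, cand_head).ratio()
        tail_score = SequenceMatcher(None, tail, cand_tail).ratio()
        if head_score >= threshold and tail_score >= threshold:
            hits.append(doc_id)
    return hits
```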

I don’t know what performance you expect or need. I ran the TSQL UDF referenced above on 2.25 million records with a maximum column length of 50; different runs took 20-40 seconds on my desktop computer.

Method 3

Near-duplicate classification models tend to have high runtime overhead, so a brute-force application of such a classifier over a large document database is impractical. One way to handle large-scale scenarios is to combine the near-duplicate classifier with a similarity search engine. The basic idea: index vector representations of the documents, use the index to retrieve a subset of similar documents, and apply the near-duplicate detector only to that subset.

Here is a deduplication example using a similarity search service and a duplicates detector to handle millions of document vector embeddings.

(Mind that I am a co-author of that example.)
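
A minimal sketch of the index-then-classify pipeline, using FAISS as the similarity search component (FAISS is my choice here for illustration, not necessarily what the example above uses; the embedding dimension and random vectors are placeholders for real document embeddings):

```python
import numpy as np
import faiss  # assumption: faiss-cpu is installed; any ANN library would do

d = 384                                                   # embedding dimension
rng = np.random.default_rng(0)
embeddings = rng.random((100_000, d)).astype("float32")   # stand-in for real vectors

# Inner product on L2-normalised vectors equals cosine similarity.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(d)
index.add(embeddings)

def near_duplicate_candidates(query_vec, k=20):
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    # Hand this small candidate set to the (slower) near-duplicate classifier.
    return list(zip(ids[0].tolist(), scores[0].tolist()))

print(near_duplicate_candidates(embeddings[0])[:3])
```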

Method 4

I’ve ended up using ElasticSearch’s More Like This query for my use case. It offers text similarity queries that use TF-IDF as their behind-the-scenes metric, which have been both extremely fast and very accurate for me.
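
The query I’m running looks roughly like this (a sketch with the official Python client; the documents index and content field stand in for my actual mapping):

```python
from elasticsearch import Elasticsearch  # assumption: elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

def similar_documents(new_text, k=10):
    resp = es.search(
        index="documents",
        size=k,
        query={
            "more_like_this": {
                "fields": ["content"],
                "like": new_text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    )
    return [(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]]
```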

The project also involves doing some preprocessing and clustering on the data, which will be used to speed up the data lookup further by only focusing on elements within the same cluster for the duplicate search. I anticipate ElasticSearch will be 90% of the reason for the fast querying I’m looking for, but some of the clustering I mentioned will likely find its way into the end product.

The project is an iterative process, so this is subject to change, but it seems like a very good fit.


