How does Wikipedia store article diffs (or any site which contains lots of text diffs per record)?

Looking at the history of the Obama article on Wikipedia, there are probably thousands of diffs. Clicking on a specific older diff took about 3-5 seconds to load, noticeably slower than the usually snappy Wikipedia pages. Once it loads, you can see the specific diff between two versions, the old version, and the current version.

The question is: how do they store these diffs efficiently, assuming (as I would) that they are not using git?

I am looking for a way to implement text diff tracking in a small toy app I am working on. There will potentially be a million pages, each with records that might have anywhere from 0 to 1000 edits, let’s say. The edits will be much smaller than Wikipedia articles: sometimes on the order of a character diff within one word, other times a few sentences.

The recommended approach I have seen so far is to store the current full text of the record, and then store "reverse diffs" to get to the previous version, and the previous one from that, etc. Then, to hydrate an old article, you would fetch all the old diffs (probably something like 3000 diffs in that Wikipedia example) and rewind from the current text to get the old version. This seems like it would be horribly inefficient though (and I’m not sure how to compute a "reverse" diff, for example with jsdiff). Is it the best way to do it? I guess you wouldn’t want to store a full copy of the text on every change, or does Wikipedia do that? I would imagine that would explode in content size, especially on Wikipedia.

Basically, I am wondering how to implement Wikipedia-like diffing by somehow storing the content in a SQL database. My case won’t have content as large as Wikipedia’s, but it would be good to know how to solve it properly for this "worst case".

Method 1

Just a heads up, this is more of a programming question than it is a database question. Let me explain why, which should answer your question at a high level:

Comparing two bodies of text is not a standard database function; rather, it’s something usually done in application code. Many applications do exactly this (most without a database behind them at all), for example: Beyond Compare, ExamDiff, Git, and SQL Examiner.

Yes, databases have implemented features such as Full Text Search, and quasi-database technologies such as Elasticsearch have been developed specifically for text parsing and word search. But they exist for specific use cases, and those use cases do not really include revision control of large bodies of text.

The database is good at storing that data at rest, and at finding specific instances of that data quickly so that it can then be consumed and used as needed. For example, with Wikipedia, in theory it should be very quick (e.g. less than a second) to find and load a specific version of the article for a specific ArticleId (when indexed properly), regardless of whether there’s 1 revision or 1 billion revisions of the same article.
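To make the "indexed properly" part concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the simplest possible layout, one full copy of the text per revision, and the table, column, and function names (`revision`, `article_id`, `rev_number`, `save_revision`, `load_revision`) are made up for illustration; this is not Wikipedia's actual schema. The composite primary key is what lets the database seek straight to one version no matter how many revisions exist.

```python
import sqlite3

# In-memory database for the sketch; a real app would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE revision (
        article_id INTEGER NOT NULL,
        rev_number INTEGER NOT NULL,
        body       TEXT    NOT NULL,
        -- Composite key doubles as the index used for point lookups.
        PRIMARY KEY (article_id, rev_number)
    )
    """
)


def save_revision(article_id: int, body: str) -> int:
    """Store a new full-text revision and return its revision number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(rev_number), 0) FROM revision WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    rev_number = row[0] + 1
    conn.execute(
        "INSERT INTO revision (article_id, rev_number, body) VALUES (?, ?, ?)",
        (article_id, rev_number, body),
    )
    conn.commit()
    return rev_number


def load_revision(article_id: int, rev_number: int):
    """Fetch one specific version via the primary-key index, or None."""
    row = conn.execute(
        "SELECT body FROM revision WHERE article_id = ? AND rev_number = ?",
        (article_id, rev_number),
    ).fetchone()
    return row[0] if row else None


save_revision(1, "The quick brown fox")
save_revision(1, "The quick red fox")
old_text = load_revision(1, 1)
new_text = load_revision(1, 2)
```

Loading the two versions to compare is then just two keyed lookups; the comparison itself happens afterwards in application code.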

The additional time you experienced is probably spent on the application side, when it needs to compare two versions, especially if there are a lot of differences between them. Though if the body of text being compared isn’t too huge (millions of words), even that comparison on the application side should be relatively quick too (the aforementioned comparison applications usually work quite well for decently sized bodies of text).

On the approach you describe (storing the current full text of the record, and then storing "reverse diffs" to get to the previous version, and the previous one from that, etc.):

Sure, you could do that to save on space. It won’t make the database side any faster, and I agree with your instinct that it’ll make the comparison slower on the application side, since an old version has to be rebuilt from the diffs before it can be compared (though I haven’t tried such an approach myself). So I don’t think this would be the route I’d go.
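For reference, if you did want to try reverse diffs, a rough sketch could look like the following. It uses Python's standard difflib rather than jsdiff, and the delta format and the helper names (`make_delta`, `apply_delta`, `rewind`) are my own invention for illustration, not something any particular library or Wikipedia defines. The idea is that a "reverse" diff is just an ordinary diff computed from the new text to the old text at save time.

```python
from difflib import SequenceMatcher


def make_delta(source: str, target: str) -> list:
    """Build a compact delta that turns `source` into `target`.

    Unchanged spans are stored as offsets into `source`; only text that
    exists solely in `target` is stored literally.
    """
    delta = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("add", target[j1:j2]))
        # "delete" needs no entry: that span of `source` is simply skipped.
    return delta


def apply_delta(source: str, delta: list) -> str:
    """Rebuild the target text from `source` plus a delta from make_delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(source[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def rewind(current: str, reverse_deltas: list) -> str:
    """Apply reverse deltas in order (newest first) to reach an older version."""
    text = current
    for delta in reverse_deltas:
        text = apply_delta(text, delta)
    return text


v1 = "The quick brown fox"
v2 = "The quick red fox"
v3 = "The quick red fox jumps"

# On each save, keep the new full text and a delta from NEW back to OLD.
reverse_deltas = [make_delta(v3, v2), make_delta(v2, v1)]  # newest first

restored_v2 = rewind(v3, reverse_deltas[:1])
restored_v1 = rewind(v3, reverse_deltas)
```

Note that reaching a version N edits back means applying N deltas one after another, which is exactly the extra application-side work discussed above.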

But my overall takeaway here is to let the database side do what it’s good at: storing many records, and locating and loading a subset of those records. Again, it doesn’t matter if there are 1 billion versions of the same article; when indexed properly, it’ll take the database roughly the same time (which should be under a second) to accomplish those functions.

Then let the application handle the comparison of two versions of the article in the most efficient way possible. This is where the question becomes more of a programming one, and the answer will depend on the language and frameworks you’re using. So you’re probably better off asking how to accomplish that piece of the workflow on Stack Overflow.
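As one illustration of the application-side step, here is a small sketch in Python using the standard library's difflib. The function name `diff_versions` and its labels are hypothetical, and other languages have their own equivalents (jsdiff in JavaScript, for instance); treat it as an example of the shape of the work, not a recommendation of a specific tool.

```python
import difflib


def diff_versions(old_text: str, new_text: str,
                  old_label: str = "old", new_label: str = "new") -> list:
    """Return a line-based unified diff between two full-text versions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines,
            fromfile=old_label, tofile=new_label,
            lineterm="",  # input lines carry no newline, so add none to headers
        )
    )


old_version = "line one\nline two\nline three"
new_version = "line one\nline 2\nline three"

for line in diff_versions(old_version, new_version, "rev 1", "rev 2"):
    print(line)
```

The two texts would come from whatever the database returned for the two requested versions; the diff is computed on demand and never needs to be stored.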


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
