A good dataset for experimenting with NoSQL databases

I need to do some experiments in HBase and Cassandra and to do that I need an adequate dataset.

The dataset I’m looking for has to be large enough (i.e. more than 2GB), and the data in it has to be sufficiently unstructured to be representative of the kind of problems that relational technology can’t cope with. Maybe data derived from social networks, and so on.

Does anyone have that kind of dataset, or know where I can find one?

Method 1

[EDIT – added several public dataset sites].

First off, there is no real evidence that NoSQL databases are “better” at handling large datasets than traditional (OldSQL) RDBMSs. Check out Ted Dziuba’s article about how he can’t wait for NoSQL to die. He makes the point that Walmart continue to use RDBMSs – and they’re not a small company!
He says that NoSQL is, and should remain, niche and that in all likelihood, you don’t need it. He also makes the reasonable point that Facebook, Google and Twitter are not normal companies with normal data processing needs.

Google Michael Stonebraker’s writings on OldSQL, NoSQL and NewSQL (e.g. 1, 2, 3). He makes the point that NoSQL throws the baby out with the bathwater, i.e. NoSQL doesn’t enforce ACID transactions, and that this is an abomination for a database system. As you’ll see from his bio, he’s been involved in databases as an academic and in industry for 40 years.

He agrees with the NoSQL school that OldSQL (think Oracle, MS SQL Server etc.) is “old technology” and needs to be “sent to the home for retired software”, and that OldSQL (in this case MySQL) has trapped Facebook in a “fate worse than death”. His point about NewSQL is that for OLTP apps, you need a shared-nothing sharded architecture (check out his VoltDB) and that for OLAP you need dedicated columnar stores, i.e. Vertica (which he sold to HP).

If this doesn’t convince you, check out Brian Aker’s (former MySQL chief architect) humorous take on NoSQL here.

As for large datasets, I would urge you to Google in the area which particularly interests you. I know that meteorological datasets can be very large (my dad was a meteorologist) and that genomic datasets can also be huge (I studied genetics in uni). This site seems to be right up your alley – with many multi-GB and multi-TB datasets.

[EDIT] Other sites of interest are to be found (1, 2, 3, 4 & 5).

I would strongly urge you to benchmark both RDBMS and NoSQL solutions. As I mentioned, Dziuba says NoSQL is niche – it may suit your particular needs, I don’t know. 2GB datasets are now officially small (even tiny – they fit easily on any memory stick). Nowadays, a database has to reach the multiple-terabyte region before it counts as large. Consider the Apollo moon landing’s IT capacity. There was a time when 2GB was huge – not so anymore!
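As a rough illustration of what such a benchmark could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record shape, the function names (make_records, time_it, load_sqlite, load_key_value), in-memory SQLite standing in for the RDBMS, and a plain dict standing in for a key-value store. A dict is not HBase or Cassandra; for a real comparison you would replace the two loaders with code that talks to the systems you actually want to test, and use a dataset of realistic size.

```python
import json
import sqlite3
import time


def make_records(n):
    """Generate n small semi-structured records (hypothetical shape)."""
    return [
        {
            "id": i,
            "user": f"user{i % 100}",
            "tags": ["a", "b"][: i % 3],  # 0, 1 or 2 tags per record
            "text": f"post {i}",
        }
        for i in range(n)
    ]


def time_it(fn, *args):
    """Return (elapsed_seconds, result) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


def load_sqlite(records):
    """Insert records into an in-memory SQLite table; return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, user TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(r["id"], r["user"], json.dumps(r)) for r in records],
    )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    conn.close()
    return count


def load_key_value(records):
    """Insert records into a dict acting as a toy key-value store."""
    store = {}
    for r in records:
        store[r["id"]] = json.dumps(r)
    return len(store)


if __name__ == "__main__":
    records = make_records(100_000)
    for name, loader in [("sqlite", load_sqlite), ("key-value", load_key_value)]:
        elapsed, count = time_it(loader, records)
        print(f"{name}: {count} records in {elapsed:.3f}s")
```

A single timed run like this says very little on its own; repeat each measurement several times, and measure the reads and queries your application really performs, not just bulk inserts.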

Finally, I leave the last word to Ted Dziuba:

“I’m not just singling out Cassandra – by replacing MySQL or Postgres with a different, new data store, you have traded a well-enumerated list of limitations and warts for a newer, poorly understood list of limitations and warts, and that is a huge business risk”.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.