Should I use SQL or NoSQL for a file catalog?

I will be implementing a system that stores 250 million files per user. I want to perform list operations from the client side, with the client application fetching 100 records at a time.

What is the record I will be fetching?

  • File name
  • Last-modified timestamp
  • HTTPS URI of the file, which is stored in S3-like storage

On the server side, should I use SQL or NoSQL to store this metadata?

I was thinking about using an RDBMS with the following schema:

  • UserID
  • recordid
  • fileName
  • timestamp
  • URI

As the fetch query only needs to return 100 records at a time, I was leaning towards SQL. If we saved all of a user's file information in a single row in NoSQL, it would take a long time to query the next 100 records or to append new files for a particular user.

Any suggestions? I am new to this, so please let me know if my question is too vague/broad and I can update it for any specific questions.

How to solve :

Method 1

Your use case is of the simplest kind, so you'd likely see very similar performance regardless of whether you used an RDBMS or a NoSQL system.
But here are the things you should actually consider when deciding on a database system:

  1. Do you have a well structured schema?

    Answer: Yes, it appears you do, since you're able to articulate the structure of that schema directly by listing its fields: UserId, RecordId, FileName, Timestamp, URI.

  2. Will your schema change so frequently that you won't be able to keep the database entity structure up to date with it?

    Answer: I wouldn't think so, based on the kind of data you're planning to store in your files table, but that's for you to decide. NoSQL's schema-less flexibility is most useful when you have a frequently changing or loosely defined schema and, as a developer, you don't want to take on the responsibility of maintaining structural changes on the database side. But if you're OK with maintaining your database entities if and when the schema changes, then an RDBMS will work just fine as well.

  3. Is your data relational?

    Answer: Yes, it sounds like it, especially since you mentioned that you have "user information" as well, which I assume relates to your files table through the UserId field.

There are other factors you could consider as well, such as cost efficiency, ease of infrastructure maintenance when scaling, and sharding vs vertical scaling. But these are more granular details, and the relevant features are available in most database systems nowadays (regardless of whether you pick a SQL or a NoSQL solution). They are more complex and outside the general scope of when to choose SQL or NoSQL.

The questions above are the main ones I think should be used to decide between an RDBMS and a NoSQL solution. Since your schema is well defined and relational, an RDBMS sounds like a good choice for your use case.
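As a rough illustration of the RDBMS option, here is a minimal sketch of the question's schema with paged listing, using Python's built-in sqlite3 module purely for demonstration. The table name `files`, the composite primary key, the `fetch_page` and `demo_page_sizes` helpers, and the `storage.example.com` URIs are all assumptions made for this example and do not come from the question. It pages by "last record id seen" (keyset pagination) instead of OFFSET, which is one common way to keep each 100-record fetch cheap however deep into the catalog the client is.

```python
import sqlite3

PAGE_SIZE = 100  # the question's client fetches 100 records at a time


def create_catalog(conn):
    # Composite primary key (assumed): all rows of one user sit together in
    # the index, so fetching a page is a short range scan.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            user_id       INTEGER NOT NULL,
            record_id     INTEGER NOT NULL,
            file_name     TEXT    NOT NULL,
            last_modified TEXT    NOT NULL,
            uri           TEXT    NOT NULL,
            PRIMARY KEY (user_id, record_id)
        )
        """
    )


def fetch_page(conn, user_id, after_record_id=0, page_size=PAGE_SIZE):
    """Return up to page_size rows for user_id with record_id > after_record_id."""
    cur = conn.execute(
        """
        SELECT record_id, file_name, last_modified, uri
        FROM files
        WHERE user_id = ? AND record_id > ?
        ORDER BY record_id
        LIMIT ?
        """,
        (user_id, after_record_id, page_size),
    )
    return cur.fetchall()


def demo_page_sizes(total_files=250):
    """Load a small sample catalog for one user and return each page's size."""
    conn = sqlite3.connect(":memory:")
    create_catalog(conn)
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
        [
            (1, i, f"file_{i}.txt", "2024-01-01T00:00:00Z",
             f"https://storage.example.com/user1/file_{i}.txt")
            for i in range(1, total_files + 1)
        ],
    )
    sizes, cursor = [], 0
    while True:
        page = fetch_page(conn, 1, cursor)
        if not page:
            break
        sizes.append(len(page))
        cursor = page[-1][0]  # last record_id seen becomes the next cursor
    conn.close()
    return sizes
```

Calling `demo_page_sizes(250)` walks the sample catalog in pages of 100, 100 and 50 rows. Appending a new file is a single-row INSERT, so it does not touch the user's existing rows.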

Method 2

NoSQL is the right way.

Store Files in a File System, Not a Relational Database.

EDIT: for Wernfried

Hi Wernfried and thank you for your comment.

When you want to save an image in SQL Server you have to use the data type VARBINARY(MAX) and actually bring the image into the database:

INSERT INTO adventureworks.dbo.myimages
VALUES (
    1
    ,(
        SELECT *
        FROM OPENROWSET(BULK N'C:\img\1.png', SINGLE_BLOB) AS T1
        )
    )

This means that in an RDBMS, the data would be in different rows stored in different places on disk, requiring multiple disk operations for retrieval (unless you keep reordering them from time to time, which would be very painful with images and documents).

But in NoSQL you have more versatility, you can use:

  • Key-value data stores: any type of binary object (text, video, JSON document, etc.)
  • Document stores: JSON, XML, and BSON documents.
  • Wide-column stores: data in tables with rows and columns similar to RDBMS but a query can retrieve related data in a single operation
  • Graph stores: graph structures to store, map, and query relationships, so that adjacent elements are linked together without using an index.

Normally for images or documents you would go for the first two: Key-value data stores or Document stores.

And how are these two types actually stored behind the curtains?

They are simply saved on a filesystem, with only the reference to the
image or document stored in NoSQL.
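The "reference only" idea can be sketched in a few lines of Python. This is not how any particular NoSQL engine is implemented: `ReferenceStore` and `demo_roundtrip` are hypothetical names made up for this example, and a plain dict stands in for the key-value index. The sketch only shows the split between the payload, which is written to the filesystem, and the store, which keeps just the path.

```python
import tempfile
from pathlib import Path


class ReferenceStore:
    """Hypothetical key-value catalog: values are file paths, not file bytes."""

    def __init__(self, root):
        self.root = Path(root)
        self.index = {}  # key -> path of the stored file

    def put(self, key, data):
        path = self.root / key
        path.write_bytes(data)       # the payload goes to the filesystem
        self.index[key] = str(path)  # the store keeps only the reference
        return self.index[key]

    def get(self, key):
        # Follow the stored reference and read the bytes back from disk.
        return Path(self.index[key]).read_bytes()


def demo_roundtrip(payload=b"example image bytes"):
    """Store a payload under a key and confirm it reads back unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        store = ReferenceStore(tmp)
        ref = store.put("1.png", payload)
        return store.get("1.png") == payload and ref.endswith("1.png")
```

The same split already exists in the question's design, where each catalog record holds a URI that points at S3-like storage and never the file contents themselves.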

That's why many people prefer to call RDBMSs "databases" and NoSQL systems "search engines": what NoSQL does is simply dig through terabytes or petabytes of JSON, log files, images, etc.

CONCLUSION

If you have the image C:\img\1.png and you want to import it into SQL Server you have to save it as VARBINARY(MAX).

But if you want to import C:\img\1.png into NoSQL, you just have to tell the engine that the images are in the folder C:\img\ and NoSQL is fine with that.

…and now we can all take out our knives and baseball bats and start the fight.

Method 3

The example you give is non-relational data, which often makes NoSQL a good fit. Each row would be its own document in a NoSQL database. Putting all the "rows" into a single document would mean fetching a huge document each time, and you would probably exceed the document size limit as well. With one document per file, you fetch only the documents you need and advance the offset by 100 each time. This also makes adding new data very quick. NoSQL can have lower latency, too.
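To put a rough number on the size-limit concern, here is a back-of-the-envelope sketch. The 200 bytes per record is an assumed figure covering name, timestamp and URI, not a measurement, and MongoDB's 16 MiB maximum BSON document size is used only as one example of a per-document limit, since the question does not name a specific NoSQL product.

```python
# Example limit: MongoDB's maximum BSON document size (16 MiB).
EXAMPLE_MAX_DOCUMENT_BYTES = 16 * 1024 * 1024

FILES_PER_USER = 250_000_000    # from the question
ASSUMED_BYTES_PER_RECORD = 200  # assumption: name + timestamp + URI


def single_document_bytes(file_count, bytes_per_record):
    """Approximate size of one document holding every record of a user."""
    return file_count * bytes_per_record


def max_records_per_document(bytes_per_record, limit=EXAMPLE_MAX_DOCUMENT_BYTES):
    """How many records of the given size fit under the document limit."""
    return limit // bytes_per_record


def fits_in_one_document(file_count, bytes_per_record,
                         limit=EXAMPLE_MAX_DOCUMENT_BYTES):
    return single_document_bytes(file_count, bytes_per_record) <= limit


if __name__ == "__main__":
    total = single_document_bytes(FILES_PER_USER, ASSUMED_BYTES_PER_RECORD)
    print(f"one document per user: about {total / 1e9:.0f} GB")
    print("records that fit in 16 MiB:",
          max_records_per_document(ASSUMED_BYTES_PER_RECORD))
    print("fits:", fits_in_one_document(FILES_PER_USER, ASSUMED_BYTES_PER_RECORD))
```

Under these assumptions a single per-user document would be around 50 GB, several thousand times the example limit, which is why one document per file (or per small batch of files) is the practical layout.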

SQL can have benefits if the data is relational across many tables or needs ACID properties.

Other factors, such as cost and available technology, would be the main drivers here for me, along with any future plans. But in this instance NoSQL seems a good fit.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
