I am modeling a star schema for user reviews using the Yelp dataset.
Each user review has a business dimension key, a user dimension key, and a set of measures associated with the review. All of the review data is numeric except for the free-text body of the review (stored in a column named text).
Does it make sense to store that text data in the fact table, since it is at the same grain as the fact? Or should it be placed in a dimension table that grows its rows at the same rate as the fact table?
How to solve:
Yes, it does make sense to keep the text in the fact table, for two reasons:
- It is at the same grain as the fact, so a dimension created just for it would grow as fast as the fact table.
- It is not linked to other attributes in the fact table, so it can be modeled as a degenerate dimension directly in the fact table (even though that term usually applies to an ID or a label).
The text column won’t be part of your SELECT statement when you aggregate reviews, so it shouldn’t impact the performance of those queries. It will be read only when showing the data at the most granular level.
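As a rough illustration of the point above, here is a minimal sketch using Python's built-in sqlite3 module. Only the column name text comes from the question; the table name fact_review, the other column names, and the sample rows are assumptions made up for the example.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")

# Hypothetical fact table: every name except `text` is an assumption.
conn.executescript("""
CREATE TABLE fact_review (
    review_key   INTEGER PRIMARY KEY,
    business_key INTEGER NOT NULL,  -- FK to a business dimension
    user_key     INTEGER NOT NULL,  -- FK to a user dimension
    stars        INTEGER NOT NULL,  -- numeric measure
    useful       INTEGER NOT NULL,  -- numeric measure
    text         TEXT               -- review body, kept at the fact grain
);
""")

# Made-up sample rows.
rows = [
    (1, 10, 100, 5, 2, "Great tacos."),
    (2, 10, 101, 3, 0, "Average service."),
    (3, 11, 100, 4, 1, "Nice coffee."),
]
conn.executemany("INSERT INTO fact_review VALUES (?, ?, ?, ?, ?, ?)", rows)

# Aggregate query: the text column is never referenced.
agg = conn.execute("""
    SELECT business_key, COUNT(*) AS review_count, AVG(stars) AS avg_stars
    FROM fact_review
    GROUP BY business_key
    ORDER BY business_key
""").fetchall()

# Detail query: the text is read only at the most granular level.
detail = conn.execute(
    "SELECT text FROM fact_review WHERE review_key = ?", (2,)
).fetchone()[0]

print(agg)
print(detail)
```

The aggregate query names only the keys and measures it needs, while the review body is fetched one row at a time by its key.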
If you find that you have other low-cardinality descriptive attributes that you can’t easily put in an existing dimension because they are not related to its other attributes, you can also build a junk dimension that holds all of these unrelated attributes.
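A junk dimension can be sketched as one row per combination of the unrelated attributes, each with its own surrogate key. The attributes has_photo and source below are hypothetical and are not taken from the Yelp data; they only stand in for whatever low-cardinality flags you end up with.

```python
from itertools import product

# Hypothetical low-cardinality attributes (assumed for illustration).
attribute_values = {
    "has_photo": (False, True),
    "source": ("mobile", "web"),
}
names = list(attribute_values)

# One junk-dimension row per combination, with a surrogate key starting at 1.
junk_dim = [
    {"junk_key": key, **dict(zip(names, combo))}
    for key, combo in enumerate(product(*attribute_values.values()), start=1)
]

# Lookup from attribute combination to surrogate key, used when loading facts.
lookup = {tuple(row[n] for n in names): row["junk_key"] for row in junk_dim}


def junk_key_for(has_photo, source):
    """Return the junk-dimension key to store on a fact row."""
    return lookup[(has_photo, source)]


for row in junk_dim:
    print(row)
```

The fact table then carries a single junk_key column instead of one column per flag, and the dimension stays small because its row count is bounded by the number of attribute combinations.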