I have a problem transposing a large data table in BigQuery (1.5 billion rows) from rows to columns. I could figure out how to do it with a small, hardcoded amount of data, but not at this scale. A snapshot of the table looks like this:
| CustomerID | Feature | Value |
|------------|---------|-------|
| 1          | A123    | 3     |
| 1          | F213    | 7     |
| 1          | F231    | 8     |
| 1          | B789    | 9.1   |
| 2          | A123    | 4     |
| 2          | U123    | 4     |
| 2          | B789    | 12    |
| ...        | ...     | ...   |
| 400000     | A123    | 8     |
| 400000     | U123    | 7     |
| 400000     | R231    | 6     |
So basically there are approximately 400,000 distinct CustomerIDs and 3,000 features, and not every CustomerID has the same features, so some CustomerIDs may have 2,000 features while others have 3,000. In the result table I would like each row to represent one distinct CustomerID, with 3,000 columns presenting all the features, like this:
| CustomerID | Feature1 | Feature2 | ... | Feature3000 |
Some of the cells may have missing values.
Does anyone have an idea how to do this in BigQuery or SQL?
Thanks in advance.
How to solve:
In the query below, replace `yourTable` with the real name of your table, then execute/run it:
```sql
SELECT
  'SELECT CustomerID, ' +
  GROUP_CONCAT_UNQUOTED(
    'MAX(IF(Feature = "' + STRING(Feature) + '", Value, NULL))'
  ) +
  ' FROM yourTable GROUP BY CustomerID'
FROM (
  SELECT Feature FROM yourTable GROUP BY Feature
)
```
As a result you will get a string to be used in the next step.
Take the string you got from Step 1 and execute it as a query.
The output is the pivot you asked for in the question.
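Note that the query above uses legacy SQL functions (`GROUP_CONCAT_UNQUOTED`, `STRING`, `+` concatenation). As a rough sketch only, not a tested solution at this scale: in Standard SQL the same two steps can be combined into one script with `STRING_AGG` and `EXECUTE IMMEDIATE`. The dataset/table name `dataset.yourTable` is a placeholder, and this assumes the feature names (A123, F213, ...) are valid column identifiers:

```sql
-- Standard SQL sketch; dataset.yourTable is a placeholder name
DECLARE pivot_sql STRING;

-- Step 1: build the pivot query as a string,
-- one MAX(IF(...)) expression per distinct feature
SET pivot_sql = (
  SELECT CONCAT(
    'SELECT CustomerID, ',
    STRING_AGG(
      FORMAT('MAX(IF(Feature = "%s", Value, NULL)) AS %s', Feature, Feature)
    ),
    ' FROM dataset.yourTable GROUP BY CustomerID'
  )
  FROM (SELECT DISTINCT Feature FROM dataset.yourTable)
);

-- Step 2: run the generated query
EXECUTE IMMEDIATE pivot_sql;
```

Customers that lack a given feature simply get NULL in that column, which matches the missing values described in the question.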
Hi @Jade, I posted a very similar question before and got a very helpful (and similar) answer from @MikhailBerlyant. For what it's worth, I had about 4,000 features to dummify in my case and also ran into a "Resources exceeded during query execution" error.
I think that this type of large-scale data transformation (rather than query) is better left for other tools more suitable for this task (such as Spark).