I have a column containing text in PostgreSQL. For a given input word, I want to find the counts of occurrences of that word in my text column.
So I have

date_col   | text_col
-----------+-------------------
2021-04-02 | This is a test.
2021-03-30 | A test is a test.
2021-03-30 | How to test?
2021-04-01 | One more test
and I want this result for the word 'test':

count | num_occurrences
------+----------------
3     | 1
1     | 2

meaning that three rows contain exactly one occurrence of "test" and one row contains two occurrences of "test".
Later, I want to be able to query a given period with the same query.
My initial take was to create a new table with a row for every word like so:
date_col   | word
-----------+------
2021-04-02 | This
2021-04-02 | is
2021-04-02 | a
and do some counting and grouping. But is there a better way?
How to solve :
Method 1
You first need to split each string into words and count how many times each word occurs in that string. Splitting can be done using regexp_split_to_table(), which yields one row per word:
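-- for each row of the_table: split the text into lower-case words
-- and count how often each word occurs within that row
select w.word, w.num_occurrences
from the_table t
  cross join lateral (
    select word, count(*) as num_occurrences
    from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
    where word <> ''  -- drop empty strings produced by leading/trailing separators
    group by word
  ) w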
This returns the following for your sample data (the row order is not guaranteed without an order by):
word | num_occurrences
-----+----------------
test | 1
a | 1
is | 1
this | 1
test | 2
a | 2
is | 1
test | 1
how | 1
to | 1
test | 1
more | 1
one | 1
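To reproduce the output above, the sample data from the question can be loaded with a setup like the following sketch. The table name the_table comes from the query above; the column types (date and text) are assumptions, since the question does not state them.

```sql
-- Sketch: sample data from the question.
-- Column types are assumptions; the question does not state them.
create table the_table (
  date_col date,
  text_col text
);

insert into the_table (date_col, text_col)
values
  (date '2021-04-02', 'This is a test.'),
  (date '2021-03-30', 'A test is a test.'),
  (date '2021-03-30', 'How to test?'),
  (date '2021-04-01', 'One more test');
```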
The per-row counts can then be filtered for the word 'test' and grouped by num_occurrences to get the result you want:
select count(*), num_occurrences
from (
  -- per-row word counts, restricted to the word of interest
  select w.word, w.num_occurrences
  from the_table t
    cross join lateral (
      select word, count(*) as num_occurrences
      from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
      where word <> ''
      group by word
    ) w
  where w.word = 'test'
) per_row
group by num_occurrences
order by 1 desc
And the result of that is:
count | num_occurrences
------+----------------
3 | 1
1 | 2
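For the follow-up requirement of querying a given period, one option is to add a condition on date_col to the inner query. The sketch below assumes date_col is of type date or timestamp, and the boundary dates are only illustrative. It uses a half-open range (>= start, < end) so that it would also behave sensibly if the column were a timestamp.

```sql
-- Sketch: same counting query, restricted to an illustrative period.
-- Assumes date_col is a date (or timestamp) column.
select count(*), num_occurrences
from (
  select w.word, w.num_occurrences
  from the_table t
    cross join lateral (
      select word, count(*) as num_occurrences
      from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
      where word <> ''
      group by word
    ) w
  where w.word = 'test'
    and t.date_col >= date '2021-03-30'  -- start of period (inclusive)
    and t.date_col <  date '2021-04-02'  -- end of period (exclusive)
) per_row
group by num_occurrences
order by 1 desc;
```

With the sample data, this period covers three rows: two contain "test" once and one contains it twice.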
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.