How to find the counts of word occurrences in a column?

I have a column containing text in PostgreSQL. For a given input word, I want to find out how many rows contain that word once, how many contain it twice, and so on.

So I have

  date_col  | text_col
------------+-------------------
 2021-04-02 | This is a test.
 2021-03-30 | A test is a test.
 2021-03-30 | How to test?
 2021-04-01 | One more test

and I want this result for the word 'test':

 count | num_occurrences
-------+-----------------
     3 |               1
     1 |               2

meaning that three rows contain exactly one occurrence of "test", and one row contains two occurrences of "test".

Later, I want to be able to restrict the same query to a given period.

My initial take was to create a new table with a row for every word like so:

  date_col  | word
------------+------
 2021-04-02 | This
 2021-04-02 | is
 2021-04-02 | a

and do some counting and grouping. But is there a better way?

How to solve :

Method 1

First split each string into words, then count how many times each word occurs within that string. The splitting can be done with regexp_split_to_table(), which yields one row per word; a lateral subquery then counts the words separately for each row of the table:

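select w.word, w.num_occurrences
from the_table t
  cross join lateral (
     -- split on runs of whitespace and punctuation, one row per word
     select word, count(*) as num_occurrences
     from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
     where word <> ''  -- drop empty strings left by leading/trailing delimiters
     group by word
  ) w;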

This returns the following given your sample data:

word | num_occurrences
-----+----------------
test |               1
a    |               1
is   |               1
this |               1
test |               2
a    |               2
is   |               1
test |               1
how  |               1
to   |               1
test |               1
more |               1
one  |               1
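
To see why the where word <> '' filter is needed, it can help to run the splitting step on a single string. This is only a small illustration that uses the second sample row as a literal; it assumes the default standard_conforming_strings = on, so that \s reaches the regex engine unchanged.

```sql
-- Split one sample string the same way the query above does.
select word
from regexp_split_to_table(lower('A test is a test.'), '[\s[:punct:]]+') as x(word);
```

This should return the rows a, test, is, a, test and then one empty string, because the text ends with a delimiter (the final period). Without the filter, that empty string would be counted as a word of its own.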

Filtering this for the word 'test' and grouping by num_occurrences gives the result you want:

select count(*), num_occurrences
from (
  select w.word, w.num_occurrences
  from the_table t
    cross join lateral (
       select word, count(*) as num_occurrences
       from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
       where word <> ''
       group by word
    ) w
  where w.word = 'test'  -- keep only the word of interest
) per_row
group by num_occurrences
order by 1 desc;

And the result of that is:

count | num_occurrences
------+----------------
    3 |               1
    1 |               2
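
The question also asks about restricting the count to a given period. One way to do that is to add a condition on date_col next to the word filter. The following is a minimal sketch, not a tested production query: it assumes date_col is of type date, uses a half-open range as the period, and inlines the sample data in a CTE named the_table so the statement runs on its own. The period bounds and the alias per_row are illustrative choices.

```sql
-- Sample data inlined as a CTE; with a real table, drop the "with" part.
with the_table (date_col, text_col) as (
  values
    (date '2021-04-02', 'This is a test.'),
    (date '2021-03-30', 'A test is a test.'),
    (date '2021-03-30', 'How to test?'),
    (date '2021-04-01', 'One more test')
)
select count(*), num_occurrences
from (
  select w.word, w.num_occurrences
  from the_table t
    cross join lateral (
       select word, count(*) as num_occurrences
       from regexp_split_to_table(lower(t.text_col), '[\s[:punct:]]+') as x(word)
       where word <> ''
       group by word
    ) w
  where w.word = 'test'
    -- assumed period: 2021-04-01 up to and including 2021-04-02
    and t.date_col >= date '2021-04-01'
    and t.date_col <  date '2021-04-03'
) per_row
group by num_occurrences
order by 1 desc;
```

With the sample data, only the rows dated 2021-04-01 and 2021-04-02 fall into the period, and each contains "test" once, so this should return a single row with count 2 and num_occurrences 1.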

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
