Make IN clause behave like AND clause


I’ve asked the same question on Stack Overflow but got no answer, which is why I’m asking it here (I’ve read that simple SQL questions should be asked on SO, but I have no choice).

Suppose I have 3 tables: posts, post_categories and categories. I’m implementing a page with filters where posts can be filtered by category. The user can select multiple categories. When more than one is selected, the filters should be combined with AND. I can’t get it working. What I have now is a simple IN() SQL clause:

SELECT posts.id, post_categories.category_id FROM posts 
JOIN post_categories ON posts.id = post_categories.post_id 
WHERE post_categories.category_id IN (1,2,3) LIMIT 10;

But it is not matching all ids, it is OR and I need AND. What I need here is to find all posts that have categories with id=1 AND id=2 AND id=3, but instead this query returns posts that have at least one category from the IN() list.
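
To make the symptom concrete, here is a minimal sketch that reproduces it with Python's built-in sqlite3 module. The sample rows are hypothetical, and the posts.id column name is an assumption based on the query above.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE post_categories (
        post_id INTEGER REFERENCES posts(id),
        category_id INTEGER,
        PRIMARY KEY (post_id, category_id)
    );
    INSERT INTO posts (id) VALUES (1), (2), (3);
    -- post 1 has all three categories, post 2 has two, post 3 has one
    INSERT INTO post_categories VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2),
        (3, 3);
""")

# IN keeps a row when its category is ANY of the listed ids (OR semantics).
rows = conn.execute("""
    SELECT DISTINCT posts.id
    FROM posts
    JOIN post_categories ON posts.id = post_categories.post_id
    WHERE post_categories.category_id IN (1, 2, 3)
    ORDER BY posts.id
""").fetchall()
matched_ids = [r[0] for r in rows]
print(matched_ids)  # all three posts match, although only post 1 has every category
```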

How to solve:


Method 1

The IN() operator is equivalent to a series of OR, which doesn’t help you at all here.

Instead, I would build the query in a manner like this.

SELECT id FROM posts 
WHERE id IN (SELECT post_id FROM post_categories WHERE category_id=1)
  AND id IN (SELECT post_id FROM post_categories WHERE category_id=2)
  AND id IN (SELECT post_id FROM post_categories WHERE category_id=3)

Admittedly, not a pretty construct, but I wrote it for readability. The following query might perform better:

SELECT id
FROM posts
WHERE
-- include:
    id IN (
        SELECT post_id
        FROM post_categories
        WHERE category_id IN (1, 2, 3)
        GROUP BY post_id
        HAVING COUNT(*)=3)
-- exclude:
    AND id NOT IN (
        SELECT post_id
        FROM post_categories
        WHERE category_id NOT IN (1, 2, 3))

The second query assumes that the primary key of post_categories is (post_id, category_id). Since you haven’t specified which rdbms you’re running, you may have to tweak my code a bit to make it run.

Edit: The -- exclude: part eliminates posts that have any other category_id than 1, 2 or 3. You may want to skip this part depending on if you want to return a) all posts that have categories 1, 2 and 3 or b) all posts that have exactly categories 1, 2 and 3 and no other categories.
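
Since the category list is chosen by the user, the count in HAVING has to follow the length of the list. The sketch below shows one way to parameterize the GROUP BY / HAVING query, again using Python's sqlite3 for runnability. The function name posts_with_all, its exact flag, and the sample rows are hypothetical, and it relies on (post_id, category_id) being unique, as assumed above.

```python
import sqlite3

def posts_with_all(conn, category_ids, exact=False):
    """Return ids of posts that have every category in category_ids.

    With exact=True, posts that also carry other categories are excluded.
    """
    wanted = sorted(set(category_ids))
    marks = ", ".join("?" for _ in wanted)  # one placeholder per id
    sql = f"""
        SELECT post_id
        FROM post_categories
        WHERE category_id IN ({marks})
        GROUP BY post_id
        HAVING COUNT(*) = ?
    """
    params = wanted + [len(wanted)]
    if exact:
        # Wrap the "include" query and add the "exclude" condition.
        sql = f"""
            SELECT post_id FROM ({sql})
            WHERE post_id NOT IN (
                SELECT post_id FROM post_categories
                WHERE category_id NOT IN ({marks}))
        """
        params += wanted
    sql += " ORDER BY post_id"
    return [r[0] for r in conn.execute(sql, params)]

# Hypothetical sample data: post 3 has an extra category (4).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_categories (
        post_id INTEGER, category_id INTEGER,
        PRIMARY KEY (post_id, category_id)
    );
    INSERT INTO post_categories VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2),
        (3, 1), (3, 2), (3, 3), (3, 4);
""")
at_least = posts_with_all(conn, [1, 2, 3])
exactly = posts_with_all(conn, [1, 2, 3], exact=True)
print(at_least, exactly)
```

Only the placeholders are interpolated into the SQL text; the ids themselves are passed as bound parameters.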

Method 2

IN is like writing = ANY, e.g.

regress=> SELECT 1 = ANY (ARRAY[1, 2, 3]);
 ?column? 
----------
 t
(1 row)

A simplistic interpretation of your question would be to say that you’re asking for = ALL, but that rarely makes sense:

regress=> SELECT 1 = ALL (ARRAY[1, 2, 3]);
 ?column? 
----------
 f
(1 row)

What you really want is to find all posts that have a corresponding post_categories entry for each of a listed set of category_id.
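
Outside SQL, the distinction is the difference between a set intersection and a subset test. A small Python sketch with made-up post ids and category sets:

```python
wanted = {1, 2, 3}

# Hypothetical mapping: post id -> set of category ids attached to it.
post_categories = {10: {1, 2, 3}, 11: {1, 2}, 12: {1, 2, 3, 4}}

# What IN gives you: posts sharing at least one category with the list.
any_match = sorted(p for p, cats in post_categories.items() if cats & wanted)

# What the question asks for: posts whose categories contain the whole list.
all_match = sorted(p for p, cats in post_categories.items() if wanted <= cats)

print(any_match, all_match)
```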

As is usual in relational databases, as soon as you clearly frame the question, the solution starts writing itself.

You could do a multiple join, like Daniel suggests, but that’s a pain because it requires dynamic SQL. Or you could use relational set operations.

Instead, I’d possibly use PostgreSQL’s array features. This would be easier if you’d provided sample data, but I think you want something like the following, which finds each category that matches the post then does a check to see if all the categories match (the HAVING clause):

SELECT posts.id
FROM posts 
INNER JOIN post_categories ON posts.id = post_categories.post_id 
WHERE post_categories.category_id IN (1,2,3)
GROUP BY posts.id
HAVING array_agg(post_categories.category_id) @> ARRAY[1,2,3]

Untested, since you didn’t provide sample data in CREATE TABLE and INSERTs form, but that’s the general idea. @> means “array-contains”; I suspect it’ll be more efficient than using array_agg(category_id ORDER BY category_id) and an equality test against a sorted ARRAY literal, because you avoid the sort on aggregation.

You could leave out the WHERE clause and this would still work, but it might be quite slow, as it’d fail to filter out posts in which none of the categories matched before doing aggregation.

BTW, LIMIT without ORDER BY is usually a bug. The database can return any set of 10 matching results it feels like, in any order.
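
As a rough illustration of the LIMIT point, the sketch below pages through a hypothetical posts table with an explicit ORDER BY, so each page is well defined. It uses Python's sqlite3; the ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    INSERT INTO posts (id) VALUES (5), (3), (9), (1), (7);
""")

# ORDER BY pins down which rows LIMIT keeps and in what order.
first_page = [r[0] for r in conn.execute(
    "SELECT id FROM posts ORDER BY id LIMIT 3")]
second_page = [r[0] for r in conn.execute(
    "SELECT id FROM posts ORDER BY id LIMIT 3 OFFSET 3")]
print(first_page, second_page)
```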

Method 3

Most importantly, this is a special case of relational division. Once you know the name of the beast you’ll find plenty of query techniques, like the arsenal we assembled on SO.

Building on this test case (which you should have provided):

CREATE TABLE post (
   post_id serial PRIMARY KEY
 , post text NOT NULL
);

CREATE TABLE category (
   category_id serial PRIMARY KEY
 , category text NOT NULL
);

CREATE TABLE post_category (
   post_id int REFERENCES post
 , category_id int REFERENCES category
 , PRIMARY KEY (post_id, category_id)
);

Since I have to provide my own test case I am using proper table and column names. I prefer singular terms for table names where each row represents a single entity. And I avoid the unhelpful "id" as column name.

The special requirement in your case is to make it work with a dynamic set of categories. I suggest two nested EXISTS anti-semi-joins. That should be the fastest and, IMO, also the most elegant way.

This is the task as I understand it after reading the question a couple of times:

“Find all posts that have all of the given categories attached.”
Which can be expressed in its inverted form:
“Find all posts where none of the given categories is missing.”

Short form with given set of category_id:

SELECT p.*                                   -- "Find all posts ..."
FROM   post p
WHERE  NOT EXISTS (                          -- "where none of the ..."
   SELECT 1
   FROM  (VALUES (1),(2),(3)) c(category_id) -- "given categories ..."
   WHERE  NOT EXISTS (                       -- "is missing"
      SELECT 1
      FROM   post_category pc
      WHERE  pc.post_id = p.post_id
      AND    pc.category_id = c.category_id
      )
   );

Or, retrieving IDs from the category table first:

-- posts with red & green
SELECT p.*
FROM   post p
WHERE  NOT EXISTS (
   SELECT 1
   FROM   category c
   WHERE  c.category IN ('red', 'green') -- retrieve IDs from cat table
   AND    NOT EXISTS (
      SELECT 1
      FROM   post_category pc
      WHERE  pc.post_id = p.post_id
      AND    pc.category_id = c.category_id
      )
   );

SQL Fiddle demonstrating all.
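
For readers without PostgreSQL at hand, the double NOT EXISTS pattern can be tried with Python's built-in sqlite3. This is only a sketch: serial is replaced by INTEGER PRIMARY KEY, the sample rows are made up, and the helper name posts_having is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (post_id INTEGER PRIMARY KEY, post TEXT NOT NULL);
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category TEXT NOT NULL);
    CREATE TABLE post_category (
        post_id INTEGER REFERENCES post,
        category_id INTEGER REFERENCES category,
        PRIMARY KEY (post_id, category_id)
    );
    INSERT INTO post VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO category VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO post_category VALUES
        (1, 1), (1, 2), (1, 3),   -- post 1: red, green, blue
        (2, 1),                   -- post 2: red
        (3, 1), (3, 2);           -- post 3: red, green
""")

def posts_having(conn, names):
    """Ids of posts for which none of the named categories is missing."""
    marks = ", ".join("?" for _ in names)  # one placeholder per name
    sql = f"""
        SELECT p.post_id
        FROM   post p
        WHERE  NOT EXISTS (
           SELECT 1
           FROM   category c
           WHERE  c.category IN ({marks})
           AND    NOT EXISTS (
              SELECT 1
              FROM   post_category pc
              WHERE  pc.post_id = p.post_id
              AND    pc.category_id = c.category_id))
        ORDER  BY p.post_id
    """
    return [r[0] for r in conn.execute(sql, list(names))]

red_green = posts_having(conn, ["red", "green"])
all_three = posts_having(conn, ["red", "green", "blue"])
print(red_green, all_three)
```

One caveat of this variant: a name that does not exist in the category table simply drops out of the inner set, so it never counts as missing.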

Method 4

Scopes can help you out here. For example:

In post:

 has_many :post_categories
 has_many :categories, :through => :post_categories
 scope :recent_posts, where('created_at < ?', 1.days.ago).order('created_at ASC').limit(10)
 scope :recent_posts_with_good_categories, includes(:post_categories).where("", "GOOD").merge(Post.recent_posts)
 scope :recent_posts_with_bad_categories, includes(:post_categories).where("", "BAD").merge(Post.recent_posts)
 scope :recent_posts_with_good_bad_evil_categories, includes(:post_categories).where(" IN (?)", ["GOOD","BAD","EVIL"]).merge(Post.recent_posts)

 # Similarly, you can add more scopes in other models, using includes and merge to get combined results.

In post_category:

 belongs_to :post
 belongs_to :category
 # similar scopes can go here as well

In category:

 has_many :post_categories
 has_many :posts, :through => :post_categories
 # similar scopes can go here as well

The same can be achieved with class methods, which you can chain however you want:

    def self.find_users_with_categories(t)
      # here you can query across all associated models, since they are joined
      Post.joins(:post_categories, :categories).select("posts.*,").group("").where(" IN (?)", t)
    end


All methods above are sourced content, licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
