FirebirdSQL logo

The GROUP BY clause

GROUP BY merges output rows that have the same combination of values in its item list into a single row.Aggregate functions in the select list are applied to each group individually instead of to the dataset as a whole.

If the select list only contains aggregate columns or, more generally, columns whose values don’t depend on individual rows in the underlying set, GROUP BY is optional.When omitted, the final result set consists of a single row (provided that at least one aggregated column is present).

If the select list contains both aggregate columns and columns whose values may vary per row, the GROUP BY clause becomes mandatory.

Syntax
SELECT ... FROM ...
  GROUP BY <grouping-item> [, <grouping-item> ...]
  [HAVING <grouped-row-condition>]
  ...

<grouping-item> ::=
    <non-aggr-select-item>
  | <non-aggr-expression>

<non-aggr-select-item> ::=
    column-copy
  | column-alias
  | column-position
Table 1. Arguments for the GROUP BY Clause
Argument Description

<grouping-item>

Expression to group on;in the rest of this chapter, we use <value-expression> in GROUP BY syntax

non-aggr-expression

Any non-aggregating expression that is not included in the SELECT list, i.e. unselected columns from the source set or expressions that do not depend on the data in the set at all

column-copy

A literal copy, from the SELECT list, of an expression that contains no aggregate function

column-alias

The alias, from the SELECT list, of an expression (column) that contains no aggregate function

column-position

The position number, in the SELECT list, of an expression (column) that contains no aggregate function

A general rule of thumb is that every non-aggregate item in the SELECT list must also be in the GROUP BY list.You can do this in three ways:

  1. By copying the item verbatim from the select list, e.g. “class” or “'D:' || upper(doccode)”.

  2. By specifying the column alias, if it exists.

  3. By specifying the column position as an integer literal between 1 and the number of columns.Integer values resulting from expressions or parameter substitutions are simple constant values and not column position and will be used as such in the grouping.They will have no effect though, as their value is the same for each row.

Note

If you group by a column position, the expression at that position is copied internally from the select list.If it concerns a subquery, that subquery will be executed again in the grouping phase.That is to say, grouping by the column position, rather than duplicating the subquery expression in the grouping clause, saves keystrokes and bytes, but it is not a way of saving processing cycles!

In addition to the required items, the grouping list may also contain:

  • Columns from the source table that are not in the select list, or non-aggregate expressions based on such columns.Adding such columns may further subdivide the groups.However, since these columns are not in the select list, you can’t tell which aggregated row corresponds to which value in the column.So, in general, if you are interested in this information, you also include the column or expression in the select list — which brings you back to the rule: “every non-aggregate column in the select list must also be in the grouping list”.

  • Expressions that aren’t dependent on the data in the underlying set, e.g. constants, context variables, single-value non-correlated subselects etc.This is only mentioned for completeness, as adding such items is utterly pointless: they don’t affect the grouping at all.“Harmless but useless” items like these may also figure in the select list without being copied to the grouping list.

Examples

When the select list contains only aggregate columns, GROUP BY is not mandatory:

select count(*), avg(age) from students
  where sex = 'M';

This will return a single row listing the number of male students and their average age.Adding expressions that don’t depend on values in individual rows of table STUDENTS doesn’t change that:

select count(*), avg(age), current_date from students
  where sex = 'M';

The row will now have an extra column showing the current date, but other than that, nothing fundamental has changed.A GROUP BY clause is still not required.

However, in both the above examples it is allowed.This is perfectly valid:

select count(*), avg(age) from students
  where sex = 'M'
  group by class;

This will return a row for each class that has boys in it, listing the number of boys and their average age in that particular class.(If you also leave the current_date field in, this value will be repeated on every row, which is not very exciting.)

The above query has a major drawback though: it gives you information about the different classes, but it doesn’t tell you which row applies to which class.To get that extra bit of information, add the non-aggregate column CLASS to the select list:

select class, count(*), avg(age) from students
  where sex = 'M'
  group by class;

Now we have a useful query.Notice that the addition of column CLASS also makes the GROUP BY clause mandatory.We can’t drop that clause anymore, unless we also remove CLASS from the column list.

The output of our last query may look something like this:

CLASS COUNT AVG

2A

12

13.5

2B

9

13.9

3A

11

14.6

3B

12

14.4

…​

…​

…​

The headings “COUNT” and “AVG” are not very informative.In a simple case like this, you might get away with that, but in general you should give aggregate columns a meaningful name by aliasing them:

select class,
       count(*) as num_boys,
       avg(age) as boys_avg_age
  from students
  where sex = 'M'
  group by class;

Adding more non-aggregate (or, row-dependent) columns requires adding them to the GROUP BY clause too.For instance, you might want to see the above information for girls as well;and you may also want to differentiate between boarding and day students:

select class,
       sex,
       boarding_type,
       count(*) as number,
       avg(age) as avg_age
  from students
  group by class, sex, boarding_type;

This may give you the following result:

CLASS SEX BOARDING_TYPE NUMBER AVG_AGE

2A

F

BOARDING

9

13.3

2A

F

DAY

6

13.5

2A

M

BOARDING

7

13.6

2A

M

DAY

5

13.4

2B

F

BOARDING

11

13.7

2B

F

DAY

5

13.7

2B

M

BOARDING

6

13.8

…​

…​

…​

…​

…​

Each row in the result set corresponds to one particular combination of the columns CLASS, SEX and BOARDING_TYPE.The aggregate results — number and average age — are given for each of these groups individually.In a query like this, you don’t see a total for boys as a whole, or day students as a whole.That’s the tradeoff: the more non-aggregate columns you add, the more you can pinpoint specific groups, but the more you also lose sight of the general picture.Of course, you can still obtain the “coarser” aggregates through separate queries.

HAVING

Just as a WHERE clause limits the rows in a dataset to those that meet the search condition, so the HAVING sub-clause imposes restrictions on the aggregated rows in a grouped set.HAVING is optional, and can only be used in conjunction with GROUP BY.

The condition(s) in the HAVING clause can refer to:

  • Any aggregated column in the select list.This is the most widely used case.

  • Any aggregated expression that is not in the select list, but allowed in the context of the query.This is sometimes useful too.

  • Any column in the GROUP BY list.While legal, it is more efficient to filter on these non-aggregated data at an earlier stage: in the WHERE clause.

  • Any expression whose value doesn’t depend on the contents of the dataset (like a constant or a context variable).This is valid but not useful, because it will either suppress the entire set or leave it untouched, based on conditions that have nothing to do with the set itself.

A HAVING clause can not contain:

  • Non-aggregated column expressions that are not in the GROUP BY list.

  • Column positions.An integer in the HAVING clause is just an integer, not a column position.

  • Column aliases –- not even if they appear in the GROUP BY clause!

Examples

Building on our earlier examples, this could be used to skip small groups of students:

select class,
       count(*) as num_boys,
       avg(age) as boys_avg_age
  from students
  where sex = 'M'
  group by class
  having count(*) >= 5;

To select only groups that have a minimum age spread:

select class,
       count(*) as num_boys,
       avg(age) as boys_avg_age
  from students
  where sex = 'M'
  group by class
  having max(age) - min(age) > 1.2;

Notice that if you’re interested in this information, you’ll likely also include min(age) and max(age) — or the expression “max(age) - min(age)”.

To include only 3rd classes:

select class,
       count(*) as num_boys,
       avg(age) as boys_avg_age
  from students
  where sex = 'M'
  group by class
  having class starting with '3';

Better would be to move this condition to the WHERE clause:

select class,
       count(*) as num_boys,
       avg(age) as boys_avg_age
  from students
  where sex = 'M' and class starting with '3'
  group by class;