The GROUP BY
clause
GROUP BY
merges output rows that have the same combination of values in its item list into a single row.Aggregate functions in the select list are applied to each group individually instead of to the dataset as a whole.
If the select list only contains aggregate columns or, more generally, columns whose values don’t depend on individual rows in the underlying set, GROUP BY
is optional.When omitted, the final result set consists of a single row (provided that at least one aggregated column is present).
If the select list contains both aggregate columns and columns whose values may vary per row, the GROUP BY
clause becomes mandatory.
SELECT ... FROM ... GROUP BY <grouping-item> [, <grouping-item> ...] [HAVING <grouped-row-condition>] ... <grouping-item> ::= <non-aggr-select-item> | <non-aggr-expression> <non-aggr-select-item> ::= column-copy | column-alias | column-position
Argument | Description |
---|---|
<grouping-item> |
Expression to group on;in the rest of this chapter, we use <value-expression> in |
non-aggr-expression |
Any non-aggregating expression that is not included in the |
column-copy |
A literal copy, from the |
column-alias |
The alias, from the |
column-position |
The position number, in the |
A general rule of thumb is that every non-aggregate item in the SELECT
list must also be in the GROUP BY
list.You can do this in three ways:
-
By copying the item verbatim from the select list, e.g. “
class
” or “'D:' || upper(doccode)
”. -
By specifying the column alias, if it exists.
-
By specifying the column position as an integer literal between 1 and the number of columns.Integer values resulting from expressions or parameter substitutions are simple constant values and not column position and will be used as such in the grouping.They will have no effect though, as their value is the same for each row.
Note
|
If you group by a column position, the expression at that position is copied internally from the select list.If it concerns a subquery, that subquery will be executed again in the grouping phase.That is to say, grouping by the column position, rather than duplicating the subquery expression in the grouping clause, saves keystrokes and bytes, but it is not a way of saving processing cycles! |
In addition to the required items, the grouping list may also contain:
-
Columns from the source table that are not in the select list, or non-aggregate expressions based on such columns.Adding such columns may further subdivide the groups.However, since these columns are not in the select list, you can’t tell which aggregated row corresponds to which value in the column.So, in general, if you are interested in this information, you also include the column or expression in the select list — which brings you back to the rule: “every non-aggregate column in the select list must also be in the grouping list”.
-
Expressions that aren’t dependent on the data in the underlying set, e.g. constants, context variables, single-value non-correlated subselects etc.This is only mentioned for completeness, as adding such items is utterly pointless: they don’t affect the grouping at all.“Harmless but useless” items like these may also figure in the select list without being copied to the grouping list.
Examples
When the select list contains only aggregate columns, GROUP BY
is not mandatory:
select count(*), avg(age) from students
where sex = 'M';
This will return a single row listing the number of male students and their average age.Adding expressions that don’t depend on values in individual rows of table STUDENTS
doesn’t change that:
select count(*), avg(age), current_date from students
where sex = 'M';
The row will now have an extra column showing the current date, but other than that, nothing fundamental has changed.A GROUP BY
clause is still not required.
However, in both the above examples it is allowed.This is perfectly valid:
select count(*), avg(age) from students
where sex = 'M'
group by class;
This will return a row for each class that has boys in it, listing the number of boys and their average age in that particular class.(If you also leave the current_date
field in, this value will be repeated on every row, which is not very exciting.)
The above query has a major drawback though: it gives you information about the different classes, but it doesn’t tell you which row applies to which class.To get that extra bit of information, add the non-aggregate column CLASS
to the select list:
select class, count(*), avg(age) from students
where sex = 'M'
group by class;
Now we have a useful query.Notice that the addition of column CLASS
also makes the GROUP BY
clause mandatory.We can’t drop that clause anymore, unless we also remove CLASS
from the column list.
The output of our last query may look something like this:
CLASS | COUNT | AVG |
---|---|---|
2A |
12 |
13.5 |
2B |
9 |
13.9 |
3A |
11 |
14.6 |
3B |
12 |
14.4 |
… |
… |
… |
The headings “COUNT” and “AVG” are not very informative.In a simple case like this, you might get away with that, but in general you should give aggregate columns a meaningful name by aliasing them:
select class,
count(*) as num_boys,
avg(age) as boys_avg_age
from students
where sex = 'M'
group by class;
Adding more non-aggregate (or, row-dependent) columns requires adding them to the GROUP BY
clause too.For instance, you might want to see the above information for girls as well;and you may also want to differentiate between boarding and day students:
select class,
sex,
boarding_type,
count(*) as number,
avg(age) as avg_age
from students
group by class, sex, boarding_type;
This may give you the following result:
CLASS | SEX | BOARDING_TYPE | NUMBER | AVG_AGE |
---|---|---|---|---|
2A |
F |
BOARDING |
9 |
13.3 |
2A |
F |
DAY |
6 |
13.5 |
2A |
M |
BOARDING |
7 |
13.6 |
2A |
M |
DAY |
5 |
13.4 |
2B |
F |
BOARDING |
11 |
13.7 |
2B |
F |
DAY |
5 |
13.7 |
2B |
M |
BOARDING |
6 |
13.8 |
… |
… |
… |
… |
… |
Each row in the result set corresponds to one particular combination of the columns CLASS
, SEX
and BOARDING_TYPE
.The aggregate results — number and average age — are given for each of these groups individually.In a query like this, you don’t see a total for boys as a whole, or day students as a whole.That’s the tradeoff: the more non-aggregate columns you add, the more you can pinpoint specific groups, but the more you also lose sight of the general picture.Of course, you can still obtain the “coarser” aggregates through separate queries.