sql - Ordering distinct column values by (first value of) other column in aggregate function

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I'm trying to order the output order of some distinct aggregated text based on the value of another column with something like:
string_agg(DISTINCT sometext, ' ' ORDER BY numval)
However, that results in the error:
  ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list
I do understand why this is, since the ordering would be "ill-defined" if the numval of two repeated values differs, with that of another lying in-between.
Ideally, I would like to order them by first appearance / lowest order-by value, but the ill-defined cases are actually rare enough in my data (it's mostly sequentially repeated values that I want to get rid of with the DISTINCT) that I ultimately don't particularly care about their ordering and would be happy with something like MySQL's GROUP_CONCAT(DISTINCT sometext ORDER BY numval SEPARATOR ' ') that simply works despite its sloppiness.
I expect some Postgres contortionism will be necessary, but I don't really know what the most efficient/concise way of going about this would be.
Building on DISTINCT ON
SELECT string_agg(sometext, ' ' ORDER BY numval) AS no_dupe
FROM  (
    SELECT DISTINCT ON (1,2) <whatever>, sometext, numval
    FROM   tbl
    ORDER  BY 1,2,3
    ) sub;
This is the simpler equivalent of @Gordon's query.

From your description alone I would have suggested @Clodoaldo's simpler variant.
uniq() for integer
For integer values instead of text, the additional module intarray has just the thing for you:
uniq(int[])     int[]   remove adjacent duplicates
Install it once per database with:
CREATE EXTENSION intarray;
Then the query is simply:
SELECT uniq(array_agg(some_int ORDER BY <whatever>, numval)) AS no_dupe
FROM   tbl;
Result is an array, wrap it in array_to_string() if you need a string.
Related:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
In fact, it wouldn't be hard to create a custom aggregate function to do the same with text ...
Custom aggregate function for any data type
Function that only adds next element to array if it is different from the previous. (NULL values are removed!):
CREATE OR REPLACE FUNCTION f_array_append_uniq (anyarray, anyelement)
  RETURNS anyarray
  LANGUAGE sql STRICT IMMUTABLE AS
'SELECT CASE WHEN $1[array_upper($1, 1)] = $2 THEN $1 ELSE $1 || $2 END';
Using polymorphic types to make it work for any scalar data-type.
Custom aggregate function:
CREATE AGGREGATE array_agg_uniq(anyelement) (
   SFUNC = f_array_append_uniq
 , STYPE = anyarray
 , INITCOND = '{}'
Call:
SELECT array_to_string(
          array_agg_uniq(sometext ORDER BY <whatever>, numval)
        , ' ') AS no_dupe
FROM   tbl;
Note that the aggregate is PARALLEL UNSAFE (default) by nature, even though the transition function could be marked PARALLEL SAFE.
Related answer:
Custom PostgreSQL aggregate for circular average
                I couldn't get this to work with Postgres 13 (PostgreSQL 13.0 (Debian 13.0-1.pgdg100+1)); it just returns an empty array for all inputs. Ex: select b, array_agg_uniq(a+c order by a+c) from generate_series(1, 10) a, generate_series(2, 5) b, generate_series(2, 5) c group by b; although I actually intend to use it for strings. I think the problem is that array_upper('{}', 1) is null, which propagates through the $1[idx]. SELECT CASE WHEN array_length($1, 1) <> 0 THEN CASE WHEN $1[array_upper($1, 1)] <> $2 THEN $1 || $2 ELSE $1 END ELSE ARRAY[$2] END works, but doesn't seem great?
– Shabble
                Mar 30, 2021 at 18:30
                I think I got it to work with: CREATE OR REPLACE FUNCTION f_array_append_uniq (anyarray, anyelement) RETURNS anyarray AS $func$ SELECT CASE WHEN $1[array_upper($1, 1)] = $2 THEN $1 ELSE $1 || $2 END $func$ LANGUAGE sql STRICT IMMUTABLE; (swapping clauses of conditional, and adding strict to ignore null inputs.
– Shabble
                Mar 30, 2021 at 18:49
Eliminate the need to do a distinct by pre aggregating
select string_agg(sometext, ' ' order by numval)
from (
    select sometext, min(numval) as numval
    from t
    group by sometext
@Gordon's answer brought a good point. That is if there are other needed columns. In this case a distinct on is recommended 
select x, string_agg(sometext, ' ' order by numval)
from (
    select distinct on (sometext) *
    from t
    order by sometext, numval
group by x
                I'm afraid this doesn't really do what one would expect if sometext and numval together are not unique across the ungrouped rows, which is the case for me.
– Dologan
                Aug 7, 2014 at 13:00
What I've ended up doing is to avoid using DISTINCT altogether and instead opted to use regular expression substitution to remove sequentially repeated entries (which was my main goal) as follows:
regexp_replace(string_agg(sometext, ' ' ORDER BY numval), 
               '(\y\w+\y)(?:\s+\1)+', '\1', 'g')
This doesn't remove repeats if the external ordering leads to another entry coming in between them, but this works for me, probably even better. It may be a bit slower than other options, but I find it speedy enough for my purposes.
                As it turns out, PostgreSQL versions earlier than 9.3.5 down to 9.2 at least have a memory leak on regular expressions including back-references (such as mine), and can gobble up a lot of memory for the lifetime of the session. This has recently been fixed
– Dologan
                Aug 7, 2014 at 14:26
If this is part of a larger expression, it might be inconvenient to do a select distinct in a subquery.  In this case, you can take advantage of the fact that string_agg() ignores NULL input values and do something like:
select string_agg( (case when seqnum = 1 then sometext end) order by numval)
from (select sometext, row_number() over (partition by <whatever>, sometext order by numval) as seqnum
      from t
group by <whatever>
The subquery adds a column but does not require aggregating the data.
                This would work, except that the partition by needs to include the original grouping columns (i.e. the <whatever>) before sometext for it to work as intended; otherwise every instance of sometext but the first one gets lost on the aggregate text.
– Dologan
                Aug 7, 2014 at 13:13
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Collectives™ on Stack Overflow

Building on `DISTINCT ON`

`uniq()` for integer

Custom aggregate function for any data type