Select Distinct Rows Ordered by Another Column — …


Consider a table like this:

 ID   Year   Value   Tag 
1 2016 250 C
1 2019 190 B
2 2017 335 D
2 2018 155 A
2 2018 155 A
3 2021 230 B
3 2000 105 D
4 2019 160 C

 

Here’s the M query in case you’d like to follow along in your own query editor:

 

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDQDUaYGQNJZKVYHLmwJpAwtQcJOYGEjiLA5kDI2NgWSLsjCFiDVpiBhR3zCxmBhI7AVxgizIcIGIAFDA4TZJkguMYM6MBYA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Year = _t, Value = _t, Tag = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Year", Int64.Type}, {"Value", type number}, {"Tag", type text}})
in
    #"Changed Type"

 

 

The goal is to select only the rows that correspond to the most recent year, one for each distinct ID.

 ID   Year   Value   Tag 
1 2019 190 B
2 2018 155 A
3 2021 230 B
4 2019 160 C

 

Approach #1: Create a new column then filter

If you’re in the mindset of DAX, it might occur to you to create a calculated column [IsMaxYear] to check, for each row, whether or not the current row matches the max year for the ID in the current row and then filter for [IsMaxYear] = TRUE.

 

We can indeed do this by adding a custom column and filter steps like this:

 

 

= Table.AddColumn(
    #"Changed Type",
    "IsMaxYear",
    (r) => List.Max(
               Table.SelectRows(#"Changed Type", each [ID] = r[ID])[Year]
           ) = r[Year],
    type logical
  ),
= Table.SelectRows(#"Added Custom", each ([IsMaxYear] = true))

 

 

Note: I’m choosing to omit a detailed explanation of how this function works since it takes us off-topic and isn’t a good solution anyway.

 

There are a couple of major drawbacks to this approach. First, it requires more than a surface-level knowledge of M to fully understand. Second, it’s slow and inefficient for large data sets since for every single row, it’s doing operations on the entire table from the previous step.

 

Approach #2: Sort and remove duplicates

If you were following along in your own query editor for the prior approach, you might have noticed that it did not eliminate duplicates where ties were involved (i.e. ID = 2 still had two identical rows even after filtering). So why not start by removing duplicates since we might need to use it anyway?

 

Using the GUI Home tab > Remove Rows > Remove Duplicates on the ID column allows us to select only the first appearance of each ID. Since the first one isn’t necessarily the one we want, we need to sort by ID and Year first to get this to work.

 

We can attempt this as follows:

 

 

#"Sorted Rows" = Table.Sort(#"Changed Type",{{"ID", Order.Ascending}, {"Year", Order.Descending}}),
#"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"ID"})

 

 

However, this does not always work. For technical reasons (see here and here for a bit more discussion), Table.Distinct doesn’t actually apply the sort to the table internally unless we force it to. Luckily, we can use Table.Buffer to force it to save the table to memory after sorting so that it does work.

 

We can either add Table.Buffer as a new intermediate step or wrap it around the definition of or reference to #”Sorted Rows”. I’m arbitrarily picking the last option here:

 

 

#"Sorted Rows" = Table.Sort(#"Changed Type",{{"ID", Order.Ascending}, {"Year", Order.Descending}}),
#"Removed Duplicates" = Table.Distinct(Table.Buffer(#"Sorted Rows"), {"ID"})

 

 

This is a decent method. However, the drawback is that it requires Table.Buffer, which means we have to load the whole table into memory and break query folding that would occur if we were querying from another source like a SQL Server Database.

 

Approach #3: Custom Group By

The basis for this approach starts with a step generated by Home > Group By.

AlexisOlson_0-1634580083804.png

Note: I discovered this method when someone linked me to this article.

 

The corresponding M code looks like this:

 

 

= Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}, {"Tbl", each _, type table}})

 

 

Each of the tables in the Tbl column is the subtable corresponding to the ID in that row. For example, the preview for the second row looks like this:

AlexisOlson_1-1634580282148.png

This is a powerful approach that leads to a variety of possibilities for working with the Tbl column.

 

We can modify last part of the GUI-generated code for the Tbl column to only take the top row for each table ordered by Year, in which case, the result is a Record, rather than a table:

 

 

{"Tbl", each Table.Max(_, "Year"), type record}

 

 

AlexisOlson_2-1634580622490.png

 

From here, we can click the expand button in the upper right and choose the ones we want.

 

 

= Table.ExpandRecordColumn(#"Grouped Rows", "Tbl", {"Value", "Tag"}, {"Value", "Tag"})

 

 

 

In the special case where you only have a single column to expand, then you can combine these two steps:

 

 

#"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type number}, {"Tbl", each Table.Max(_, "Year"), type record}}),
#"Expanded Tbl" = Table.ExpandTableColumn(#"Grouped Rows", "Tbl", {"Value"}, {"Value"})

 

 

into a single step by specifying which column to return in the grouping step (by appending the column name after Table.Max):

 

 

Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type number}, {"Value", each Table.Max(_, "Year")[Value], type number}})

 

 

This method is also decent but, like the prior method, also breaks query folding, which can be a deal-breaker if you’re loading large tables.

 

Approach #4: Group by and merge with an earlier step

If you work with SQL frequently, this might be what you immediately think of. As with the previous post, we start with a standard Group By on ID, taking the maximum over Year (and nothing else).

AlexisOlson_3-1634581686602.png

This time, instead of tinkering with a column of tables, we’ll do a self-merge. Click Home > Merge Queries > Merge Queries, hold down Ctrl or Shift to select both columns, and choose to merge with the current table (both columns again).

 

This will generate a step with M code like this:

 

 

= Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Grouped Rows", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter)

 

 

What we actually want though is to merge with the step from before the grouping, so we modify the second table reference to

 

 

= Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Changed Type", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter)

 

 

Note: This self-merge (self-join) and modify the step reference is handy for all sorts of queries where you want a common table expression without defining a separate query.

 

This is similar to the situation in the previous approach except that we haven’t broken query folding (yet) and duplicate rows have not (yet) been resolved:

AlexisOlson_4-1634582217763.png

If (unlike the above) there are no ties (duplicates) anywhere, then all that remains is to expand the columns we want.

 

 

= Table.ExpandTableColumn(#"Merged Queries", "Grouped Rows", {"Value", "Tag"}, {"Value", "Tag"})

 

 

If there are duplicates, we can use Remove Rows > Remove Duplicates on the first two columns as a final step. This does break query folding, unlike Remove Duplicates performed on all of the columns. This might not be a problem since we’re working with a smaller result than we started with.

 

We can preserve query folding using Group By again, grouping on ID and Year, and aggregating the remaining columns as we wish (I used Max).

 

 

= Table.Group(#"Expanded Grouped Rows", {"ID", "Year"}, {{"Value", each List.Max([Value]), type number}, {"Tag", each List.Max([Tag]), type text}})

 

 

 

This last approach, while not the simplest, does work better than the previous ones when loading a large table from a SQL Server since it can be folded into a single SQL statement that is evaluated on the server rather than within the query editor.

 

Right-clicking on the last step and choosing View Native Query, you’ll see how it’s interpreted in SQL (assuming, of course, you connected to a SQL source rather than using my example data):

 

 

select [rows].[ID2] as [ID],
    [rows].[Year] as [Year],
    max([rows].[Amount]) as [Amount],
    max([rows].[Tag]) as [Tag]
from 
(
    select [$Outer].[ID2],
        [$Outer].[Year],
        [$Inner].[Amount],
        [$Inner].[Tag]
    from 
    (
        select [rows].[ID] as [ID2],
            max([rows].[Year]) as [Year]
        from [dbo].[TableName] as [rows]
        group by [ID]
    ) as [$Outer]
    left outer join [dbo].[TableName] as [$Inner] on (([$Outer].[ID2] = [$Inner].[ID] or [$Outer].[ID2] is null and [$Inner].[ID] is null) and [$Outer].[Year] = [$Inner].[Year])
) as [rows]
group by [ID2], [Year]

 

 

 

Here’s the full M code for this last version:

 

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUTIyMDQDUaYGQNJZKVYHLmwJpAwtQcJOYGEjiLA5kDI2NgWSLsjCFiDVpiBhR3zCxmBhI7AVxgizIcIGIAFDA4TZJkguMYM6MBYA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Year = _t, Value = _t, Tag = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Year", Int64.Type}, {"Value", type number}, {"Tag", type text}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"Year", each List.Max([Year]), type nullable number}}),
    #"Merged Queries" = Table.NestedJoin(#"Grouped Rows", {"ID", "Year"}, #"Changed Type", {"ID", "Year"}, "Grouped Rows", JoinKind.LeftOuter),
    #"Expanded Grouped Rows" = Table.ExpandTableColumn(#"Merged Queries", "Grouped Rows", {"Value", "Tag"}, {"Value", "Tag"}),
    #"Grouped Rows1" = Table.Group(#"Expanded Grouped Rows", {"ID", "Year"}, {{"Value", each List.Max([Value]), type nullable number}, {"Tag", each List.Max([Tag]), type nullable text}})
in
    #"Grouped Rows1"

 

 

 

Please let me know of other sources and methods you’ve seen or written to solve this problem. I may expand this post if I come across other interesting methods.

 

References:

https://stackoverflow.com/questions/51437962

https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/

https://www.ehansalytics.com/blog/2020/7/16/return-row-based-on-max-value-from-one-column-when-group…

 

Related Posts:

Selectively remove duplicate rows?

Remove duplicates – keeping the most recent row

Removing duplicates, keeping rows where a field holds a certain value

Remove duplicates by prioritizing rows based on another column

How to return the most recent value of a second column based on a first column

Power Query Editor – Remove Duplicates (selective column)



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*