Randomly filter with RAND or SAMPLE functions

Scenario

Sometimes, we need to pick some data as a sample to measure the overall situation randomly. For example, a factory produces multiple products, and each product needs to pick out a certain amount of samples to measure the overall pass rate. Moreover, in the selection process, it is also necessary to exclude the factor of human choice. So, how to use Power BI to realize this requirement?

Next, let us take a concrete look at different scenarios and needs.

Note:

In this blog, we only discuss how to sample, not the subsequent process of judging whether it is qualified or not.

Here is a sample table: Table1. And you can find others in the attached .xlsx file.

Product	Product ID	Production Date	Metrics
Product A	1001	2022/3/12	1.37
Product A	1002	2022/3/12	1.69
Product A	1003	2022/3/13	2.23
Product A	1004	2022/3/14	1.98
Product B	2001	2022/3/12	2.01
Product B	2002	2022/3/13	3.04
Product B	2003	2022/3/13	2.65
Product B	2004	2022/3/14	1.75
Product B	2005	2022/3/14	2.23
Product C	3001	2022/3/13	4.11
Product C	3002	2022/3/13	3.51
Product C	3003	2022/3/13	3.66
Product C	3004	2022/3/14	2.35
Product C	3005	2022/3/15	4.01
Product C	3006	2022/3/16	2.98
…	…	…	…

Part1: Randomly select a few rows of records from the entire table

In Power BI, there is a DAX function named SAMPLE. It will return a sample of N rows from the specified table. And if OrderBy arguments are provided, the sample will be stable and deterministic, returning the first row, the last row, and evenly distributed rows between them. If no ordering is specified, the sample will be random, neither stable nor deterministic.

Here are examples:

EX1: Return a calculated table in which 3 samples are randomly selected from Table1.

Table1_Sample_1 =

SAMPLE (

3, /* The number of rows to return. */

Table1, /* Any DAX expression that returns a table of data from where to extract the ‘n’ sample rows. */

1 /* Since the minimum argument count for SAMPLE function is 3, put any scalar DAX expression. You can also specify a column. */

)

EX2: Return a calculated table in which 3 samples are randomly selected from Table1 and sorted by one specific column.

Table1_Sample_2 =

SAMPLE (

Table1,

[Metrics] /* Sort by [Value] column. */

)

In addition, from my limited testing, I find that SAMPLE function will not update the result returned after refresh if data has not been updated. Here is the test result:

Then how do we get different results after each refresh?

Let’s consider using RAND function. It works the same as in Excel, returning a random number greater than or equal to 0 and less than 1, evenly distributed. The number that is returned changes each time the cell containing this function is recalculated.

There are two methods.

a. Combined with SUMMARIZE, TOPN and ADDCOLUMNS functions to create a calculated table.

Table1_RAND =

SUMMARIZE (

TOPN ( 3, ADDCOLUMNS ( Table1, “Rand”, RAND () ), [Rand] ),

[Product],

[Product ID],

[Production Date],

[Metrics]

)

b. Create a calculated column in the original table and just return the filtered values in Table visual.

Note: To avoid affecting the test results above, we copy the Table1 below, generate Table2, and then create a new computed column in Table2.

Rand = RAND()

We can see that after each refresh, the returned rows are different.

Now, we know how to get sample rows from one table. Then, how do we get the same number of rows for each category? Let’s go to the next part.

Part2: Randomly select records with the same number of rows by category from the entire table

a. Based on my test, class-by-class sampling cannot be achieved by SAMPLE function in this scenario. For example:

Table1_Sample_3 =

VAR ProductNum_ =

DISTINCTCOUNT ( Table1[Product] )

RETURN

SAMPLE ( ProductNum_ * 2, Table1, [Product] )

b. Then we use RANK function to create measures based on the Rand column in Part1 and then use “Filters on this visual” feature.

Rank_Table2 =

RANKX (

ALLEXCEPT ( Table2, Table2[Category] ),

CALCULATE ( SUM ( Table2[Rand] ) ),

DESC,

DENSE

)

So far, we have implemented sampling by category. Then, if some categories have fewer records and some categories have more records. Then how do we sample it? Let’s go to Part3.

Part3: Randomly select records with different rows by category from the entire table

It is, in fact, very simple that we only need to combine the SWITCH function and add a judgment condition and then use “Filters on this visual” feature. For example:

RankFilter_Table2 =

VAR Count_ =

CALCULATE ( COUNTROWS ( Table2 ), ALLEXCEPT ( Table2, Table2[Product] ) )

VAR Threshold_ =

SWITCH ( TRUE (), Count_ >= 8, 4, Count_ >= 6, 3, Count_ >= 4, 2, 1 )

RETURN

IF ( [Rank_Table2] <= Threshold_, 1 )

If you want to get more details, please check the attached .pbix file.

Hope this blog can help you.

Author: Icey Zhang

Reviewer: Kerry Wang & Ula Huang

Source link

Randomly filter with RAND or SAMPLE functions

Be the first to comment

Leave a Reply Cancel reply