Randomly filter with RAND or SAMPLE functions


Scenario

Sometimes, we need to pick some data as a sample to measure the overall situation randomly. For example, a factory produces multiple products, and each product needs to pick out a certain amount of samples to measure the overall pass rate. Moreover, in the selection process, it is also necessary to exclude the factor of human choice. So, how to use Power BI to realize this requirement?

Next, let us take a concrete look at different scenarios and needs.

Note:

In this blog, we only discuss how to sample, not the subsequent process of judging whether it is qualified or not.

Here is a sample table: Table1. And you can find others in the attached .xlsx file. 

Product

Product ID

Production Date

Metrics

Product A

1001

2022/3/12

1.37

Product A

1002

2022/3/12

1.69

Product A

1003

2022/3/13

2.23

Product A

1004

2022/3/14

1.98

Product B

2001

2022/3/12

2.01

Product B

2002

2022/3/13

3.04

Product B

2003

2022/3/13

2.65

Product B

2004

2022/3/14

1.75

Product B

2005

2022/3/14

2.23

Product C

3001

2022/3/13

4.11

Product C

3002

2022/3/13

3.51

Product C

3003

2022/3/13

3.66

Product C

3004

2022/3/14

2.35

Product C

3005

2022/3/15

4.01

Product C

3006

2022/3/16

2.98

 

Part1: Randomly select a few rows of records from the entire table

In Power BI, there is a DAX function named SAMPLE. It will return a sample of N rows from the specified table. And if OrderBy arguments are provided, the sample will be stable and deterministic, returning the first row, the last row, and evenly distributed rows between them. If no ordering is specified, the sample will be random, neither stable nor deterministic.

Here are examples:

EX1: Return a calculated table in which 3 samples are randomly selected from Table1.

Table1_Sample_1 =

SAMPLE (

    3,                  /* The number of rows to return. */

     Table1,         /* Any DAX expression that returns a table of data from where to extract the ‘n’ sample rows. */

     1                  /* Since the minimum argument count for SAMPLE function is 3, put any scalar DAX expression. You can also specify a column. */

     )

yingyinr_0-1647407908930.png

EX2: Return a calculated table in which 3 samples are randomly selected from Table1 and sorted by one specific column.

Table1_Sample_2 =

SAMPLE (

    3,

     Table1,

     [Metrics]         /* Sort by [Value] column. */

     )

yingyinr_1-1647407908935.png

In addition, from my limited testing, I find that SAMPLE function will not update the result returned after refresh if data has not been updated. Here is the test result:

yingyinr_3-1647408085579.gif

Then how do we get different results after each refresh?

Let’s consider using RAND function. It works the same as in Excel, returning a random number greater than or equal to 0 and less than 1, evenly distributed. The number that is returned changes each time the cell containing this function is recalculated.

There are two methods.

a. Combined with SUMMARIZE, TOPN and ADDCOLUMNS functions to create a calculated table.

Table1_RAND =

SUMMARIZE (

    TOPN ( 3, ADDCOLUMNS ( Table1, “Rand”, RAND () ), [Rand] ),

    [Product],

    [Product ID],

    [Production Date],

    [Metrics]

)

yingyinr_2-1647408363640.gif

b. Create a calculated column in the original table and just return the filtered values in Table visual.

Note: To avoid affecting the test results above, we copy the Table1 below, generate Table2, and then create a new computed column in Table2.

Rand = RAND()

yingyinr_3-1647408626593.gif

We can see that after each refresh, the returned rows are different.

 

Now, we know how to get sample rows from one table. Then, how do we get the same number of rows for each category? Let’s go to the next part.

 

Part2: Randomly select records with the same number of rows by category from the entire table

a. Based on my test, class-by-class sampling cannot be achieved by SAMPLE function in this scenario. For example:

Table1_Sample_3 =

VAR ProductNum_ =

    DISTINCTCOUNT ( Table1[Product] )

RETURN

    SAMPLE ( ProductNum_ * 2, Table1, [Product] )

yingyinr_4-1647408701580.pngb. Then we use RANK function to create measures based on the Rand column in Part1 and then use “Filters on this visual” feature.

Rank_Table2 =

RANKX (

    ALLEXCEPT ( Table2, Table2[Category] ),

    CALCULATE ( SUM ( Table2[Rand] ) ),

    ,

    DESC,

    DENSE

)

yingyinr_7-1647408793068.gif

So far, we have implemented sampling by category. Then, if some categories have fewer records and some categories have more records. Then how do we sample it? Let’s go to Part3.

 

Part3: Randomly select records with different rows by category from the entire table

 It is, in fact, very simple that we only need to combine the SWITCH function and add a judgment condition and then use “Filters on this visual” feature. For example:

RankFilter_Table2 =

VAR Count_ =

    CALCULATE ( COUNTROWS ( Table2 ), ALLEXCEPT ( Table2, Table2[Product] ) )

VAR Threshold_ =

    SWITCH ( TRUE (), Count_ >= 8, 4, Count_ >= 6, 3, Count_ >= 4, 2, 1 )

RETURN

    IF ( [Rank_Table2] <= Threshold_, 1 )

 

yingyinr_8-1647408887666.gif

 

If you want to get more details, please check the attached .pbix file.

Hope this blog can help you.

 

 

Author: Icey Zhang

Reviewer: Kerry Wang & Ula Huang



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*