Scenario
Sometimes, we need to pick some data as a sample to measure the overall situation randomly. For example, a factory produces multiple products, and each product needs to pick out a certain amount of samples to measure the overall pass rate. Moreover, in the selection process, it is also necessary to exclude the factor of human choice. So, how to use Power BI to realize this requirement?
Next, let us take a concrete look at different scenarios and needs.
Note:
In this blog, we only discuss how to sample, not the subsequent process of judging whether it is qualified or not.
Here is a sample table: Table1. And you can find others in the attached .xlsx file.
Product |
Product ID |
Production Date |
Metrics |
Product A |
1001 |
2022/3/12 |
1.37 |
Product A |
1002 |
2022/3/12 |
1.69 |
Product A |
1003 |
2022/3/13 |
2.23 |
Product A |
1004 |
2022/3/14 |
1.98 |
Product B |
2001 |
2022/3/12 |
2.01 |
Product B |
2002 |
2022/3/13 |
3.04 |
Product B |
2003 |
2022/3/13 |
2.65 |
Product B |
2004 |
2022/3/14 |
1.75 |
Product B |
2005 |
2022/3/14 |
2.23 |
Product C |
3001 |
2022/3/13 |
4.11 |
Product C |
3002 |
2022/3/13 |
3.51 |
Product C |
3003 |
2022/3/13 |
3.66 |
Product C |
3004 |
2022/3/14 |
2.35 |
Product C |
3005 |
2022/3/15 |
4.01 |
Product C |
3006 |
2022/3/16 |
2.98 |
… |
… |
… |
… |
Part1: Randomly select a few rows of records from the entire table
In Power BI, there is a DAX function named SAMPLE. It will return a sample of N rows from the specified table. And if OrderBy arguments are provided, the sample will be stable and deterministic, returning the first row, the last row, and evenly distributed rows between them. If no ordering is specified, the sample will be random, neither stable nor deterministic.
Here are examples:
EX1: Return a calculated table in which 3 samples are randomly selected from Table1.
Table1_Sample_1 =
SAMPLE (
3, /* The number of rows to return. */
Table1, /* Any DAX expression that returns a table of data from where to extract the ‘n’ sample rows. */
1 /* Since the minimum argument count for SAMPLE function is 3, put any scalar DAX expression. You can also specify a column. */
)
EX2: Return a calculated table in which 3 samples are randomly selected from Table1 and sorted by one specific column.
Table1_Sample_2 =
SAMPLE (
3,
Table1,
[Metrics] /* Sort by [Value] column. */
)
In addition, from my limited testing, I find that SAMPLE function will not update the result returned after refresh if data has not been updated. Here is the test result:
Then how do we get different results after each refresh?
Let’s consider using RAND function. It works the same as in Excel, returning a random number greater than or equal to 0 and less than 1, evenly distributed. The number that is returned changes each time the cell containing this function is recalculated.
There are two methods.
a. Combined with SUMMARIZE, TOPN and ADDCOLUMNS functions to create a calculated table.
Table1_RAND =
SUMMARIZE (
TOPN ( 3, ADDCOLUMNS ( Table1, “Rand”, RAND () ), [Rand] ),
[Product],
[Product ID],
[Production Date],
[Metrics]
)
b. Create a calculated column in the original table and just return the filtered values in Table visual.
Note: To avoid affecting the test results above, we copy the Table1 below, generate Table2, and then create a new computed column in Table2.
Rand = RAND()
We can see that after each refresh, the returned rows are different.
Now, we know how to get sample rows from one table. Then, how do we get the same number of rows for each category? Let’s go to the next part.
Part2: Randomly select records with the same number of rows by category from the entire table
a. Based on my test, class-by-class sampling cannot be achieved by SAMPLE function in this scenario. For example:
Table1_Sample_3 =
VAR ProductNum_ =
DISTINCTCOUNT ( Table1[Product] )
RETURN
SAMPLE ( ProductNum_ * 2, Table1, [Product] )
b. Then we use RANK function to create measures based on the Rand column in Part1 and then use “Filters on this visual” feature.
Rank_Table2 =
RANKX (
ALLEXCEPT ( Table2, Table2[Category] ),
CALCULATE ( SUM ( Table2[Rand] ) ),
,
DESC,
DENSE
)
So far, we have implemented sampling by category. Then, if some categories have fewer records and some categories have more records. Then how do we sample it? Let’s go to Part3.
Part3: Randomly select records with different rows by category from the entire table
It is, in fact, very simple that we only need to combine the SWITCH function and add a judgment condition and then use “Filters on this visual” feature. For example:
RankFilter_Table2 =
VAR Count_ =
CALCULATE ( COUNTROWS ( Table2 ), ALLEXCEPT ( Table2, Table2[Product] ) )
VAR Threshold_ =
SWITCH ( TRUE (), Count_ >= 8, 4, Count_ >= 6, 3, Count_ >= 4, 2, 1 )
RETURN
IF ( [Rank_Table2] <= Threshold_, 1 )
If you want to get more details, please check the attached .pbix file.
Hope this blog can help you.
Author: Icey Zhang
Reviewer: Kerry Wang & Ula Huang
Be the first to comment