It’s not the best practice when some structured data is stored as split by delimiter text in one column. But real life does not always equal ideal life, and I see such data sources quite often.
I’ve Python-ed a random 500k-lines csv-file (see attached) with the only column that has a structure like this:
GRP441871;Category667057;xtxtkjpdtwiazuszitjgabrz;AddInfo
GRP601591;Category807428;mweemtqqdsdinhwpfqrubqzbowju;AddInfo
GRP330108;Category994;dyitjeffwvhwzlpksjjo;AddInfo
Let’s have a look at a very common task – to get a value from a certain position in each row (and from each cell, to be more correct).
First, for example, we need to extract the third value from each row:
- xtxtkjpdtwiazuszitjgabrz
- mweemtqqdsdinhwpfqrubqzbowju
- dyitjeffwvhwzlpksjjo
The first choice to solve it for any Power BI specialists is to split a column by delimiter in Power Query. Unfortunately, this option is not always available or comfortable (e.g. due to variable number of delimited columns or DirectQuery usage).
Let’s try to do it with DAX.
Note: For further analysis and methods comparison, I will use a Measure.
We can execute a simple parse: find substring between the second and the third delimiters, like this:
Third Value Parsing Measure =
var _val = MAX([Column1])
var _firstDelimiter = SEARCH(";", _val)
var _secondDelimiter = SEARCH(";", _val, _firstDelimiter + 1)
var _thirdDelimiter = SEARCH(";", _val, _secondDelimiter + 1)
RETURN
MID(_val, _secondDelimiter + 1, _thirdDelimiter - _secondDelimiter - 1)
Or, a much more easy-looking Measure with PATHITEM:
Third Value PATHITEM Measure =
var _val = MAX([Column1])
RETURN
PATHITEM(SUBSTITUTE(_val, ";", "|"), 3)
OK. PATHITEM looks pretty. What about performance?
I created 2 different visuals with measures that created above
Now, it’s time to check Performance Analyzer. Start recording, refresh visuals, voilà!
PATHITEM is not only easy-looking, it’s also an easy-performed operation. 10% effective for DAX query execution – not bad at all.
Conclusion
We can draw the conclusion that the PATHITEM trick should be the first option over a string. But as data pros we have to have a critical mindset. What if it is not the greatness of PATHITEM but just a lack of parsing method?
Conclusion
Let’s create a new test.
I will try to extract the very first value for a structured column:
First Value Parsing Measure =
var _val = MAX([Column1])
var _firstDelimiter = SEARCH(";", _val)
RETURN
LEFT(_val, _firstDelimiter - 1)
And the same as before for PATHITEM
First Value PATHITEM Measure =
var _val = MAX([Column1])
RETURN
PATHITEM(SUBSTITUTE(_val, ";", "|"), 1)
OK. See the Performance Analyzer:
Oops. PATHITEM takes twice as long as the only LEFT parsing method!
Final conclusion
PATHITEM is a good alternative solution that can be used instead of parsing statements that are too complex.
But as with many other programming algorithms, the choice of the appropriate method is always a compromise between coding, debugging and memory consumption.
Be the first to comment