Import vs DirectQuery in Power BI


Over the last few months, I have received a number of very similar questions at either one of our Live Training courses or from a student attending one of our Ask Anything Power BI Q&A sessions.  These questions all revolved around DirectQuery and were often specifically about performance issues or complexity of DAX calculations being written on the Power BI reports.  This prompted me to write this post, as it became abundantly clear that most students were opting for DirectQuery when they should have been selecting Import.  If you are not sure if DirectQuery is right for you, then read on, as we explore further.

But before we get started, in the words of Chris Webb (Principal Program Manager, Microsoft Power BI Customer Advisory Team):

The most important piece of advice around DirectQuery is to NOT USE it unless you’re really, really, really sure you have no other option. It’s possible to make it work well but it takes a lot more tuning and specialist skill than Import mode.

Chris Webb

What is DirectQuery?

When you connect to any data source with Power BI Desktop, you can import a local copy of the data into the Power BI Vertipaq database, but for some data sources, such as a SQL Server database, instead of choosing to import, you can opt to connect directly to the data source by using DirectQuery.  But just because you can, doesn’t necessarily mean you should.  More on that in a moment.

Connection Mode

There are essentially three different ways Power BI can fetch data from a data source.  Here are the main differences between the different connectivity modes:

Import

Import is the default option for acquiring data in Power BI, and for most scenarios is the most appropriate choice.  This option results in Power BI loading a copy of the data from the selected tables and columns of your data source.  As your report readers interact with visualizations on your report, Power BI queries the imported data stored in Power BI’s VertiPaq database.

Import Connection

Benefits: In memory database engine is Very Fast and Efficient.

Drawbacks: To see data changes that occurred after the import or most recent refresh, you must refresh the dataset and import the data again.

DirectQuery

As the name suggests, this option does not import data into Power BI, but instead maintains a direct connection with the original source.  It does however copy across the schema information (table, columns, relationships, etc), and that metadata is stored locally within Power BI to allow it to generate appropriate queries to analyse the data in the source.  With DirectQuery, when you create or interact with a visualization, Power BI sends a query to the underlying source.  This redirection of the query is completely transparent to the report user. The time that’s needed to refresh the visualization depends on the performance of the underlying data source. If the data needed to service the request was recently requested, Power BI can use the recent cached data to reduce the time required to update the visual, but you should assume that every interaction on your report is going to result in another set of queries being sent to your data source.

Directquery Connection

Benefits:  As you create or interact with visualizations, Power BI queries the underlying data source, so you’re always viewing current data

Drawbacks:  As you create or interact with visualizations, Power BI queries the underlying data source, so you’re constantly asking the data source for more information which can be inefficient and potentially expensive if your database is housed on a warehouse that has compute or query volume charges.  To help minimise these charges, make sure you are taking advantage of the Query reduction (1) and Apply all slicers button (2) on your DirectQuery reports so that you can control when the queries get sent to your source.

Query Reduction

Nota Bene : For a full listing of data sources that support DirectQuery, see Data sources supported by DirectQuery.

Live

There is actually one other way Power BI can connect to a data source, and that is via a Live connection. Live connections are only available on a few data sources, specifically, Analysis Services and published Power BI datasets.  When you connect to a SQL Server Analysis Services or Azure Analysis Services data source, you can choose to import the data or use a live connection to the selected data model.  When connecting to a Power BI dataset, there is no option to import, so these data sources are always Live connections.

Using a Live connection is similar to DirectQuery.  No data is imported, and the underlying data source is queried to refresh visuals.  However, Live connections differ from DirectQuery connections because Power BI doesn’t store any information locally.  In a DirectQuery connection, Power BI stores information about the data source schema to allow it to properly translate DAX queries into appropriate data source (SQL) queries, but the data itself remains in the data source.  In a Live connection, the source database is a VertiPaq database, so Power BI only needs to retain the connection string, as the data source model is already able to understand DAX queries.

Live Connection

Benefits: Live connection datasets use the VertiPaq database, so you can expect the performance to be fast and efficient, very similar to an Imported model, but as these reports do not store any data locally, they are an excellent solution for reporting on regulated or single-truth data sources.

Drawbacks: The dataset model is an external source, so you will not be able to make changes to the model structure.  You can still create new measures, but if you need to add additional data sources into the model, Power BI will require you convert your live connection into a DirectQuery, Composite Model.

When is it appropriate to use DirectQuery?

Before we discuss reasons for choosing DirectQuery mode, let me borrow some more sage advice from Chris Webb, who says

“Import mode should always be your default, and if you don’t know which one to use, use import mode”.

Chris Webb

But there are a few times when you will need to choose DirectQuery, so let’s take a peek at these next.

Real-time scenarios

Where you need to see the latest available data from the source, such as Traffic Congestion monitoring.

Note that for most reports, recent data is actually preferable to current data.  For example, when you are analysing sales performance for this week.  If you have an Import data set that is refreshed 8x during working hours, then the numbers you are working with will be static during a typical browsing session, even at detailed granularities.  Contrast that with the same situation using a DirectQuery dataset, where if a sale happened on the product you were analysing between one click of your report and the next, then the numbers will reflect the current state, so now any metrics connected to that sale will also change.  This can get very confusing if you are not familiar with real-time data.

Huge Datasets

Where you have many billions of rows of data (more than 10Gb) so you just can’t import it into Power BI, or the time it takes to load the data into Power BI is untenable and the data is obsolete by the time the refresh has completed.

In this case, I am talking about multiple billions of rows of data.  A well-structured dataset of 1-2 billion rows will under most circumstances still be better in Import than in DirectQuery.  Generally speaking the issue that drives people toward DirectQuery on large datasets is the time it takes to refresh and import the data into Power BI.  This however is usually a data preparation issue rather than a storage issue.  If you are having to perform significant amounts of transformation of your data as it loads, then the refresh will be slow.  Bad news is, if your data is requiring a lot of transformation using Power Query, then it is not suitable for a DirectQuery connection, and you need to explore a data preparation solution before you decide on Import vs DirectQuery.

If you’d like to find out more about preparing your data, we’ve blogged about Shaping and Modelling for Power BI best practice previously.

Regulatory compliance considerations

When the data must stay in the source for data security/privacy reasons

This one is perhaps the only one that I will accept without much push-back.  If your compliance authorities dictate your data must stay in the source, then it must stay in the source.

That being said, it is still worth considering why the regulator is insisting on single source storage.  Microsoft has worked with regulators globally for many years.  If the resistance to Import is due to authorization certification, then there is a good chance Microsoft already has certification for your regulator, and that means you can import your data to a Golden Dataset in the service, storing the copy securely on compliant Microsoft servers, and then working with a Live connection, thin Power BI workbook, to build and manage the reports.  This would be my preference for most individuals stating compliance reasons for why they chose a DirectQuery connection.

Any other dataset scenario

If your data does not fit into one of the specific categories mentioned above, then the chances are you should not be opting for DirectQuery.  One of the main reasons why, is because with a DirectQuery data set, you have two tools that need to be in-tune and working together, which in turn often means you need at least two subject matter experts working on the report.  We’ll assume for the moment that you are the Power BI expert who can handle the DAX side of the queries, but to get a DirectQuery report working well you will also need skills to tune the data source side of the queries, which generally means you will need to develop a strong working relationship with your Database Administrator (DBA).

More to Come Next Week

This is a big topic – too big for a single blog article.  This week I introduced the topic of data source connectivity modes, explaining the differences between Import, DirectQuery and Live data sources.  I also highlighted that for the vast majority of situations, DirectQuery will not be the most appropriate connection method.  Having identified when it is appropriate to use a DirectQuery connection, next week we will explore Data Modelling for DirectQuery.





Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*