Power Query is a very powerful tool to scrape information from public webpages. It can be as simple as entering the URL and selecting the desired table. But what if the information needed is from the same table but across multiple web pages? In this blog, I will discuss how to import those tables as well as how to be able to refresh the resulting dataset in Power BI Service.
We will be scraping data from Philippine Stock Exchange site, specificially the stock data page. Let us explore this URL in Power Query: https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30
Open Power BI Desktop, click Home Tab, select Get Data and then Web. Enter the above URL
Once hitting OK, we’ll be presented with several tables. Tick the checkbox for Table 6 and then click Transform.
We’ll want anything that is blank to be null so right click on a blank cell, click Replace Values and and enter null into Replace With field. Click OK.
Blank spaces should have been relaced with null (in italics).
Next, transform the table so the first column becomes the header while the second column contains the records. Go to Transform tab and select Transpose. Select Use First Row as Headers after transposing.
After promoting the headers, Power Query automatically detects the data types. We can modify this step but we can just leave that as is. We will not have additional transformations after this. Now, let’s head back to our URL: https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30. Notice the number 30 which is actually the webpage number. We will be using this to dynamically get data from other webpages.
Create a blank query and call it StockData. Enter ={1..30} in the formula bar of Source step. This will create a list of numbers from 1 to 30. There are about 300 companies listed in the website but we’ll just use 30.
Go to Transform tab and select To Table and click OK. This will convert the list to a table. Rename Column1 to Page Number and change the data type to whole number.
Head back to Table 6 query, go to Advanced Editor and copy everything.
Go back to StockData. In Add Column tab, select Custom Column. Name this as Table. Paste the copied code in custom column formula. We will modify this to dynamically import Table 6 from web pages 1 to 30.
Change “https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30″ to “https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=” & Text.From([Page Number]). Notice that I have enclosed [Page Number] in Text.From. We need to convert this to text so we can combine it with another text string which is the other part of the URL.
The custom column will generate a column of tables containing the needed data. Some of the rows will return an error. This is expected as those erring pages do not have Table 6. We’ll modify the formula to take the error into account. Change the last line from #”Changed Type1 to try #”Changed Type1″ otherwise null
We can then expand this column of tables and apply additional transformation as needed.
Please note that the formula we used in creating the custom column does not allow us to refresh this report in Power BI servce.
To check, go to Home then Data source settings. We should be seeing this warning from the dialogue box.
Why is this so? “The problem is that when a published dataset is refreshed, Power BI does some static analysis on the code to determine what the data sources for the dataset are and whether the supplied credentials are correct. “ -Chris Webb writes in his blog post Web.Contents(), M Functions And Dataset Refresh Errors In Power BI
We will modify the Web.BrowserContents portion of the formula so we can refresh this report in the service. Head over to the Advanced Editor and replace
Source = Web.BrowserContents("https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=" & Text.From([Page Number])),
with
Source = Web.Contents("https://edge.pse.com.ph", [RelativePath =
"companyPage/stockData.do", Query = [cmpy_id = Text.From([Page Number])]] ),
Go to Data source settings to check if this query can now be refreshed in Power BI Service. The warning caused by the previous formula should have now disappeared.
In this community blog post:
- you have learned how to scrape data from multiple webpages without creating a query for each of them
- adjust the web connection formula so it can be refreshed in Power BI Service.
Attached is a sample pbix for your reference. A live report using a similar technique can also be found here – https://app.powerbi.com/view?r=eyJrIjoiZjljYmViZTQtMGI1NC00ZmI3LTlkNTUtYTIxYmIxMWVlMjk0IiwidCI6ImRmO…
Be the first to comment