How to Scrape Multiple Webpages


Power Query is a very powerful tool to scrape information from public webpages. It can be as simple as entering the URL and selecting the desired table. But what if the information needed is from the same table but across multiple web pages? In this blog, I will discuss how to import those tables as well as how to be able to refresh the resulting dataset in Power BI Service.

 

We will be scraping data from Philippine Stock Exchange site, specificially the stock data page.  Let us explore this URL in Power Query: https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30

Open Power BI Desktop, click Home Tab, select Get Data and then Web. Enter the above URL 

danextian_0-1691382997795.png

danextian_1-1691383098882.png

 

Once hitting OK, we’ll be presented with several tables. Tick  the checkbox for Table 6 and then click Transform

danextian_3-1691383480879.png

 

We’ll want anything that is blank to be null so right click on a blank cell, click Replace Values and and enter null into Replace With field. Click OK.

danextian_4-1691383650315.png

danextian_5-1691383717683.png

Blank spaces should have been relaced with null (in italics). 

danextian_6-1691383775071.png

Next, transform the table so the first column becomes the header while the second column contains the records. Go to Transform tab and select Transpose. Select Use First Row as Headers after transposing. 

danextian_8-1691384026817.png

After click Transpose and Use First Row as HeadersAfter click Transpose and Use First Row as Headers

After promoting the headers, Power Query automatically detects the data types. We can modify this step but we can just leave that as is. We will not have additional transformations after this. Now, let’s head back to our URL:   https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30. Notice the number 30 which is actually the webpage number. We will be using this to dynamically get data from other webpages. 

 

Create a blank query and call it StockData. Enter ={1..30} in the formula bar of Source step. This will create a list of numbers from 1 to 30.  There are about 300 companies listed in the website but we’ll just use 30. 

danextian_10-1691384619957.png

 

 

 

 

 

danextian_12-1691385102099.pngGo to Transform tab  and select To Table and click OK. This will convert the list to  a table. Rename Column1 to Page Number and change the data type to whole number. 

 

Head back to Table 6 query, go to Advanced Editor and copy everything. 

danextian_13-1691385217805.png

Go back to StockData. In Add Column tab, select Custom Column. Name this as Table. Paste the copied code in custom column formula. We will modify this to dynamically import Table 6 from web pages 1 to 30.

Change https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=30″  tohttps://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=” & Text.From([Page Number]). Notice that I have enclosed [Page Number] in Text.From. We need to convert this to text so we can combine it with another text string which is the other part of the URL.

The custom column will generate a column of tables containing the needed data. Some of the rows will return an error. This is expected as those erring pages do not have Table 6. We’ll modify the formula to take the error into account. Change the last line from #”Changed Type1 to try #”Changed Type1″ otherwise null 

danextian_15-1691386997653.png

 

We can then expand this column of tables and apply additional transformation as needed. 

danextian_17-1691387144203.png

 

danextian_16-1691387138317.png

Please note that the formula we used in creating the custom column does not allow us to refresh this report in Power BI servce.

To check, go to Home then Data source settings. We should be seeing this warning from the dialogue box.

 

danextian_18-1691387298907.png

Why is this so? “The problem is that when a published dataset is refreshed, Power BI does some static analysis on the code to determine what the data sources for the dataset are and whether the supplied credentials are correct. “ -Chris Webb writes in his blog post Web.Contents(), M Functions And Dataset Refresh Errors In Power BI


We will modify the Web.BrowserContents portion of the formula so we can refresh this report in the service. Head over to the Advanced Editor and replace

 

 

 

Source = Web.BrowserContents("https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=" & Text.From([Page Number])),

 

 

with

 

 

Source = Web.Contents("https://edge.pse.com.ph", [RelativePath = 
    "companyPage/stockData.do", Query = [cmpy_id =  Text.From([Page Number])]] ),

 

 

 

Go to Data source settings to check if this query can now be refreshed in Power BI Service. The warning caused by the previous formula should have now disappeared. 

danextian_0-1691414159917.png

 

In this community blog post:

  • you have learned how to scrape data from multiple webpages without creating a query for each of them
  • adjust the web connection formula so it can be refreshed in Power BI Service.

Attached is a sample pbix for your reference. A live report using a similar technique can also be found here – https://app.powerbi.com/view?r=eyJrIjoiZjljYmViZTQtMGI1NC00ZmI3LTlkNTUtYTIxYmIxMWVlMjk0IiwidCI6ImRmO…

power query multiple web pages.pbix



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*