How to Scrape Multiple Webpages that Don’t have a …


In my other community blog post, How to Scrape Multiple Webpages,  I discussed how to connect to a table from a webpage and use the M code generated by the web connector’s dialogue box to get the same table but from across multiple web pages.  But how about if the web connector cannot detect  a table or cannot generate a table with the information needed using Add Table Using Examples feature? A good example are image URLs. To do that, we’ll need to inspect the underlying HTML code.

 

Head over to https://edge.pse.com.ph/companyPage/stockData.do?cmpy_id=15.  This is a company information page of a company called ABS-CBN, a Philippine media company, in the Philippine Stock Exchange company listing website. Notice the image/logo of  the company.  Our goal is to be able to get the URL of that image.

danextian_0-1691471905055.png

 

Right-click on the image, click inspect/inspect element (the options will depend on the browser you’re using I’m using Chrome) and then we will try to find a pattern.

danextian_1-1691473237986.png

 

The URL starts with /clogo. This could be a pattern but let’s double check.  Press Ctrl + F on your keyboard and type /clogo to check if there are  more than one /clogo words. It appears there’s only one so we will stick to this.

danextian_0-1691643610260.png

 

Let’s go back to the query editor. Duplicate the StockData query from the attached sample pbix in my previous blog post and rename it to CompanyLogo. Delete all applied steps from and after Added Custom. Create a custom column called HTML using this formula. Web.BrowserContents allows as to access a webpage as HTML text but this is not our final formula as we want to be able to refresh the resulting dataset in Power BI Service.

 

 

= Web.BrowserContents("https://edge.pse.com.ph/companyInformation/form.do?cmpy_id=" & Text.From([Page Number])

 

 

 

This how the query should look like after adding that column:

danextian_4-1691645596851.png

 

Now let’s add another column to extract the logo URL. Highlight HTML column, go to Add Column tab, click Extract and then Text Between Delimiters. Use /clogo as start and delimiter and as end.  The query should now look like below:

danextian_5-1691646328042.png

 

Notice the rows with a blank result. Let’s go to the website and check if indeed the related pages do not have a logo. It appears they really don’t so we’ll leave it at that.

danextian_6-1691646473750.png

danextian_7-1691646504947.png

 

Now, complete the URL. Going back to the inspect element part of this blog, the complete path to the logo starts with the website’s base URL. Replace all empty cells with null. Add a custom column called Logo URL  with the following formula.

 

 

= if [Text Between Delimiters]  <> null then "https://edge.pse.com.ph/clogo" & [Text Between Delimiters] else null

 

 

 

After removing custom columns HTML and Text Between Delimiters, the query should now look like this:

danextian_0-1691666511422.png

 

As mentioned earlier in the post, the use of Web.BrowserContents is only temporary. Chris Webb mentions in his blog that “…if you dynamically generate a url for use with Web.Contents() and do not use the RelativePath and/or Query options you will not be able refresh your dataset because the Power BI Service will not be able to analyse your code to discover what urls are present in it.”

While it may be a different function, the same still applies. We made use of dynamic URLs and Web.BrowserContents doesn’t have RelativePath parameter so using it will inhibit us from refresh  out dataset in Powr BI service – until, at least, the service engine updates to accomodate such case.

 

However, do note that while Web.Contents has RelativePath parameter, unlike Web.BrowserContents, it doesn’t return the underlying html code right from the get go but, instead, binary. To demonstrate, we will change the function used in our HTML custom column.

danextian_0-1691669060705.png

 

In order to return the URL as HTML code, we need to use Lines.FromBinary to return a list which every item is a line of html code  and Text.Combine with line feed “#(lf)” as the delimiter to combine each item in the list into a single text.  Change the custom column formula to the following.

 

 

let 
html = Web.Contents ( "https://edge.pse.com.ph", [RelativePath = 
    "companyInformation/form.do", Query = [cmpy_id =  Text.From ( [Page Number] ) ] ] )
in Text.Combine ( Lines.FromBinary ( html ), "#(lf)" )

 

 

 

The rest of the query will remain as is so we should still be seeing the same result. You may now load this into the model.

 

In this blog post, we have discussed

  • how to get the image URL from a webpage by finding the pattern in the underlying HTML. 
  • how to return the binary file returned by Web.BrowserContents as html.

Attached is sample pbix for youre reference. A live report using a similar technique can also be found herehttps://app.powerbi.com/view?r=eyJrIjoiZjljYmViZTQtMGI1NC00ZmI3LTlkNTUtYTIxYmIxMWVlMjk0IiwidCI6ImRmO… 

 

power query multiple web pages.pbix



Source link

Be the first to comment

Leave a Reply

Your email address will not be published.


*