Power BI/Power Query – Web Scraping (a full page with images)

A few days ago, I blogged about how to extract data from PDF file and how it’s very easy to do that with Power BI. The information extracted were the financial data of all Algerian insurance companies.
The same colleague told me that in the UAR’s website, there are many interesting information about the companies (like address, name of CEO, logo, phones …), is there a way to import them in Power BI? I said yes, you can scrape a full webpage with Power BI. Let’s see how it’s work!
In this tutorial, I will show you how to import the following information from this link:
1- the official name / 2- the commercial name (the name after “coordonnées de la” / 3- the address / 4- the logo web address (the main difficult and interesting part)

Just follow these steps:

  1. In get data tab choose from web and past the webpage link then select Add tables using examples.
  2. In the first column, we will add the “official name” by taping the first letters of the first company name in the first cell. You will see many suggestions, just select “L’Algérienne des assurancecs -2A-“. In the second cell we write “L’AlgérienneVie” and press enter. Then Power BI will suggest all other companies’ names in the column.

C. Repeat the step B to get the Commercial name and the address in the column 2 and 3
D. Name the 3 columns by double clicking on the header of each column and give them a name
E. After pressing OK, select the button Edit. In Power Query you get this table with this M code in the advanced editor:

You can notice that the CSS selector for each columnis between brackets
Column1 = {« Official name », « [style*= » »padding-right\:15px » »] »}
Column2 = {« Commercial name », ».tx-column:nth-child(3) >DIV.tx-heading[style*= » »text-align\:left » »][style*= » »margin-bottom\:10px » »]:nth-child(1):nth-last-child(2) »}
Column3 = {« Adress », »:nth-last-child(7):nth-child(3) »}

F. Now, let’s import the logo web address (this blog from Chris Webb helped me a lot). First, we see the html attributes of the images

Then we add this line of the M code to get a new column with logos web address {« imgurl », »img[src^= » »http » »] », each [Attributes][src]}}. With this code, we are asking Power Query to get all the elements that start with http of each attribute src and of each class img. (for people who don’t understand this line, it’s html code and it’s very simple to learn it).
Now we have our 4th column :

Then we add this line of the M code to get a new column with logos web address {« imgurl », »img[src^= » »http » »] », each [Attributes][src]}}. With this code, we are asking Power Query to get all the elements that start with http of each attribute src and of each class img. (for people who don’t understand this line, it’s html code and it’s very simple to learn it).
Now we have our 4th column :

Hope you enjoyed this tutorial. You can download the PBIX file here. If you have any question, don’t hesitate to contact me. See you 😊

Publicités

4 commentaires sur “Power BI/Power Query – Web Scraping (a full page with images)

Répondre

Entrez vos coordonnées ci-dessous ou cliquez sur une icône pour vous connecter:

Logo WordPress.com

Vous commentez à l'aide de votre compte WordPress.com. Déconnexion /  Changer )

Photo Google

Vous commentez à l'aide de votre compte Google. Déconnexion /  Changer )

Image Twitter

Vous commentez à l'aide de votre compte Twitter. Déconnexion /  Changer )

Photo Facebook

Vous commentez à l'aide de votre compte Facebook. Déconnexion /  Changer )

Connexion à %s