Power Query: Blog List – Part 4

20 December 2023

Welcome to our Power Query blog. I would like to get a list of all the Power Query blogs I have posted. Today, I improve the query to extract data from all pages by extracting the last page number from the data.

I started writing this blog series way back in 2016:

There have been many developments since then, some of which I have reported upon. Since I am going to revisit more of the areas that have been improved, I am going to start with a reminder of what we have covered so far. In order to do this, I am going to use Power Query in Power BI, since there are some functions that will help me to access web data, which are not yet recognised in Power Query (Get & Transform) in Excel.

In Part 1, I noted that to get a full list, I am going to need several pages:

I extracted and tidied up the data for the first page:

In Part 2, I changed the Source step of my query to accept a parameter P_PageNumber, which allows me to get the data from a specific page. I used the following M code:

= Web.BrowserContents("https://www.sumproduct.com/blog/power-query-blogs?page="&Number.ToText(P_PageNumber))

I checked that this still gave the same results for the query:

Last week, I converted this query into a function, and generated a list of page numbers to concatenate my data:

This time, I start by going back to the function behind the query to review the step that I used to remove repetitive data from Description:

I remove the final step to see what is happening to the data in Description on the earlier rows in my combined query:

Clearly, I haven’t always been as consistent! A better approach would be to replace any repetitive text with empty space. I can do this in the combined query, currently called Query1:

I could improve my query further by removing the hard coded page number from my list generation and instead, extract the number of pages from the data. To do this, I begin by making a Duplicate of Table1:

I (optimistically) call my new query P_LastPage. I only want to keep the Source step:

I have highlighted the text I need, and I need the total number of pages from this:

<p>Page 1 of 37.</p>

Now, I need a way of extracting the number I want. My options at the moment appear to be limited:

I want to look for text before the delimiter “<p>Page 1 of”, and after the delimiter “.</p>”. To access this functionality, I need to convert the text to a Table:

This makes the data harder to see, but I know the delimiters I need to use:

On the Transform tab, I choose to Extract ‘Text Between Delimiters’:

I enter the required delimiters in the dialog:

This gives me the data I need, and I change the datatype to ‘Whole Number’ to allow me to use it in the list generation.

Now, I can right-click on and ‘Drill Down’ into this value:

This gives me my parameter P_LastPage:

I have renamed my main query to Extract Blogs. I need to change the current Source step:

Instead of the hard-coded value 40, I use P_LastPage:

Now I have the exact number of pages, I no longer need the ‘Filtered Rows’ step which was removing empty rows; therefore, I delete it and ignore warnings about deleting an intermediate step:

My query is ready to load, so I select all my data and ‘Detect Datatypes’ from the Transform tab:

Now the list is complete:

Next time, I will look at why this cannot be achieved in Power Query for Excel.

Come back next time for more ways to use Power Query!

Blog

Power Query: Blog List – Part 4

Request a free assessment