Python Tutorial for Digital Marketers 10 – Python Web Scraping – Pagination and Shopify Product Pages

In previous chapters, we discussed how to scrape website HTML information and Shopify product information via the JSON API. On most websites and platforms, however, content such as articles and products spans more than one page. This is called pagination (for example, page 1, previous page, next page), and the scripts and datasets covered so far only scrape a single URL page.

In this article, I'll walk you through how to scrape paginated content using Python, via either the website HTML or the JSON API, so you can capture all of your target data. By the end of this article, you will be able to use the Pandas library and some new methods, and customize the script based on your business needs.


Import Web Scraping Modules

We will use the bs4, requests, and Pandas libraries in this script. As we will also take Shopify as a second example, we need to import json as well.

from bs4 import BeautifulSoup

import requests

import pandas as pd

import json


Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language. It is very useful for restructuring the dataset and saving it in CSV format.

Identify the website pagination URL structure

I'll take Easy2Digital's blog folder as the first example. As you can see from the blog path, the number after page/ indicates the pagination page. Thus, we can create a variable that takes the place of this number and loop through it to scrape each page accordingly:

http://www.easy2digital.com/blog/page/1/

http://www.easy2digital.com/blog/page/2/

http://www.easy2digital.com/blog/page/3/


Here is the code, where we set the pagination number as 'x' and use a for loop together with the range() and str() functions.

The range() function creates a sequence of numbers from a start value up to, but not including, a stop value, and the loop visits each item in the sequence. In this case, we can set a number like 20, which is already more than the number of my blog's pagination pages. I recommend picking a number slightly larger than the actual page count.

The str() function of Python returns the string version of an object, ensuring the page number can be concatenated onto the URL string.
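To see how range() and str() work together here, a quick sketch that builds the first few page URLs (using a small stop value of 4 just for illustration):

```python
base = 'http://www.easy2digital.com/blog/page/'

# range(1, 4) yields 1, 2, 3 -- the stop value is excluded
urls = [base + str(x) for x in range(1, 4)]

print(urls[0])    # http://www.easy2digital.com/blog/page/1
print(len(urls))  # 3
```

Without str(), concatenating the integer x onto the URL would raise a TypeError.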

Last but not least, we need to create a variable holding an empty list, which will collect the whole scraped dataset at the end.

easy2digitalweb = []

for x in range(1, 20):
    URL = 'http://www.easy2digital.com/blog/page/'
    easy2digitalR = requests.get(URL + str(x))
    soup = BeautifulSoup(easy2digitalR.content, 'lxml')
    content = soup.find_all('article', class_='d-md-flex mg-posts-sec-post')
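A fixed range(1, 20) works, but you can also stop as soon as a page comes back empty instead of guessing the page count. A minimal sketch, where scrape_page is a hypothetical stand-in for the requests/BeautifulSoup block above:

```python
def scrape_all_pages(scrape_page, max_pages=50):
    """Collect results page by page, stopping at the first empty page."""
    results = []
    for x in range(1, max_pages + 1):
        items = scrape_page(x)  # returns the list of articles found on page x
        if not items:           # an empty page means we ran past the last page
            break
        results.extend(items)
    return results

# Hypothetical fetcher standing in for requests + BeautifulSoup:
fake_site = {1: ['post-a', 'post-b'], 2: ['post-c']}
posts = scrape_all_pages(lambda x: fake_site.get(x, []))
print(posts)  # ['post-a', 'post-b', 'post-c']
```

This way the scraper never requests pages that do not exist, which is also friendlier to the target server.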

If we have to scrape via a platform API like Shopify, below is the code, taking another website as the example: Wasserstein Home.


In the Shopify frontend product API, the JSON structure looks like this, where each page returns at most 250 products. The page parameter represents the pagination value:

URL = 'https://wasserstein-home.com/products.json?limit=250&page='

So it's quite similar to website HTML pagination; we just need to scrape via the platform API instead:

for x in range(1, 10):
    URL = 'https://wasserstein-home.com/products.json?limit=250&page='
    productweb = requests.get(URL + str(x))
    ProductData = productweb.json()
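Shopify's products.json typically returns an empty products list once you page past the last page of data, so here too the loop can break early rather than rely on a guessed range. A sketch with a hypothetical get_page standing in for requests.get(URL + str(x)).json():

```python
def fetch_all_products(get_page, max_pages=20):
    """Accumulate Shopify products until a page returns an empty list."""
    products = []
    for x in range(1, max_pages + 1):
        data = get_page(x)                 # stands in for requests.get(...).json()
        batch = data.get('products', [])
        if not batch:                      # past the last page: {"products": []}
            break
        products.extend(batch)
    return products

# Hypothetical two-page catalog: 250 products on page 1, one on page 2
pages = {1: {'products': [{'title': 'Cam Mount'}] * 250},
         2: {'products': [{'title': 'Solar Panel'}]}}
all_products = fetch_all_products(lambda x: pages.get(x, {'products': []}))
print(len(all_products))  # 251
```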

Write lines of code to scrape target datasets

Now that we have scraped the block data, it's time to pick out the data we need.

Below is the Easy2Digital blog example for your reference. For more details, please check out the articles linked below, as we have covered this previously.

for element in content:
    title = element.h4.text
    landing = element.h4.a['href']
    summary = element.find('div', class_='mg-content').p.text

Python Tutorial for Digital Marketers 4: How to Specify Web Data to Scrape

Python Tutorial for Digital Marketers 8: One Script to Scrape Competitor Shopify Web Product Data

Append the Web Scraping dataset

Previously, in the chapters on the CSV module and Google, we talked about how to append the scraped dataset. Here we use the Pandas library, which is more convenient for manipulating data in rows and columns.

First things first, we create a variable that defines the scraped dataset for each element. Then we can call the append() method, and the data will be organized into separate columns with the unique header names defined in element_info:

    element_info = {
        'title': title,
        'landing': landing,
        'summary': summary
    }

    easy2digitalweb.append(element_info)

print(len(easy2digitalweb))

Then we use the len() function to show how many items were scraped; the number helps you judge whether the dataset size makes sense.

Save the dataset in CSV format using the DataFrame and to_csv methods

df = pd.DataFrame(easy2digitalweb)
print(df)
df.to_csv('easy2digitalblog.csv')

Those who are familiar with R know the data frame as a way to store data in rectangular grids that can easily be overviewed. Each row of these grids corresponds to the measurements or values of an instance, and each column is a vector containing data for a specific variable. This means a data frame's rows can, but do not need to, contain the same type of values: they can be numeric, character, logical, etc.

DataFrames in Python are very similar: they come with the Pandas library and are defined as two-dimensional labeled data structures with columns of potentially different types. In general, a Pandas DataFrame consists of three main components: the data, the index, and the columns.
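Following the dict-per-row pattern used above, a list of element_info-style dictionaries converts straight into a DataFrame. A small standalone sketch with made-up blog rows:

```python
import pandas as pd

rows = [
    {'title': 'Post A', 'landing': '/post-a', 'summary': 'First post'},
    {'title': 'Post B', 'landing': '/post-b', 'summary': 'Second post'},
]
df = pd.DataFrame(rows)

print(df.shape)          # (2, 3) -- two rows, three columns
print(list(df.columns))  # ['title', 'landing', 'summary']

# index=False drops the automatic numeric row index from the CSV output
df.to_csv('blog.csv', index=False)
```

Passing index=False to to_csv is optional but keeps the CSV free of the extra unnamed index column.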

We use the DataFrame constructor and the to_csv method that come with the Pandas library. Below are the final script of the Shopify product pagination scraper and the generated CSV file.

Full Python Script of Shopify Product Feed Data Scraper

If you would like the full version of the Python script of the Shopify Product Feed Data Scraper, please subscribe to our newsletter and add the message "Python Tutorial 10". We will send the script to your mailbox immediately.

Contact us

So easy, right? I hope you enjoyed reading Python Tutorial for Digital Marketers 10: Python Web Scraping – Pagination and Shopify Product Pages. If you did, please support us by doing one of the things listed below, because it always helps out our channel.

  • Support my channel through PayPal (paypal.me/Easy2digital)
  • Subscribe to my channel and turn on the notification bell: Easy2Digital YouTube channel
  • Follow and like my page: Easy2Digital Facebook page
  • Share the article on your social network with the hashtag #easy2digital
  • Buy products with the Easy2Digital 10% OFF discount code (Easy2DigitalNewBuyers2021)
  • Sign up for our weekly newsletter to receive Easy2Digital's latest articles, videos, and discount codes on Buyfromlo products and digital software
  • Subscribe to our monthly membership through Patreon to enjoy exclusive benefits (www.patreon.com/louisludigital)
