In the previous article, Python Tutorial for Digital Marketers 1, we discussed how digital marketers can benefit from Python, why they need it, and how to install and set up the latest Python version on macOS. As you might be aware, one of Python's most valuable uses for digital marketers is scraping web data and keeping that data updated automatically.
In this Python tutorial, I'll talk about how to set up an environment for writing Python scripts that scrape data from a target website. This article doesn't go into detail on Python methods, writing the scraping code, or feeding the data into a spreadsheet or database; I'll release other articles and videos to walk through those. The purpose of this article is to give you the big picture of which components are necessary and how they work together.
By the end of this Python tutorial, you will be able to install beautifulsoup4, requests, lxml, html5lib, and Sublime Text, and use them together to scrape web data.
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Installing beautifulsoup4 is not complex. Below are the steps:
1. Go to pypi.org and download the latest version of beautifulsoup4 (beautifulsoup4-4.9.3 at the time of writing).
2. Open the Mac Terminal, move into the unpacked beautifulsoup4 folder you downloaded (saved on the Desktop in this example), and input:
sudo python3 ./setup.py install
3. Check that beautifulsoup4 is installed successfully.
Input: pip3 install beautifulsoup4. If the output says "Requirement already satisfied", the installation is done.
Once it's installed, we need to make sure we have a parser to parse the HTML. The choice of parser matters for the result you get back, and how much it matters depends on how the target HTML page is written. If the target page is perfectly formed, there is no difference between the parsers' output. But if the page contains mistakes, such as unclosed or mismatched tags, different parsers fill in the missing information differently, so the parse tree you get back can vary from one parser to another.
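To see this for yourself, here is a minimal sketch that feeds the same broken snippet to each parser and prints what comes back. The snippet "<a></p>" is a deliberately invalid example, and the helper name repair is my own. The code simply skips any parser (or Beautiful Soup itself) that isn't installed yet, so it is safe to run before the installs further down.

```python
# Compare how each available parser repairs the same broken HTML.
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # beautifulsoup4 is not installed yet
    BeautifulSoup = None

BROKEN_HTML = "<a></p>"  # invalid on purpose: a closing </p> with no opening <p>


def repair(markup, parser):
    """Return the markup as repaired by `parser`, or None if it is unavailable."""
    if BeautifulSoup is None:
        return None
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:  # that parser library is not installed
        return None


# "html.parser" ships with Python; "lxml" and "html5lib" are separate installs.
results = {p: repair(BROKEN_HTML, p) for p in ("html.parser", "lxml", "html5lib")}
for parser, result in results.items():
    print(parser, "->", result)
```

If the printed lines differ from one parser to the next, that is the difference this section is talking about.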
The BeautifulSoup4 documentation has a section that explains the differences among parsers. In short, it suggests installing and using the lxml and html5lib parsers. Here is how to install them in the Mac Terminal:
pip3 install lxml
pip3 install html5lib
Requests is a Python library for making HTTP and HTTPS requests easily. Its primary purpose here is to fetch the target data from a Python script, much as a browser does when you type in a URL to open a page. Requests has two main use cases: making requests to an API and getting raw HTML content from websites (i.e., scraping).
Installing Requests is pretty easy. Below are the steps:
- Open the Mac Terminal.
- Input: pip3 install requests (Note: remember to use pip3 if you haven't created an alias that points your Mac's default Python to the latest Python 3 version, which I use here as an example. Otherwise, the package might be installed under the wrong Python version's folder path.)
- Wait until the output confirms that Requests was installed successfully; the message includes the installed version information.
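Before moving on, you may want to confirm from Python itself that all four packages are in place. The sketch below is one way to do it, assuming Python 3.8 or later (which is when importlib.metadata joined the standard library). The helper name installed_version is hypothetical, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# The four packages installed in this tutorial (names as used with pip3).
PACKAGES = ["beautifulsoup4", "requests", "lxml", "html5lib"]


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


report = {name: installed_version(name) for name in PACKAGES}
for name, found in report.items():
    status = found if found else "not installed (try: pip3 install " + name + ")"
    print(name, "->", status)
```

Run it with python3 so that it checks the same Python version that pip3 installed the packages into.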
Sublime Text Editor
Sublime Text is a shareware, cross-platform source code editor with a Python application programming interface (API), and it can be downloaded and evaluated for free. It natively supports many programming and markup languages, and users can add functions with plugins, which are typically community-built and maintained under free-software licenses.
There are lots of free editors available, such as Atom. You can use a similar editor if you already have one. I'll take Sublime Text as an example to walk you through how to use it to create scripts and scrape web data.
1. Check the build system and update to the latest Python
In Sublime Text, if you go to Tools and then Build System, you can find that many programming language options are available, including Python. However, the default Python version might not be up to date. For example, if we select Python and run a single line of code, the output may show Python 2.7 instead of the latest Python 3.
2. Add a new Python3 build system
When you add a new build system, Sublime Text opens a file that contains a default line of code.
Replace it with the code below and save the file (with the .sublime-build extension). The new Python3 build system is now created, and you can check it by running import sys and print(sys.version).
{
    "cmd": ["python3", "-i", "-u", "$file"],
    "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)"
}
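As a quick check of the new build system, a script along these lines can be run with Command + B. It is only a sketch: the variable name is_python3 is mine, and the exact version string and path depend on your installation.

```python
import sys

print(sys.version)     # full version string of the interpreter running this file
print(sys.executable)  # path to that interpreter

# True when the selected build system runs Python 3 rather than Python 2.7.
is_python3 = sys.version_info.major == 3
print("Python 3 build system is active" if is_python3 else "Still running Python 2")
```

If the last line reports Python 2, go back to Tools and Build System and make sure the new Python3 entry is selected.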
Web Scraping Case:
Everything is ready now, so we can test web scraping in Sublime Text.
First of all, we need BeautifulSoup and requests, so let's start by importing them:
from bs4 import BeautifulSoup
import requests
Next, we create a variable that requests the HTML source text of my website's eCommerce article section:
source = requests.get('http://www.easy2digital.com/topics/ecommerce/').text
Then, we can parse this source code with BeautifulSoup and print it out:
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
Last but not least, we press Command + B to run the script, and the full source code of the page is printed. This data is still not helpful on its own, because we need to write lines of code that specifically scrape the division (div) data we need.
That said, the web scraping environment in Sublime Text is now working. What remains is to decide what we aim to scrape and to write the code for that objective in Sublime Text.
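To give a rough idea of what that targeted code can look like, here is a minimal sketch. SAMPLE_HTML is made-up markup standing in for a downloaded page; the class names "article" and "sidebar" and the function name extract_articles are assumptions for illustration, not the real structure of my website. It uses Python's built-in "html.parser" so it runs even without lxml.

```python
try:
    from bs4 import BeautifulSoup
except ImportError:  # beautifulsoup4 is not installed yet
    BeautifulSoup = None

# Hypothetical page structure, used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <div class="article"><h2><a href="/post-1">First post</a></h2></div>
  <div class="article"><h2><a href="/post-2">Second post</a></h2></div>
  <div class="sidebar"><h2><a href="/about">About</a></h2></div>
</body></html>
"""


def extract_articles(html):
    """Return (title, link) pairs for every div with the class "article"."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for div in soup.find_all("div", class_="article"):
        link = div.find("a")  # first link inside this article block
        articles.append((link.get_text(strip=True), link["href"]))
    return articles


# Skip the extraction if Beautiful Soup is missing instead of crashing.
articles = extract_articles(SAMPLE_HTML) if BeautifulSoup else []
for title, href in articles:
    print(title, href)
```

With a real page, you would pass the source variable from the scraping case to the function instead of SAMPLE_HTML, and adjust the tag and class names to match what you see in that page's HTML.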
I hope you enjoyed reading Python Tutorial for Digital Marketers 2: Web Scraping with BeautifulSoup, Requests, Sublime Text.