How to Build a Simple Scraper in Python

November 28, 2017

It is very common to come across a website that contains data you need to analyze, but websites usually present that data as HTML, which can be difficult to work with. Manually copying and pasting into a spreadsheet might work if the data set is small, but it quickly becomes frustrating and time consuming for larger amounts of data.

 

A preferred method to extract information from any website is to use an API. Most large websites provide access to their information through APIs, but smaller sites often do not. This is where scraping comes in.

 

Web scraping is an automated technique used to crawl websites and extract content from them. But before discussing the technical aspects, I need to mention that any scraping must comply with the target website's terms and conditions and with the legal use of its data.

Why Python?

 

I chose Python for this tutorial because of its ease of use and rich ecosystem. There are many libraries that can be used for scraping purposes, but I will use "BeautifulSoup" as a parser and "urllib" as a URL fetcher to walk you through the easiest way to implement a web scraper.
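To give a feel for how the two libraries work together, here is a minimal sketch that fetches a page with urllib and parses it with BeautifulSoup to print the page's title. The URL below is just a placeholder, not the page we will actually scrape:

import urllib.request as ur
from bs4 import BeautifulSoup

# fetch the raw HTML of a page (placeholder URL)
html = ur.urlopen('https://example.com').read()

# parse the HTML and print the contents of the <title> tag
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text())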

Inspecting the Page

 

Building a scraper is not a one-time task: it is an adaptable process that has to account for the website's structure and keep up with layout changes.

 

I chose the GPI blog as the data source for my scraper. Inspecting the HTML code shown in the screenshot below, you can see that the div holding all the blog information is <div id="article"> and that each blog entry sits inside a <dl class="blogItem"> element.

 

[Screenshot gpi-python-1: HTML structure of the blog listing]
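To make that structure concrete, here is a small self-contained sketch. The sample markup below is illustrative only (simplified from what the screenshot shows, not copied from the real page), but it demonstrates how BeautifulSoup pulls out the pieces the scraper will need:

from bs4 import BeautifulSoup

# simplified, illustrative markup based on the structure seen in the screenshot
sample_html = '''
<div id="article">
  <dl class="blogItem">
    <dt><a href="/some-post.aspx">Some Blog Post Title</a></dt>
    <dd><span class="date">November 28, 2017</span></dd>
  </dl>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
item = soup.find('dl', attrs={'class': 'blogItem'})
print(item.find('a').get_text())                              # post title (the link text)
print(item.find('a').get('href'))                             # post URL (the href attribute)
print(item.find('span', attrs={'class': 'date'}).get_text())  # publish date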

The Code


Let's start by importing the libraries we are going to use for this task.

 

# import libraries
import urllib.request as ur
from bs4 import BeautifulSoup


Since the GPI blog has many pages, it's better to prompt for the number of pages to scrape and keep that number small. This way you avoid hitting any website too aggressively, which could get you blocked as a spammer.

 

max_pages = input("How many pages do you want to scrape? ")
count = 1

while count < int(max_pages) + 1:
    # Note: urlopen needs an absolute URL, so prepend the blog's site root to this path
    url = '/translation-blog.aspx?page=' + str(count)
    data = ur.urlopen(url).read()
    soup = BeautifulSoup(data, 'html.parser')

    # each blog entry lives in a <dl class="blogItem"> element
    items = soup.find_all('dl', attrs={'class': 'blogItem'})

    for item in items:
        # the link text is the post title; the href attribute is the post URL
        print("Blog Title:", item.find("a").get_text())
        print("Blog URL:", item.find("a").get("href"))
        print("Blog Date:", item.find("span", attrs={'class': 'date'}).get_text())

    count += 1
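As noted above, it pays to be gentle with the site you are scraping. A simple optional safeguard is to pause briefly between page requests with Python's standard time module, for example one second per page:

import time

for page in range(1, 4):
    # ... fetch and parse one page here ...
    time.sleep(1)   # pause for one second before requesting the next page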

 

Once we have the number of blog pages we need to scrape, content retrieval is simple. The loop above retrieves each blog entry's title, URL and publish date and prints them to your console. In a real scenario, you would want to collect the data in a well-structured, tabular format, and a pandas DataFrame is a natural choice for that.

 

A DataFrame is an object that stores data in a tabular format, which facilitates data analysis.
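For example, a list of tuples can be turned into a DataFrame in a single line (the sample row and column names below are just for illustration):

import pandas as pd

# build a small DataFrame from a list of (date, title, url) tuples
rows = [('November 28, 2017', 'Sample post', '/sample-post.aspx')]
df = pd.DataFrame(rows, columns=['date', 'title', 'url'])
print(df)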

 

Below is how the final code looks:

 

import pandas as pd
import urllib.request as ur
from bs4 import BeautifulSoup

max_pages = input("How many pages do you want to scrape? ")
count = 1
records = []

while count < int(max_pages) + 1:
    # Note: urlopen needs an absolute URL, so prepend the blog's site root to this path
    url = '/translation-blog.aspx?page=' + str(count)
    data = ur.urlopen(url).read()
    soup = BeautifulSoup(data, 'html.parser')
    items = soup.find_all('dl', attrs={'class': 'blogItem'})

    for item in items:
        # the link text is the post title; the href attribute is the post URL
        title = item.find("a").get_text()
        blogurl = item.find("a").get("href")
        bDate = item.find("span", attrs={'class': 'date'}).get_text()
        records.append((bDate, title, blogurl))

    count += 1

# store the records in a DataFrame and export them to a CSV file
df = pd.DataFrame(records, columns=['date', 'title', 'url'])
df.to_csv('gpiblogs.csv', index=False, encoding='utf-8')

 

Here is a preview of the output file:

 

[Screenshot gpi-python-2: preview of the output CSV file]
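If you prefer to check the result without opening the file in a spreadsheet, you can read it back with pandas and look at the first few rows:

import pandas as pd

# load the CSV produced by the scraper and show the first rows
df = pd.read_csv('gpiblogs.csv')
print(df.head())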

Summary


Building a web scraper in Python is relatively easy and can be accomplished in a few lines of code. Web scraping in general is a fragile approach, though: it works reliably on well-structured pages whose HTML tags carry stable, informative attributes, but a change to the page layout can break the scraper at any time. APIs (when a website provides them) remain the preferred approach since they are far less likely to break.
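One small way to make a scraper like the one above more tolerant of markup changes is to verify that each lookup actually found an element before using it. The helper below is a sketch along those lines, reusing the tag and class names from the code above:

from bs4 import BeautifulSoup

def extract_records(soup):
    """Return (date, title, url) tuples, skipping entries whose markup has changed."""
    records = []
    for item in soup.find_all('dl', attrs={'class': 'blogItem'}):
        link = item.find('a')
        date = item.find('span', attrs={'class': 'date'})
        if link is None or date is None:
            continue   # tolerate missing pieces instead of raising AttributeError
        records.append((date.get_text(), link.get_text(), link.get('href')))
    return records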

GPI Resources on Connectors and Website Development

 

Globalization Partners International (GPI) frequently assists customers with multilingual website design, development and deployment, and has developed a suite of globalization tools to help you achieve your multilingual website localization project goals. You can explore them under the Translation tools and Portals section of our website.

 

For more information or help with your next website translation project, please do not hesitate to contact us via e-mail at info@globalizationpartners.com or to request a free web translation quote.





Rabab is a native Arabic speaker from Cairo, Egypt. She has over 12 years' experience in software and website architecture and implementation using open source programming tools, including PHP and MySQL, as well as jQuery, JSON and Ajax. She has served with various companies including Pyramid Technologies, Link-Dot-Net and InTouch Communications as a Programmer, Systems Engineer – Developer and Lead Web Developer. She earned her B.Sc. degree in Economics with a minor in Computer Science from Cairo University. She completed a SSDP Diploma from the Information Technology Institute (IDSC), The Cabinet, and is a certified Cloud Business Associate and a Microsoft Certified Professional (MCP, MCAD and MCSD.NET). Over the years she has designed and implemented a range of localized web-based business applications and portals for clients in the finance, hospitality, telecommunications and e-commerce sectors. In her free time she enjoys traveling, reading, swimming, tennis and outdoor activities.