
DailyWiki: Building a Web Scraper with Python

Hands-On Lab

 


Keith Thompson

DevOps Training Architect II in Content

Length

01:00:00

Difficulty

Intermediate


Introduction

Virtually limitless information is housed on the internet, but not all of it is accessible via APIs. Web scraping allows us to extract information from web pages so that we can use it in other applications or access it in different formats. In this hands-on lab, we'll use Scrapy to create a web scraper that will fetch us Wikipedia's featured articles and export them as a JSON file that we can access later.

Connect to the Lab

Option 1: Connect with the Visual Studio Code (VS Code) Editor

  1. Open your terminal application, and run the following command (remember to replace PUBLIC_IP with the public IP you were provided on the lab instructions page):
    ssh cloud_user@PUBLIC_IP
  2. Enter yes at the prompt.
  3. Enter your cloud_user password at the prompt.
  4. Run exit to close the connection.
  5. Run the following command:
    ssh-copy-id cloud_user@PUBLIC_IP
  6. Enter your password at the prompt.
  7. Open Visual Studio Code.
  8. In the search bar at the top, enter cloud_user@PUBLIC_IP.
  9. Once you've connected, click the square Extensions icon in the left sidebar.
  10. Under Local - Installed, scroll down to Python and click Install on SSH.
  11. Click Reload to make the changes take effect.

Option 2: Connect with Your Local Machine

  1. Open your terminal application, and run the following command (remember to replace PUBLIC_IP with the public IP you were provided on the lab instructions page):
    ssh cloud_user@PUBLIC_IP
  2. Type yes at the prompt.
  3. Enter your cloud_user password at the prompt.

Set Up a Project and Virtualenv Using Pipenv and the Scrapy Generator

  1. Create a new directory named daily_wiki. (Scrapy will generate an internal directory of the same name when the project is created in step 6.)
    mkdir daily_wiki
  2. Change to the daily_wiki directory.
    cd daily_wiki
  3. Install Pipenv.
    pip3.7 install --user -U pipenv
  4. Install Scrapy.
    pipenv --python python3.7 install scrapy
  5. Activate the virtualenv.
    pipenv shell
  6. Create the project.
    scrapy startproject daily_wiki .
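
After step 6, the project skeleton generated by scrapy startproject should look roughly like this (exact files can vary slightly between Scrapy versions):

    daily_wiki/              # the directory created in step 1 (your current directory)
        scrapy.cfg           # project configuration used by the scrapy CLI
        daily_wiki/          # the project's Python package
            __init__.py
            items.py         # item definitions (edited in the next task)
            middlewares.py
            pipelines.py
            settings.py
            spiders/         # generated spiders live here
                __init__.py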

Create an Article Item

  1. Open the daily_wiki/items.py file, and add the following at the end of the file:

    import scrapy


    class Article(scrapy.Item):
        # Each scraped article carries just a title and a link.
        title = scrapy.Field()
        link = scrapy.Field()
  2. Save the file.
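
If you'd like an optional sanity check, Scrapy items behave like dictionaries, so you can exercise the new Article class from a Python shell inside the virtualenv (the title and link values below are made up for illustration):

    from daily_wiki.items import Article

    # Fields are set with keyword arguments and read back with dict-style access.
    article = Article(title="Example article", link="https://en.wikipedia.org/wiki/Example")
    print(article["title"])
    print(article["link"])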

Create an Articles Spider

  1. Generate a new spider.
    scrapy genspider article en.wikipedia.org
  2. Open daily_wiki/spiders/article.py, and edit the contents to match the following:

    # -*- coding: utf-8 -*-
    import scrapy

    from daily_wiki.items import Article


    class ArticleSpider(scrapy.Spider):
        name = 'article'
        allowed_domains = ['en.wikipedia.org']
        start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Featured_articles']

        def parse(self, response):
            # Each featured article is linked from an element with the
            # "featured_article_metadata" class; build an absolute URL from
            # the relative href and emit one Article item per link.
            host = self.allowed_domains[0]
            for link in response.css(".featured_article_metadata > a"):
                yield Article(
                    title=link.attrib.get("title"),
                    link=f"https://{host}{link.attrib.get('href')}"
                )
    
  3. Save the file.
  4. Test the spider by running the following command:
    scrapy crawl article
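
The crawl log should show scraped items and end with stats that include an item count. If it doesn't, one way to troubleshoot is to test the CSS selector interactively with Scrapy's shell; the session below simply replays the spider's selector against the same page:

    scrapy shell "https://en.wikipedia.org/wiki/Wikipedia:Featured_articles"
    >>> links = response.css(".featured_article_metadata > a")
    >>> len(links)
    >>> links[0].attrib.get("title"), links[0].attrib.get("href")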

Export Articles as JSON

  1. In the article.py file, add the following beneath the line that begins with start_urls:
      custom_settings = {
          'FEED_FORMAT': 'json',
          'FEED_URI': 'file:///tmp/featured-articles-%(time)s.json'
      }
  2. Run the spider.
    scrapy crawl article
  3. Verify that the JSON file was generated:
    ls -al /tmp | grep featured
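
Since the point of the export is to use the data elsewhere, here is a minimal sketch of reading the file back in Python; it picks up the most recent export, because the timestamp in the filename changes on every run:

    import glob
    import json

    # Find the newest export written by the spider.
    path = sorted(glob.glob("/tmp/featured-articles-*.json"))[-1]

    with open(path) as f:
        articles = json.load(f)

    print(len(articles), "featured articles")
    print(articles[0]["title"], articles[0]["link"])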

Conclusion

Congratulations, you've successfully completed this hands-on lab!