9 January 2024

How to Automatically Scrape Webpages on a Set Schedule

By Ronald Smith

Have you ever wondered how to extract data from web pages in a structured format? Well, there’s a technique called web scraping that can help you with that! It’s a super efficient way to gather data from online sites, especially if you need it for an application or another website.

Web scraping, also known as data scraping, has so many uses! You can compare prices across different websites, collect market research data, monitor products, and do all kinds of research. As a data scientist, I find it especially helpful for getting data that APIs don’t provide. Whether you’re just starting out or already an expert, it’s a handy tool to have.

Hey there, let me show you two different ways to scrape data from websites. Don’t worry, I’ll explain everything in a way that’s easy to understand.


Method #1: Using a scraping tool

If you’re not a developer or don’t know much about Python, not to worry. I have a solution that’s perfect for you. There are lots of tools available that can help you scrape the web without needing to do much programming. Believe it or not, some web scrapers even have an easy-to-use interface.

Octoparse is an amazing tool that makes web scraping easy. In this post, I’ll show you how to scrape the web using Octoparse in just three simple steps. The best part is that Octoparse offers a free plan through its app, so you can do small scraping tasks without spending any money.

But that’s not all. Octoparse also has advanced features that let you scrape the web without any programming skills. One feature I really like is the templates. These templates allow you to scrape data from popular websites without having to do any configuration. For example, you can choose a template that’s specifically built for scraping product data from Amazon or eBay.

Automated Data Extraction

The feature I want to start with is automatic data extraction: Octoparse can find and extract data from web pages on its own. It’s like having a superpower!

When I first started using Octoparse, I noticed that it works really well with lists or tables of data. It’s the quickest and easiest way to get started with web scraping. Let me show you how:

  1. The first step is to go to https://www.octoparse.com/signup and sign up for a free account. It only takes a few minutes and you’ll be ready to go.
  2. Once you’ve signed up and confirmed your email address, log in to Octoparse. You’ll be greeted with a screen that says “Account’s ready!” – exciting, right? Just click on the “Start Free Premium Trial” button to begin.
  3. Now, all you need to do is enter your card details or choose to pay with PayPal to activate the trial. It’s quick, easy, and totally worth it.
  4. Next, click on the Download Our Free Software button and choose the 8.1 Beta build. This will download the Octoparse application for either your Windows or macOS platform.
  5. After the download finishes, unzip the file and open the Octoparse Setup 8.1.xx.exe file to install the application. Just follow the simple on-screen instructions.
  6. Once Octoparse is installed, open the application and log in to your account. Make sure to select the options for Remember Password and Auto Login for added convenience.
  7. Now you’re inside the Octoparse interface. Take a moment to explore and check out the available video tutorials. They’ll give you a better understanding of how to use the software.
  8. To start scraping a website, simply enter the URL of the webpage into the designated text box. For example, I’ll be scraping the following webpage: https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368
  9. Once you’ve entered the URL, Octoparse will load the webpage and analyze it to identify potential data that can be extracted. Make sure to uncheck the “Click on a ‘Load More’ button” option under Tips.
  10. If the data displayed at the bottom of the application is not what you’re looking for, click on the “Switch auto-detect results” option to see other candidates.
  11. When you’ve found the data you need, click on the “Create workflow” button to proceed with the scraping process.


Extracting Data Manually with Octoparse

Sometimes, Octoparse’s automatic data extraction features may not be enough for you. Maybe the web page you’re trying to extract data from is complex or dynamic. But don’t worry, I’ve got you covered! Octoparse allows you to manually select the data you want to extract. Here’s how:

  1. First, follow the steps for “Using automatic data extraction” up to step #8.
  2. Octoparse will begin loading the web page and identifying potential data to extract. Under “Tips,” click “Cancel Auto-Detect” to switch to manual data extraction.

Alright, here’s what you need to do:

1. Start by clicking on the different data items you want to extract from the web page. For example, I clicked on the title, price, and shipping information for the Echo Dot 4th Gen.

![Click the data items you want to extract](/wp-content/uploads/how-scrape-webpages-7ee0.jpg)

2. Once you’re done selecting the data, open the “Tips” dialog and choose the “Extract data” option. This will help you test the workflow and make sure everything is working correctly.

3. Finally, don’t forget to save your work! Just click on the “Save” button in the top-left corner of Octoparse, and your task will be saved.

Now, let’s run the task and extract some data. Follow these steps:

1. Go to the task tab and click on “Run” in the top-left corner of Octoparse. This will run your task and extract the data for you.

That’s it! You’ve successfully created a task to extract pricing information from eBay using Octoparse. But remember, for now, you’ll still have to run it manually. If you want to automate it and extract data periodically, stay tuned for the next steps.

Here’s how you can schedule a task in Octoparse Cloud:

1. First, open the Run Task dialog by clicking on the “Schedule task (Cloud)” button.

2. Next, choose one of the options: “Once,” “Weekly,” “Monthly,” or “Repeats.” Select the one that fits your schedule.

3. Once you’ve chosen the frequency, you can configure the task accordingly.

4. Finally, click on either the “Save” or “Save and Run” buttons to save your scheduled task.

And that’s it! Your task is now scheduled and will run automatically in Octoparse Cloud based on your chosen settings. To view the data collected by the task, simply go to the Dashboard, click on the More button for your task, and select “View data > Cloud data.”


Method #2: Building your own scraping program

If you’ve already looked into the first method and want more control, or if you’re a programmer interested in learning a programmatic approach to web scraping, this method is for you. With this approach, we’ll leverage a tool called Scrapy to create our own solution. I’m assuming that you already have some knowledge of HTML, CSS, and Python.

Scrapy is an open-source framework specifically designed for extracting data from websites. It’s a popular choice among data scientists for data scraping tasks. In my experience, it works really well for both small and large projects. However, for larger projects, you may need to do some extra configuration and use additional tools to optimize its performance.

Prerequisite

Hey there! Have you heard about CSS selectors? They’re super handy for zeroing in on specific elements on a webpage. If you’re curious to learn more, W3Schools has some great info on selectors. Let’s say you want to find all the big headings on a webpage – simply use h1 as the selector. Need help finding an element’s selector in Google Chrome? Here’s what you do:

  1. Right-click on the element you’re interested in, and choose Inspect. For example, I’m currently trying to find the selector for the “Example Domain” text below.
  2. Right-click on the element in the Elements section, go to Copy, and then select Copy selector.
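Once Scrapy is installed (see the next section), you can sanity-check a selector with its Selector class before wiring it into a spider. Here’s a minimal sketch against an inline stand-in for example.com’s markup; the exact selector Chrome copies depends on the page you inspect:

```python
# Quick way to test a CSS selector with Scrapy's Selector class.
from scrapy.selector import Selector

# A tiny stand-in for example.com's markup.
html = '<html><body><div><h1>Example Domain</h1></div></body></html>'
sel = Selector(text=html)

# 'h1' targets the big heading; '::text' extracts its text content.
print(sel.css('h1::text').get())               # -> 'Example Domain'
# The more specific selector Chrome copies also works:
print(sel.css('body > div > h1::text').get())  # -> 'Example Domain'
```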


How to Scrape Web Pages with Scrapy

If you want to scrape data from a web page that hasn’t been covered in this post, or if you need to extract more information from those pages, you’ll need to find selectors and utilize them. To learn how, follow along with the steps below, as I demonstrate how to scrape the price of an Amazon Echo Dot 4th Generation from eBay using Scrapy:

  1. Create a new folder for your project, naming it something like “scrape-web-regularly”.
  2. Open a terminal and navigate to this folder (e.g., type “cd scrape-web-regularly”), then install Scrapy by running the following command: “pip install scrapy”.

If you encounter an error, pay close attention to the message for steps on how to resolve it. For instance, if the error states, “error: Microsoft Visual C++ 14.0 or greater is required. Get it with Microsoft C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/,” you should click on the provided link to download and install it. Afterward, give it another try. In situations like these, you also have the option of using Anaconda to acquire prebuilt packages.
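If you go the Anaconda route, Scrapy is available as a prebuilt package on the conda-forge channel, so (assuming conda is already set up) the install is a one-liner:

```
conda install -c conda-forge scrapy
```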

  3. Create a new file named try-one.py in the project folder and define a spider in it. Our spider is named ‘ebay_spider’, and its purpose is to scrape data from the eBay listing. It goes to the URL (stored in a variable called link1), looks for the element that contains the data we need (the title), extracts the text from that element, and yields the extracted data.
  4. Run the spider from the project folder. Scrapy keeps track of its progress and prints updates along the way, and the yielded data shows up in those log messages.
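Here’s a minimal sketch of what try-one.py could look like. The structure (the spider name, link1, yielding the title) comes straight from the steps above, but the CSS selector is an assumption; copy the real one from DevTools as described in the Prerequisite section:

```python
# try-one.py -- a minimal Scrapy spider sketch.
import scrapy

# The eBay listing from Method #1's example.
link1 = ('https://www.ebay.com/itm/Amazon-Echo-Dot-4th-Gen-With-Clock-'
         '2020-Smart-Speaker-Alexa-All-Colors-NEW/363219888368')

class EbaySpider(scrapy.Spider):
    name = 'ebay_spider'
    start_urls = [link1]

    def parse(self, response):
        # Assumed selector for the listing title -- replace it with the
        # one you copied from your browser's DevTools.
        title = response.css('h1 span::text').get()
        # Yield the extracted data; it shows up in Scrapy's run log.
        yield {'title': title}
```

Run it with scrapy runspider try-one.py and watch the log messages for the yielded item.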

If you’re not seeing any of these messages, let’s troubleshoot. First, check your internet connection. If it’s working fine, try opening the link in a web browser. If it doesn’t open, the link may be incorrect. Double-check the value of “link1” in your code.

If you still can’t get it to work, there could be a few reasons. One possibility is that you’re making too many requests in a short amount of time, which can cause the website to block you. To fix this, try increasing the interval between your crawls or runs, as in the sketch below.
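In Scrapy, one way to space requests out is with per-spider settings. DOWNLOAD_DELAY and CONCURRENT_REQUESTS are standard Scrapy options; the values below are just a starting point, and the URL is a placeholder:

```python
import scrapy

link1 = 'https://example.com/'  # placeholder; use your real target URL

class EbaySpider(scrapy.Spider):
    name = 'ebay_spider'
    start_urls = [link1]
    # Slow the crawl down so the site is less likely to block you.
    custom_settings = {
        'DOWNLOAD_DELAY': 5,       # wait about 5 seconds between requests
        'CONCURRENT_REQUESTS': 1,  # only one request in flight at a time
    }

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}
```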

When I first tried scraping eBay, I ran into issues: eBay detects and blocks requests that don’t come from a real user in a web browser, so my plain Scrapy requests failed. It’s not just eBay either; other big and popular websites can block Scrapy requests the same way. If a site keeps rejecting you even after slowing your crawl down, chances are it has implemented scraping protection technology.

This is where ScrapingBee comes in handy. ScrapingBee is a web scraping service designed to bypass these scraping protection technologies and allow you to scrape the web without getting blocked. It offers an easy-to-use API that fetches pages through headless browsers and rotating proxies, so you can keep using Scrapy and focus on gathering the data you need while ScrapingBee handles the protection layer for you.

I also really like that ScrapingBee offers a free trial with 1,000 free API calls. This allows you to test out the service and work on small scraping projects without any cost.

Now, let’s dive into how to use the ScrapingBee API in our Scrapy project:

1. Go to https://www.scrapingbee.com/ and click on “Sign Up” to register.

![ScrapingBee offers a web scraping API](/wp-content/uploads/how-scrape-webpages-ed5b7a.jpg)

2. Once you’ve confirmed your email address and logged into ScrapingBee, you’ll see the dashboard, which shows your account information and more.

![ScrapingBee](/wp-content/uploads/how-scrape-webpages-77d18.jpg)

3. In the “Request Builder” section, enter the link you want to scrape under “URL.” If the page doesn’t need JavaScript, uncheck “JavaScript Rendering.” As for me, I’m trying to scrape the price of the Amazon Echo Dot 4th Gen from eBay.

![Build a request for the ScrapingBee API](/wp-content/uploads/how-scrape-webpages-740e6a.jpg)

4. Under the cURL tab, you’ll find the request you need to copy into your code. Remove the “curl” word and copy just the link inside the quotes. For example, where your code had link1 = ‘https://www.ebay…’, you’d now use link1 = ‘https://app.scrap…’.

5. Once you’ve done that, go back to step #4 in the “Using Scrapy” section and run the spider again.

That’s it! You’re all set to dive into scraping with ScrapingBee. Happy scraping!
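To make the swap concrete, here’s roughly what the change looks like in try-one.py. The API key and the ScrapingBee URL below are placeholders following the general shape of the API; copy the exact string from your own Request Builder:

```python
import scrapy

# Placeholder ScrapingBee request URL; substitute the exact string from
# your Request Builder's cURL tab. YOUR_API_KEY is not a real key, and
# the target url parameter must be URL-encoded.
link1 = ('https://app.scrapingbee.com/api/v1/'
         '?api_key=YOUR_API_KEY'
         '&url=https%3A%2F%2Fwww.ebay.com%2Fitm%2F363219888368')

class EbaySpider(scrapy.Spider):
    name = 'ebay_spider'
    start_urls = [link1]

    def parse(self, response):
        # ScrapingBee returns the fetched page's HTML, so the same CSS
        # selector you used when scraping eBay directly still applies.
        yield {'title': response.css('h1 span::text').get()}
```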

Now that you’ve built the scraper and the logic to scrape data, you’ll want it to run automatically at scheduled intervals rather than running it manually every time. To do this, we’ll automate our custom scraper.

If you’re using a Linux OS like Ubuntu or Linux Mint, you can use a cron job to schedule your scraper. If cron is new to you, [this cron guide](https://cronrats.com/cron-jobs-guide/) covers the syntax in detail. Here’s how to set it up:

1. Open a terminal and run crontab -e to edit your cron file.
2. In the cron file, add a line of the form CRON_SCHEDULE cd PROJECT_DIR && scrapy runspider try-one.py, then save it to schedule your cron job.
3. CRON_SCHEDULE: the default schedule is * * * * *, which means run it every minute. You can customize the schedule using cron syntax, such as 0 * * * * for once every hour, or 0 0 * * * for once every day.
4. PROJECT_DIR: this is the directory where you created your Scrapy project. Refer to step #1 in the “Using Scrapy” section above to find it.
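Putting those placeholders together, a finished crontab line might look like this (the /home/ron/scrape-web-regularly path is just an example; use the absolute path to your own project folder):

```
# Run the spider every day at midnight and append results to prices.json
0 0 * * * cd /home/ron/scrape-web-regularly && scrapy runspider try-one.py -o prices.json
```

The -o flag tells Scrapy to append the yielded items to an output file, which is handy when you’re collecting prices over time.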

Now, your scraper will run automatically at the scheduled time using cron in Linux. If you are using Windows 10, you can use Task Scheduler to schedule your scraping task to run periodically. Check out my guide on automating repetitive tasks for more information.