Recently, I started building a web scraper to collect houses available for sale from real estate websites. The main goal was to learn how to use Next.js and, at the same time, build something useful for me - Yes, I'm looking for a house to buy!
The web scraping part isn't going to be the focus of this blog post. If you are interested in that, there are several solutions out there to use. Instead, I'm going to focus on the automation part, i.e., after having the web scraper ready to collect data, how can I turn it into an autonomous solution that periodically collects data?
Free stuff is awesome
I don't like spending money, even less on services for personal projects. If possible, I always look for free solutions available in the market to build my projects - Portuguese people love free stuff even if it is not useful 🇵🇹.
In the particular case of running the web scraper periodically, I had three solutions in mind:
- Create a cron job on my personal computer and run the scraper there;
- Rent a VPS, deploy the scraper, and create a cron job to run it periodically;
- Use GitHub Actions to run my scraper.
Solution 1) was free but required my computer to be on 24/7 to guarantee that the scraper collects all the daily information. The free part was cool, but I don't always have my computer turned on, so this solution is a NO ❌.
Solution 2) solves the problem of keeping my personal computer always on, but a VPS costs money, right? Besides that, some of them can take a few hours to configure, which can be kinda boring when you are excited to get your personal project up and running. This solution is a NO ❌.
Solution 3) has the best of both worlds. It is free and doesn't require me to have a personal computer running 24/7. According to GitHub, GitHub Actions are "individual tasks that you can combine to create jobs and customize your workflow". The cool part is that you can create your own custom actions or use actions from the community, which can save you a lot of time. This solution is a YES ✅.
Basically, GitHub Actions gave me what I needed: a way to set up a job that runs the web scraper and schedule it to run every day at 00:00.
Here is the action configuration:
```yaml
on:
  schedule:
    - cron: '0 0 * * *'

name: Scrap Data

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Build
        working-directory: ./scraper
        run: npm install
      - name: Scrape
        env:
          DB_PATH: ../client/data/db.json
        working-directory: ./scraper
        run: npm run scraper
      - uses: mikeal/publish-to-github-action@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you
```
This action is split into two parts:
- Trigger
This part is where I schedule the job. I'm using a temporal trigger, but other types of triggers can be used, for example, running each time a commit is pushed to master (there's a short sketch of that right after this list).
- Jobs Configuration
This part is where I configure which jobs should run when the action is triggered. In this particular example, I'm configuring a job named Build that runs the following steps:
- Check out the master branch;
- Install the scraper dependencies;
- Run the scraper (a rough sketch of what this script might look like is shown further below);
- Push the collected information back to the master branch. This last step uses the publish-to-github-action GitHub Action.
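As mentioned above, the schedule trigger is not the only option. Just as an illustration (this snippet is not part of my workflow), a push trigger that runs the same jobs on every commit to master would look roughly like this:

```yaml
# Alternative trigger: run the workflow on every push to master
# instead of (or in addition to) the cron schedule.
on:
  push:
    branches:
      - master
```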
You can find more information about jobs in the official documentation.
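The scraper itself is out of scope for this post, but to make the Scrape step a bit more concrete, here is a minimal sketch of what the script behind npm run scraper could look like. Everything in it except the DB_PATH environment variable (which comes from the workflow above) is a made-up placeholder; the real scraper does the actual fetching and parsing:

```javascript
// scraper/index.js - hypothetical entry point for `npm run scraper`.
// The only real contract with the workflow is DB_PATH: the script must
// write the collected data to that file so the last step can commit it.
const fs = require('fs');
const path = require('path');

async function scrape() {
  // Placeholder: fetch and parse the real estate listings here.
  const listings = [{ id: 'example-1', price: 250000, location: 'Lisboa' }];

  const dbPath = path.resolve(process.env.DB_PATH || './data/db.json');
  fs.mkdirSync(path.dirname(dbPath), { recursive: true });
  fs.writeFileSync(dbPath, JSON.stringify(listings, null, 2));
  console.log(`Saved ${listings.length} listings to ${dbPath}`);
}

scrape().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit makes the GitHub Actions step fail
});
```

Because the last step pushes the repository back to master, each daily run ends up committing the updated db.json, so the repo history keeps a snapshot of the listings per day.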
Voilà! I have a fully working web scraper collecting data every day for me without spending a single €.
I hope you liked it.
Happy coding, Nuno Cruz