Scrapy is a Python application framework for crawling web sites and extracting structured data. I challenged myself to see if I could write a web crawler that I could use to crawl this blog and scrape all of the post titles. Here’s the code I ended up using.
import scrapy
class BlogSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['https://technicalagain.com/']
def parse(self, response):
for title in response.css('h1.title'):
yield {'title': title.css('a ::text').extract_first()}
for next_page in response.css('div.nav-previous > a'):
yield response.follow(next_page, self.parse)
After writing the crawler above, one simple command then executes the crawler and writes the output to a .csv or .json file which is shown at the top of the post.
scrapy crawl blogspider -o blogs.json
There are duplicates due to the way the “previous” page renders in WordPress. I could put in some extra logic to remove duplicates but that can be a challenge for another day. This got the job done and is a nice utility that can come in handy.