The Crafting of WikiScraper

Tim Tatchell
2 min read · Aug 23, 2021

WikiScraper was originally going to be a single-page web scraper. However, once I realised the potential of object-oriented programming in Ruby, scraping the entirety of a website looked quite achievable. One page at a time.

I considered many websites for this project, but the instant I thought of Wikipedia I knew it would be the right choice. It’s a large website and each page *should* be in a consistent format. But, as with all things, that was easier said than done.

Getting started was easy enough: create a to-do list, build out the gem structure and connect to Git. I did some testing with HTTParty to make the requests and Nokogiri to parse the HTML that came back, and I was able to pull basic content from a Wikipedia article. The next step was to make it so I could pick ANY Wikipedia article to scrape.

Building all of my classes and their methods, while keeping them abstract, was (to me) the fun part. It fascinates me how a simple computer program is able to handle different inputs and still produce a consistent result.

Next, I ran into a problem. The HTML layout of Wikipedia was not as sturdy as I thought. It was more of a block of information that needed to be sorted. With no simple way to access the first paragraph, I developed a method that looks for the first suitable “p” element with content in it. I paired that summary with a title and boom, we have content.

Getting the topics from the article was the final tricky part. I used a combination of a heading array and a hash that held the content sitting under each heading. Each item in the block of information had to be checked for suitability (and for whether it had any text in it at all).
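The heading array and content hash could be sketched roughly as below. This is an assumption-heavy illustration rather than the real implementation: `Element` and `group_by_heading` are hypothetical names, the page is assumed to have already been reduced to simple tag-name and text pairs, and only "h2" and "p" items are treated as suitable.

```ruby
# Hypothetical stand-in for a parsed page element: a tag name and its text.
Element = Struct.new(:name, :text)

# Walks the elements in order and returns [headings, sections], where
# headings is an array of heading texts and sections maps each heading
# to the paragraphs that sit under it. Items with no text are skipped.
def group_by_heading(elements)
  headings = []
  sections = {}
  current = nil

  elements.each do |element|
    text = element.text.to_s.strip
    next if text.empty? # unsuitable: nothing to show

    if element.name == "h2"
      current = text
      headings << current
      sections[current] = []
    elsif element.name == "p" && current
      sections[current] << text
    end
  end

  [headings, sections]
end

elements = [
  Element.new("p", "Intro text before any heading."),
  Element.new("h2", "History"),
  Element.new("p", "  "),
  Element.new("p", "Some history."),
  Element.new("h2", "Usage"),
  Element.new("p", "Some usage notes."),
  Element.new("p", "More usage notes.")
]

headings, sections = group_by_heading(elements)
```

Keeping the headings in their own array makes it easy to print a numbered menu of topics, while the hash gives direct access to whatever sits under the chosen one.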

With the content working, I next needed a way to search for any article on Wikipedia. While it was only simple string interpolation to look up an article, it did the job. I could now search and scrape any article on Wikipedia quickly and easily.
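As a rough idea of what that string interpolation can look like, here is a small sketch. The constant `WIKIPEDIA_BASE` and the method `article_url` are illustrative names, not taken from WikiScraper, and it assumes the English Wikipedia URL pattern where spaces in a title become underscores.

```ruby
require "uri"

# Assumed base path for English Wikipedia articles.
WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki"

# Hypothetical helper: turns a search term into an article URL by
# collapsing whitespace into underscores, percent-encoding the rest,
# and interpolating the result into the base path.
def article_url(query)
  title = query.strip.split(/\s+/).join("_")
  "#{WIKIPEDIA_BASE}/#{URI.encode_www_form_component(title)}"
end

url = article_url("Ruby on Rails")
puts url
```

The resulting URL can then be handed to whatever makes the request (HTTParty in my case). A term with no matching article still needs handling on the response side.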

Once this was complete, the rest was straightforward: designing the program flow so it was easy to follow, and ironing out any pesky bugs or simple errors.

Done. And I couldn’t have enjoyed the process any more.

I am keen to see what Ruby on Rails can bring.
