To crawl or not to crawl, quickly crawling a site for fun and profit

Today I wanted to compare the page structures for 2 sites – one was a new version about to go live, and I wanted to compare it with the current version. I basically wanted to check that in doing our updates we’d left the URL structure fairly untouched (a few things had been added), and figured crawling them both then comparing the output would be a fairly sane way of doing that.
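
For what it’s worth, the comparison itself is the easy part once each crawl has produced a list of URLs. Here’s a minimal sketch in Python, assuming each crawl gives you a plain list of absolute URLs. The hostnames and the `to_path` / `compare_structures` names are made up for illustration, and the normalisation rules (ignoring query strings, trailing slashes and case) are assumptions you may want to change for your own site.

```python
from urllib.parse import urlsplit


def to_path(url):
    """Reduce an absolute URL to a host-independent, normalised path."""
    parts = urlsplit(url)
    # Treat "/about" and "/about/" as the same page; keep "/" for the root.
    # Query strings and fragments are deliberately ignored (an assumption).
    path = parts.path.rstrip("/") or "/"
    # Lower-casing is an assumption that suits case-insensitive servers.
    return path.lower()


def compare_structures(old_urls, new_urls):
    """Return (removed, added): paths only in the old crawl, paths only in the new."""
    old_paths = {to_path(u) for u in old_urls}
    new_paths = {to_path(u) for u in new_urls}
    return sorted(old_paths - new_paths), sorted(new_paths - old_paths)


if __name__ == "__main__":
    # Hypothetical crawl results for the live site and the new version.
    old = [
        "http://www.example.com/",
        "http://www.example.com/About/",
        "http://www.example.com/contact",
    ]
    new = [
        "http://staging.example.com/",
        "http://staging.example.com/about",
        "http://staging.example.com/news",
    ]
    removed, added = compare_structures(old, new)
    print("removed:", removed)
    print("added:", added)
```

Stripping the host first is what lets a staging hostname be compared against the live one; everything after that is just two set differences.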

I’ve previously used – their free service is handy for small sites with under 500 links, but the site I wanted to compare had a pretty complex structure that resulted in a large number of links, so that was out of the question. So I figured, surely there would be a simple piece of code or a free util that could solve this problem for me quickly?

Using wget was the first and most obvious option, and there’s an answer on Stack Overflow which looks pretty good (and doesn’t suggest solving this problem using jQuery). However, on running it I realised pretty quickly that I didn’t have sed installed, and all the sed options for Windows looked less than ideal. If you were a Unixy person, or a Windows person with a full set of Unixy tools installed, you’d be fine here. However, I figured there’d be another simple option, and wasn’t keen to mess with a bunch of installers for things I’d never use again, so I moved on.

Next was PowerShell. I figured someone would have written a PS script to do this – and they had! There’s a whole load of PowerShell options, but none of them seemed to fly for me – bear in mind, at this point I’m still assuming this will be simple, so anything which doesn’t work straight off the bat gets discarded. I tried a couple of things: “PowerShell script to make an XML sitemap” and “Generate SharePoint 2010 Sitemap with Windows PowerShell”. Neither of them worked, which could be due to PowerShell changing somewhat since those posts were published, but either way I decided it was time to move on. I’ve had this recurring feeling for the past 5-6 years that PowerShell is something that *should* be incredibly useful to me, and yet I’ve never found a way to leverage it. This reinforced that feeling. My forays into PowerShell sitemap generation also taught me that it’s something that’s apparently very important to SharePoint people. Scary.

At this point I’m thinking that maybe I’m going about this all wrong. I’d started off looking at Google's list of sitemap generators but had stayed away from the Windows executables. Maybe they were worth a look? Well, the first one found a great way to put me off pretty quickly: “Requirements: Windows versions 95,98, ME, 2000, XP, 2003 Server, Internet Explorer v. 5.5 & up.” I think I’ll pass, thanks. I’m sure there are a couple of good freeware or paid apps out there, but none of the ones I looked at had been updated for a long time, and none looked like anything I’d want to install.

Right about now, I’m realising that my quest to get a quick tool or script and have run it against both sites within a few minutes was pretty futile, and I was wishing I’d just paid for the pro version of, as if I had I’d be done with this by now. But once you’ve spent a chunk of time on something you get pretty determined to solve it in a way that you’re happy with, and for me that meant something that was reusable and didn’t involve a monthly fee (not saying the monthly fee is high, just that it’s something that I’d use once or twice a year, so a monthly subscription is a waste).

So we move to C# solutions. There were quite a few quick attempts, a few Stack Overflow posts that were half-baked 20-25 lines of code which wouldn’t help anyone, and a range of other stuff. Do the search yourself and sort through it if you want to waste some time. In the end, though, I came across Abot.

Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc..). You just register for events to process the page data. You can also plugin your own implementations of core interfaces to take complete control over the crawl process.

That’ll do nicely, thanks. Abot seems to have been updated recently and regularly, and does exactly what I wanted. There seems to be something in the unit test project which was causing the copy of Visual Studio 2012 on my desktop to crash, but I just didn’t load that project and all was well. I used the demo program included in the solution, modified the config, and had what I needed to stop wasting my time.
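
For anyone curious what that “low level plumbing” amounts to at its very simplest, here’s a rough single-threaded sketch in Python using only the standard library. To be clear, this is not Abot’s code or API, just an illustration of the fetch, parse links, queue loop that any crawler is built around. The names `LinkParser`, `fetch` and `crawl` are hypothetical, and it assumes you only want same-host pages and don’t care about robots.txt, politeness delays or retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (default fetcher; no retries or delays)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=500):
    """Breadth-first crawl of a single host; returns the sorted URLs visited."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Stay on the starting host and never queue a URL twice.
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(visited)
```

Calling `crawl("http://www.example.com/")` (a placeholder address) would return the list of pages reached from the home page, which is the sort of output you’d then compare between two sites. A real crawler like Abot adds the multithreading, scheduling and configuration that this sketch leaves out.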

In hindsight I’d have paid a few bucks for a month’s subscription to, but after the first 30 minutes of investigation it became a point of principle. Next time this happens, I’ve got a nice chunk of code to use.

Despite the fact that this was a gigantic waste of time, it was quite interesting to see how outdated a lot of the sitemap generators / crawlers are. Maybe I wasn’t searching well, although I did rope a few other people into searching too, so I don’t think that’s the case.

Anyway, learn from my mistakes!


Posted on Thursday, December 19, 2013 8:04 PM


  • # re: To crawl or not to crawl, quickly crawling a site for fun and profit
    Commented on 12/23/2013 8:14 AM

    Hi Ross!
    You can also check, my little side-project. Please contact me after you register and I'll give you bunch of crawling credits:) New version will soon be live, but you can use the current one to find broken links, slow pages, bad titles, etc.

  • # re: To crawl or not to crawl, quickly crawling a site for fun and profit
    Commented on 11/14/2017 2:49 PM

    Sounds like you missed both the best free and paid desktop tools at the time, including mine, A1 Sitemap Generator (paid only at the time, though). It has been in active development since 2005. Go figure: the first raw sitemapping code took two days and worked, and now, 12 years on, the software and its crawler are still worked on regularly.
