Indexes and tags

I thought it would be a good idea to begin tagging my posts as a way to resurrect some of the older ones. It would also be fun to get a high-level look at what I have written about over the years. For several weeks now I’ve been on a sort-of tagging R&D marathon and I thought it might be helpful to give a little update on my progress.

My first thought on tagging was that I should have been doing it from the beginning, as I posted. That seems reasonable, but it has been very difficult for me to keep up with, and clearly I have not done a good job. On top of that, the tags that I did have are long gone now.

The manual approach

Since I didn’t have any tags, I figured I’d have to add them by hand. I knew of several WordPress plugins that could help me tag my posts en masse, so that would get me started. I installed Simple Tags, and then promptly froze.

At the time of writing I have 776 posts. If I were to spend a mere 30 seconds per post, it would take me about 6.5 hours to get through all of them. But we all know 30 seconds is impossible. Most of the time I don’t remember what a post is about, so I have to re-read it.

Let’s pretend that I read at an average rate of 300 words per minute. There are 157,462 words in total across all my posts, which means it would take me about 8.75 hours to read them all, and that still doesn’t account for actually typing the tags or deciding which ones to use. So let’s add back in that 6.5 hours for a grand total of roughly 15.25 hours. That’s pretty daunting.
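For anyone who wants to run the same back-of-the-envelope estimate on their own blog, here is a minimal Python sketch of the arithmetic above. The function and variable names are my own for illustration; only the four input numbers come from this post.

```python
# Inputs taken from the post; everything else is an illustrative assumption.
POSTS = 776
TOTAL_WORDS = 157_462
SECONDS_PER_POST = 30      # optimistic time to tag one post
WORDS_PER_MINUTE = 300     # assumed average reading speed


def tagging_hours(posts, seconds_per_post):
    """Hours spent purely on tagging, at a fixed time per post."""
    return posts * seconds_per_post / 3600


def reading_hours(total_words, words_per_minute):
    """Hours spent re-reading every post at a fixed reading speed."""
    return total_words / words_per_minute / 60


tagging = tagging_hours(POSTS, SECONDS_PER_POST)
reading = reading_hours(TOTAL_WORDS, WORDS_PER_MINUTE)
total = tagging + reading

print(f"tagging: {tagging:.2f} h, reading: {reading:.2f} h, total: {total:.2f} h")
# prints "tagging: 6.47 h, reading: 8.75 h, total: 15.21 h"
```

The unrounded total comes out a touch under the 15.25 hours quoted above, because that figure adds the already rounded 6.5 and 8.75. Either way, it is two full working days.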

The automated approach

I noticed that Simple Tags could automatically tag my posts using some sort of Natural Language Processing API. I went out and got some API keys for the free services and plugged them in. AlchemyAPI seemed to do the best job so I started popping open posts and running through the process of tagging via that service. That was still taking too long though, and the results weren’t that great.

I found myself supplementing each set of tags with tags that were a better fit for the post. I was doing the same thing I was before, but I had added an extra step. This was going to take forever.

The semi-automated approach

My next idea was to use AlchemyAPI to generate tags for all of my posts at once, and then manually update them in a spreadsheet. Once I finished, I would import them all and be done with it. I would have all the tag data plus the benefit of hand-selecting each one.

I downloaded the Python SDK and whipped up a little script. In less than an hour I had tags for half the posts. I only had half because I met my API quota for the day. No big deal. However, once I started analyzing the results, I realized that they really weren’t going to work. As cool as the AlchemyAPI service is, it just wasn’t cutting it.
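I don’t have the original script handy, so the following is only a rough sketch of the workflow: suggest tags for each post, stop when the day’s quota runs out, and write everything to a CSV that can be cleaned up in a spreadsheet. It deliberately does not call AlchemyAPI. The `suggest_tags` function is a hypothetical word-frequency stand-in for the tagging service, and the column names, stopword list, and quota parameter are all assumptions.

```python
import csv
import io
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "that", "for", "with", "are", "about", "like", "this"}


def suggest_tags(text, limit=5):
    """Hypothetical stand-in for a tagging service: most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def export_suggestions(posts, out, daily_quota):
    """Write one CSV row per post until the (assumed) daily quota is used up.

    `posts` is an iterable of (post_id, title, body) tuples.
    Returns the number of posts processed.
    """
    writer = csv.writer(out)
    # "final_tags" is left blank on purpose: it gets filled in by hand later.
    writer.writerow(["post_id", "title", "suggested_tags", "final_tags"])
    processed = 0
    for post_id, title, body in posts:
        if processed >= daily_quota:
            break
        writer.writerow([post_id, title, ", ".join(suggest_tags(body)), ""])
        processed += 1
    return processed


# Example run with made-up posts and a quota of one request per day.
posts = [
    (1, "Indexes and tags", "Tags are like the index of a book. Tags help."),
    (2, "Hello world", "A first post for the blog."),
]
buffer = io.StringIO()
print(export_suggestions(posts, buffer, daily_quota=1))
print(buffer.getvalue())
```

Keeping a blank `final_tags` column next to the suggestions is the whole point of the semi-automated idea: the machine proposes, and I still get to hand-pick.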

Indexing software

Things were not working out the way I had hoped they would. Time for a new idea. I read that tags are like the index at the back of a book, and that really struck a chord. Maybe there was some indexing software that I could use to generate a list of keywords.

As I searched, one main theme became clear: indexing software is a pain in the ass. The task of creating an index (at least with software) is best left in the hands of a trained professional. That I am not. But there was another theme that surfaced as well…

Manual indexing

Take it old school, they say. The idea is that you go through all the copy that you want to index, highlighting words and concepts that you want to include, then you add them to a master list with references to the text…

Sounds a lot like what I started with. :)
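For what it’s worth, the master list from the manual approach is easy to picture as a data structure: each highlighted term points back to the posts it came from. Here is a minimal sketch, assuming the highlights have already been collected by hand into a dictionary of post title to terms. The `build_index` name and the input shape are my own invention, not something from any indexing tool.

```python
from collections import defaultdict


def build_index(highlights):
    """Turn {post_title: [terms]} into an alphabetized {term: [post_titles]}.

    Terms are trimmed and lowercased so "Tagging" and "tagging" share an entry.
    """
    index = defaultdict(set)
    for post_title, terms in highlights.items():
        for term in terms:
            index[term.strip().lower()].add(post_title)
    # Sort terms and references so the result reads like a book index.
    return {term: sorted(titles) for term, titles in sorted(index.items())}


# Made-up highlights for two posts.
highlights = {
    "Indexes and tags": ["Tagging", "WordPress", "indexing"],
    "Hello world": ["tagging"],
}
for term, titles in build_index(highlights).items():
    print(f"{term}: {', '.join(titles)}")
```

The code is the easy part. The highlighting is still the 15 hours.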