When Google+ launched and I got one of the early invites some two weeks ago, I was pretty damn excited. Not only is Google+ a welcome new place on the Web to hang out – no pun intended – I was also looking forward to starting some coding I’d had in mind for a while.
I started working on Google+ Counter a few days later, and within just three days more than 30,000 Google+ members had submitted their profiles.
So what exactly is Google+ Counter?
Well, first and foremost it’s my personal proof of concept for a couple of technologies I’d been planning to play with for quite some time.
At the time of writing, the only visible part is the Hall of Fame, a nice collection of the most-followed people on Google+. Hover over an image and you get some details. Click it and you’re taken to the member’s profile.
I do, however, already collect much more than just that.
And I’m planning to give users back more than just follower counts soon. Who needs those scores, anyway?
Before I move on, let’s discuss privacy and Google+ Counter for a moment.
It’s as simple as this: Google+ Counter can only access what’s visible publicly.
I don’t use any special API (in fact, there is no Google+ API yet), nor can my scripts log into Google+ as you, something that would usually be set up via OAuth. To Google+, Google+ Counter really just looks like any anonymous visitor. It’s not even signed in.
All the crawler scripts see is the minimum of information you’ve decided to make publicly available.
I’ve been intrigued by the technological challenges of creating services that scale for almost my entire professional life. For example, over at GrandCentrix we run a platform that drives the mobile experience for Germany’s Pop Idol TV show. When the show’s final airs, we literally see hundreds of thousands of requests coming in within seconds.
That kind of load is not easy to handle with out-of-the-box recipes.
But dealing with peak load is only one aspect of large-scale services. Handling massive amounts of data is another. And Google+ was sort of the ideal playground for me to start some experiments.
I built the frontend with PHP and jQuery. Don’t laugh, please. I know, it’s way more in vogue to do it in Rails these days. However, as there would not be so many dynamic parts on the site anyway, I chose to go for the old and reliable horse that is PHP.
Wait a minute? Not so many dynamic parts? Isn’t Google+ Counter all about analyzing a social network that is sort of a continuous moving target?
Sure it is. But the frontend part of Google+ Counter – and that’s the one leveraging a bit of PHP – consists almost entirely of static HTML pages.
Early on, I made a design decision to move all the hard, CPU-intensive work to background processes running asynchronously.
Actually, the only processing that happens in real time is when a user adds a new profile. Google+ Counter instantly reaches out to the new user’s profile, grabs a name and a follower count and provides immediate feedback.
The profile URL then gets normalized and enqueued for later processing.
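To make the normalize-and-enqueue step a bit more concrete, here is a minimal Python sketch. It is not the actual Google+ Counter code. It assumes that profile URLs carry a 21-digit numeric ID somewhere in their path (for example after a `/u/0/` prefix or before `/posts`), and the names `normalize_profile_url` and `ProfileQueue` are made up for illustration. A real setup would also use a persistent queue rather than an in-memory one.

```python
import re
from collections import deque
from urllib.parse import urlsplit

# Assumption: a profile ID is a 21-digit number in the URL path.
PROFILE_ID = re.compile(r"^\d{21}$")


def normalize_profile_url(raw):
    """Return a canonical profile URL, or None if no profile ID is found."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw  # tolerate URLs pasted without a scheme
    parts = urlsplit(raw)
    if parts.hostname != "plus.google.com":  # hostname is already lowercased
        return None
    for segment in parts.path.split("/"):
        if PROFILE_ID.match(segment):
            return "https://plus.google.com/" + segment
    return None


class ProfileQueue:
    """In-memory stand-in for the queue the background workers read from."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def enqueue(self, raw_url):
        """Queue a profile once; return False for invalid or duplicate URLs."""
        url = normalize_profile_url(raw_url)
        if url is None or url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def __len__(self):
        return len(self._pending)
```

Normalizing before enqueueing means that the same profile submitted as `plus.google.com/u/0/<id>/posts` and as `https://plus.google.com/<id>` ends up in the queue only once.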
Throughout the day, a set of hardcore Python worker processes is launched at 30-minute, 60-minute and daily intervals to build the static pages of the site.
The Hall of Fame is as plain an HTML page as it can get. No PHP scripts and no database roundtrips are involved at all. The worker processes take care of everything, from updating profile links to adjusting the paginated navigation at the top of the page.
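The idea of a worker that pre-renders a paginated, fully static Hall of Fame can be sketched roughly like this. Everything in it is an assumption made for illustration: the profile fields (`name`, `url`, `followers`), the file naming scheme, the page size and the very bare markup.

```python
import html
from pathlib import Path


def page_filename(number):
    """First page is index.html, later pages are page-2.html, page-3.html, ..."""
    return "index.html" if number == 1 else "page-%d.html" % number


def render_nav(current, total):
    """Paginated navigation for the top of the page; current page is not a link."""
    links = []
    for n in range(1, total + 1):
        if n == current:
            links.append("<strong>%d</strong>" % n)
        else:
            links.append('<a href="%s">%d</a>' % (page_filename(n), n))
    return "<nav>" + " ".join(links) + "</nav>"


def render_pages(profiles, per_page=50):
    """Return {filename: html} for all pages, ranked by follower count."""
    ranked = sorted(profiles, key=lambda p: p["followers"], reverse=True)
    chunks = [ranked[i:i + per_page] for i in range(0, len(ranked), per_page)] or [[]]
    pages = {}
    for number, chunk in enumerate(chunks, start=1):
        items = [
            '<li><a href="%s">%s</a> (%d followers)</li>'
            % (html.escape(p["url"], quote=True), html.escape(p["name"]), p["followers"])
            for p in chunk
        ]
        first_rank = (number - 1) * per_page + 1  # keep ranks continuous across pages
        lines = ["<!DOCTYPE html>", "<html><body>", render_nav(number, len(chunks)),
                 '<ol start="%d">' % first_rank] + items + ["</ol>", "</body></html>"]
        pages[page_filename(number)] = "\n".join(lines) + "\n"
    return pages


def write_pages(pages, out_dir):
    """Write the rendered pages to disk, where the web server picks them up."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, body in pages.items():
        (out / name).write_text(body, encoding="utf-8")
```

A worker would call `write_pages(render_pages(profiles), "public/")` at the end of each run, so a visitor only ever hits files on disk.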
By far the most difficult part was dealing with the data changing over time.
A newly submitted profile is added to the index within one hour at most. I’ve chosen this slightly larger interval because the initial parsing of a complete profile can take some time, depending on the number of public posts etc.
Subsequently, Google+ Counter updates all of its data on a half-hourly basis. That means we re-crawl all of the 30,000+ profiles every 30 minutes and extract profile and additional stream information. This includes keywords found in public posts.
We also keep a complete history, so we will soon be able to identify trends, not only in follower counts but also in hot topics, areas of expertise and much more. I’m not planning to offer a Google+ search engine, as I’m pretty sure Google will add that shortly.
But it opens up a number of interesting opportunities that I’ll talk about in another post.
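As a rough illustration of what keeping a history buys you, the following sketch stores one follower-count snapshot per crawl and ranks profiles by growth over a time window. The class name `FollowerHistory`, its methods and the in-memory storage are all assumptions for the sake of the example, not a description of the real data model.

```python
from collections import defaultdict


class FollowerHistory:
    """Keeps every (timestamp, followers) snapshot taken for each profile."""

    def __init__(self):
        self._snapshots = defaultdict(list)

    def record(self, profile_id, timestamp, followers):
        """Store the result of one crawl; timestamps are plain numbers here."""
        self._snapshots[profile_id].append((timestamp, followers))

    def growth(self, profile_id, since):
        """Follower gain between the first and last snapshot at or after `since`."""
        points = sorted(p for p in self._snapshots.get(profile_id, []) if p[0] >= since)
        if len(points) < 2:
            return 0  # not enough data to talk about a trend
        return points[-1][1] - points[0][1]

    def trending(self, since, limit=10):
        """Profiles with the largest gain first; ties are broken by profile ID."""
        gains = [(pid, self.growth(pid, since)) for pid in self._snapshots]
        gains.sort(key=lambda item: (-item[1], item[0]))
        return gains[:limit]
```

With a snapshot every 30 minutes, `trending(since=now - 86400)` would list the fastest-growing profiles of the last day, something a single current follower count cannot tell you.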
The Hall of Fame is updated hourly and it always incorporates the latest available data.
I use a number of techniques, from data compression to MapReduce, to allow the worker scripts to scale beautifully. So far, 30,000 listed profiles do not seem to do much harm, so I’m eager to see what will happen when I hit the 100,000 mark or even larger numbers.
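For readers who haven’t met the map/reduce pattern yet, here is a small sketch of how it could apply to the keywords found in public posts. It runs in a single process and is only meant to show the shape of the technique. The tokenizer, the tiny stopword list and the function names are assumptions for illustration.

```python
import re
from collections import Counter
from functools import reduce

WORD = re.compile(r"[a-z0-9+#]+")
# Assumption: a deliberately tiny stopword list, just for the example.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "is", "in", "on", "for"})


def map_posts(posts):
    """Map step: turn one profile's posts into partial keyword counts."""
    counts = Counter()
    for post in posts:
        counts.update(
            word for word in WORD.findall(post.lower())
            if len(word) > 1 and word not in STOPWORDS
        )
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def keyword_totals(posts_by_profile):
    """Run the map step per profile, then fold all partial counts together."""
    partials = map(map_posts, posts_by_profile.values())
    return reduce(reduce_counts, partials, Counter())
```

Because each map call only looks at one profile, the built-in `map` could be swapped for `multiprocessing.Pool.map` to spread the work over several processes, while the reduce step stays the same.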
So far, creating Google+ Counter has been a fun experience.
I’ve been focussing on developing mobile apps for some time. Most of those require some sort of backend, either for content management or for driving application logic. I found returning full force to web development and playing with large datasets very rewarding. And I hope what I’ve learned will also allow me to build better mobile experiences.