In Pursuit of Data Happiness...: 2009

Thursday, October 1, 2009

Google Wave

Just watched the preview of Google Wave. It's rightly described as a communication and collaboration tool. The basic idea is that you start a wave and add multiple participants to it in order to share your communication. Pretty much like a bulletin board or forum, except much cooler and smarter.

Things I went “wow” about:

• Open Source: Yes, Wave is going to be open source and it has some amazing embed APIs. Personally, I love the idea of open source, simply because it’s collaborative.

• Real time: My jaw dropped when I watched this. Wave has real-time, character-by-character transfer as the conversation happens. Yes, no more of the dreaded “... is typing” messages to wait on.

• Playback: Really useful feature that does a very simple thing. Plays back all messages in a conversation chronologically.

• Picture sharing: With waves, pictures can be part of your conversation. A simple drag and drop instantly transfers thumbnails to the other participants while the actual images upload.

• Embed into blogs: This one’s exceptionally cool; an incredible new twist to blogging. Drop a blog icon onto your wave, and the conversation is turned into a post on your blog! And yes, as you edit your wave conversations, the updates show up live on your blog!

• Collaboratively author documents

Once again, Google Rocks!

Monday, September 21, 2009

MySQL for Database Administrators

I just completed my week-long training, “MySQL for Database Administrators”, at Sun Microsystems, Santa Clara. Now I am back at work.

It’s almost amazing how much this training can transform you. It’s one thing to build applications using MySQL, but a totally different perspective is required to administer a MySQL database. Yes, your development team is almost going to hate you for this.

Anyway, this course is about how MySQL really works, the ultimate truth, and how you can turn that to your advantage. Some very interesting topics covered are the NDB Cluster and Blackhole storage engines, and database and server optimization. You also get hands-on practice with replication, backup and recovery techniques and such.

For this training, they also offered their new hands-on MySQL DBA 5.1 certification at the end of the course.

Yes, I passed.

Tuesday, September 15, 2009

Integrity Vs Performance

It's not an easy decision. If you love your data as much as I do, you probably have nightmares all the time about your server crashing. And so you have daily backups set up, and even binary logging on a separate physical device, maybe even clusters, and yet you remain paranoid.

On the other hand, you have your application team constantly complaining about how your databases are sooooo slow. Yes, they don't have any idea how much work it took to organize the data the way it is, choosing the right storage engines, optimizing the tables ever so frequently, and tuning the several hundred server variables just to get to this point.

The fact remains, you can't have a perfect system. You have to compromise, depending on your use case.

It's all about the balance...

Tuesday, September 1, 2009

Gmail Outage

So, Gmail is down. Service outage.

Complaints and frustrations are pouring in, even though some users have been experiencing the downtime for only a few minutes. Innumerable tweets about the outage have made Gmail one of the Trending Topics on Twitter. TechCrunch posted a related article about 7 minutes ago, and there are 115 comments already.

I just want to say, Google rocks!

P.S. Yes, they are continually assuring us that they are looking into the problem. They have also put up videos with step-by-step instructions on how to access your email via IMAP or POP.

Check status at: http://www.google.com/appsstatus#rm=1&di=1&hl=en

Monday, August 31, 2009

Data Marketing

There are loads of websites out there selling data. Whether you want real estate agents, or small businesses, or email addresses of naïve people who never noticed that little checkbox stating that they would be receiving emails from third party vendors; it’s all out there for sale.

How do they collect this data? A lot of companies take the easy route: they buy it from another data company. First-hand data is mostly collected at the bottom of the chain. The most common way is to own or operate a website that requires sign-up before you receive further service or information. You sign up by submitting personal information such as your name, email address, etc., and you become a fresh record in their database, worth money. I have to mention here that not all websites that require sign-up sell your information. But those that do have it stated (sometimes not so clearly) on their websites.

Unfortunately, a lot of the data collected this way is dirty, and few data companies actually care to clean it up. You can get a zip code in the city field, and you shouldn’t be surprised. The contact phone number may have only 7 digits. A big percentage of the data is fake.
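To make the two examples above concrete, here is a minimal sketch of the kind of check a cleanup pass might run. It assumes US-style records stored as plain dictionaries; the field names `city` and `phone` and the function `find_issues` are made up for illustration, and a real cleanup would need many more rules.

```python
import re

# Assumption: US-style zip codes, either 5 digits or ZIP+4.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def find_issues(record):
    """Return a list of problems found in one contact record (hypothetical schema)."""
    issues = []

    # A value that looks exactly like a zip code does not belong in the city field.
    city = record.get("city", "").strip()
    if ZIP_RE.match(city):
        issues.append("city field contains a zip code")

    # Strip punctuation and spaces, then count the digits that remain.
    digits = re.sub(r"\D", "", record.get("phone", ""))
    if len(digits) == 7:
        issues.append("phone number is missing its area code")
    elif len(digits) not in (10, 11):
        issues.append("phone number has an unexpected length")

    return issues


if __name__ == "__main__":
    print(find_issues({"city": "95050", "phone": "555-1234"}))
    print(find_issues({"city": "Santa Clara", "phone": "(408) 555-1234"}))
```

Checks like these only flag badly formed values; they cannot tell you whether a well-formed record is fake.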

Of course, it goes without saying that clean data is worth more than dirty data. Something that fits beautifully into your RDBMS setting is even better.

Pipl.com

The first time I looked at this site it felt like any other people search engine. But a deeper look revealed the deep web. “They did it!” I said to myself. They finally reached the “deep” web!

The web is great: if you have a question, there is an answer. But when it comes to discovering things about individuals, the web is rewarding only in certain cases, for example if you are a celebrity. But what if I want to know something about “Joe Banks”, my plumber?

The deep web is where “we” reside. Realty sites could have you as the homeowner selling your property. Social networking sites could have you connected to your university network. A jobs site could have your application to the post of a Rocket Scientist.

A lot of information about us is stored in public online databases. Most search engines use a general spider that traverses links and indexes web pages in the process. Pipl.com navigates through underlying content, such as that held in online databases, to provide a more comprehensive profile of an individual.
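A toy sketch of why a link-following spider misses this kind of content. Everything here is hypothetical: the `LINKS` table stands in for a tiny website, and the profile page exists only as the result of a search query, so no other page links to it. This illustrates the general idea and makes no claim about how Pipl.com or any real search engine is built.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],                   # a search form; results appear only after a query
    "/profile?name=joe+banks": [],   # exists, but no page links to it
}


def crawl(start):
    """Breadth-first traversal that only follows links, like a general spider."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


if __name__ == "__main__":
    found = crawl("/")
    print(sorted(found))
    print("/profile?name=joe+banks" in found)
```

The profile page is never reached because the crawl never submits the search form; getting to it means querying the database behind the form.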

To me, pipl.com has opened up a whole new dimension of data gathering.

Protect Thy Data on the Web

Yes, this just had to be my first post. I am not just continually obsessed with protecting my hard-earned data, but also fascinated by newer ways of breaching security borders. You constantly hear about an injection or a hack, and in the days that follow, there comes a patch. Let’s try not to give them a chance.

Everyone wants to show off their data on the web. That’s great; that’s probably how you are making money. But the web is vulnerable, and we need to be proactive about protecting our data. It makes me think that hiring a crawler engineer as part of the QA/Security team may not be a bad idea. If your crawler engineer can hack your site, a lot of other people can.

Off the top of my head:

• Use captchas (although captchas can be overcome)
• Play with your cookies
• Encrypt
• Try to use POST instead of GET for more of your requests
• Use robots.txt
• Analyze web requests using a network analyzer such as Wireshark
• Consider limiting the number of lookups per day per IP
• Monitor times between consecutive searches; shorter times may suggest robot activity
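The last two items can be sketched in a few lines. This is only an illustration under assumed numbers: the limit of 100 lookups per day, the 2-second threshold, and the `LookupMonitor` class are all made up, and a real site would keep this state somewhere shared (a database or cache) rather than in process memory.

```python
import time
from collections import defaultdict

MAX_LOOKUPS_PER_DAY = 100    # assumed limit, tune for your site
MIN_SECONDS_BETWEEN = 2.0    # assumed threshold for "too fast to be human"
DAY_SECONDS = 24 * 60 * 60


class LookupMonitor:
    """Tracks lookup timestamps per IP address (hypothetical helper)."""

    def __init__(self):
        self._history = defaultdict(list)

    def check(self, ip, now=None):
        """Return 'ok', 'suspicious' or 'blocked' for a lookup from this IP."""
        now = time.time() if now is None else now

        # Keep only the lookups made within the last 24 hours.
        recent = [t for t in self._history[ip] if now - t < DAY_SECONDS]

        if len(recent) >= MAX_LOOKUPS_PER_DAY:
            self._history[ip] = recent
            return "blocked"

        # A very short gap since the previous lookup may suggest a robot.
        verdict = "ok"
        if recent and now - recent[-1] < MIN_SECONDS_BETWEEN:
            verdict = "suspicious"

        recent.append(now)
        self._history[ip] = recent
        return verdict


if __name__ == "__main__":
    monitor = LookupMonitor()
    print(monitor.check("203.0.113.7", now=0.0))
    print(monitor.check("203.0.113.7", now=0.5))
```

A "suspicious" verdict is a hint rather than proof; it could trigger a captcha instead of an outright block.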

Be Safe!
