There are loads of websites out there selling data. Whether you want real estate agents, small businesses, or the email addresses of naïve people who never noticed that little checkbox stating that they would be receiving emails from third-party vendors, it's all out there for sale.
How do they collect this data? A lot of companies take the easy route: they buy it from another data company. First-hand data is mostly collected at the bottom of the chain. The most common way is to own or operate a website that requires signup before you receive further service or information. You sign up by submitting personal information such as your name and email address, and you become a fresh record in their database, worth money. I have to mention here that not all websites requiring signup sell your information. But those that do have it stated (sometimes not so clearly) on their websites.
Unfortunately, a lot of the data collected this way is dirty, and few data companies actually care to clean it up. You can get a zip code in the city field, and you shouldn't be surprised. The contact phone number may only be 7 digits. A big percentage of the data is fake.
Of course, it goes without saying that clean data is worth more than dirty data. Something that fits beautifully into your RDBMS schema is even better.
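To make "dirty" concrete, here is a minimal sketch of the kind of sanity checks a data buyer might run on a record before loading it. It assumes US-style 5-digit zip codes and 10-digit phone numbers, and the field names are made up for illustration:

```python
# A minimal sketch of sanity-checking a purchased record. Assumes US-style
# 5-digit zip codes and 10-digit phone numbers; field names are hypothetical.
import re

def validate(record):
    problems = []
    # A zip code sitting in the city field is a classic sign of dirty data.
    if re.fullmatch(r"\d{5}", record.get("city", "")):
        problems.append("city field contains a zip code")
    # A 7-digit phone number is missing its area code.
    digits = re.sub(r"\D", "", record.get("phone", ""))
    if len(digits) != 10:
        problems.append(f"phone has {len(digits)} digits, expected 10")
    # A very loose shape check on the email address.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        problems.append("email looks malformed")
    return problems

# This record shows two of the problems mentioned above.
print(validate({"city": "90210", "phone": "555-0100", "email": "joe@example.com"}))
# -> ['city field contains a zip code', 'phone has 7 digits, expected 10']
```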
Monday, August 31, 2009
Pipl.com
The first time I looked at this site, it felt like any other people search engine. But a deeper look revealed something more. "They did it!" I said to myself. They finally reached the "deep" web!
The web is great: if you have a question, there is an answer. But when it comes to discovering things about individuals, the web is rewarding only in certain cases, for example if you are a celebrity. But what if I want to know something about "Joe Banks", my plumber?
The deep web is where "we" reside. A realty site could list you as the homeowner selling your property. A social networking site could have you connected to your university network. A jobs site could have your application for the post of Rocket Scientist.
A lot of information about us is stored in public online databases. Most search engines use a general spider that traverses links, indexing web pages as it goes. Pipl.com digs into the underlying content, such as that stored in online databases, to build a more comprehensive profile of an individual.
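For contrast, here is a minimal sketch of the kind of link-following spider a general search engine runs. This is not Pipl's method (they haven't published one); the point is that a crawler like this only ever sees pages reachable through hyperlinks, so anything behind a search form or a login stays invisible to it:

```python
# A minimal sketch of a link-following spider, the kind a general search
# engine uses. It only discovers pages reachable through hyperlinks;
# anything behind a search form or a login (the deep web) stays invisible.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue = set(), [seed]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-HTML page: skip it
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links and queue only http(s) URLs.
        queue.extend(u for u in (urljoin(url, l) for l in parser.links)
                     if u.startswith("http"))
    return seen

print(crawl("https://example.com"))
```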
To me, pipl.com has opened up a whole new dimension of data gathering.
Protect Thy Data on the Web
Yes, this just had to be my first post. I am not just continually obsessed with protecting my hard-earned data, but also fascinated by newer ways of breaching security borders. You constantly hear about an injection or a hack, and in the days that follow, there comes a patch. Let's try not to give them a chance.
Everyone wants to show off their data on the web. That's great; that's probably how you are making money. But the web is vulnerable, and we need to be proactive about protecting our data. It makes me think hiring a crawler engineer as part of the QA/security team may not be a bad idea: if your crawler engineer can hack your site, a lot of other people can.
Off the top of my head:
• Use captchas (although captchas can be overcome)
• Play with your cookies; many naive bots don't maintain sessions properly
• Encrypt sensitive data, and serve it over HTTPS
• Prefer POST over GET for lookups, so results can't be scraped from simple URLs
• Use robots.txt (remembering it is only advisory: polite crawlers honor it, hostile ones ignore it)
• Analyze web requests using a network analyzer such as Wireshark
• Consider limiting the number of lookups per day per IP
• Monitor the time between consecutive searches; shorter gaps may suggest robot activity (a sketch of these last two ideas follows this list)
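Here is a minimal sketch of those last two ideas combined: capping lookups per IP per day, and flagging suspiciously short gaps between consecutive searches. The thresholds are arbitrary example values, not recommendations:

```python
# A minimal sketch of per-IP rate limiting plus timing checks. The limits
# (500/day, 2 seconds) are arbitrary example values, not recommendations.
import time
from collections import defaultdict

DAILY_LIMIT = 500      # example cap on lookups per IP per day
MIN_INTERVAL = 2.0     # seconds; anything faster looks like a robot

lookup_counts = defaultdict(int)  # ip -> lookups so far today
last_lookup = {}                  # ip -> timestamp of the previous search

def allow_lookup(ip):
    """Return True if this lookup should be served, False if blocked."""
    now = time.time()
    lookup_counts[ip] += 1
    # In a real system the counts would be reset once a day (e.g. by a
    # scheduled job); this sketch leaves that out.
    if lookup_counts[ip] > DAILY_LIMIT:
        return False  # over the daily cap: block, or show a captcha
    previous = last_lookup.get(ip)
    last_lookup[ip] = now
    if previous is not None and now - previous < MIN_INTERVAL:
        return False  # two searches too close together: likely a robot
    return True

# Example: the second call arrives too soon after the first and is rejected.
print(allow_lookup("203.0.113.7"))  # True
print(allow_lookup("203.0.113.7"))  # False
```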
Be Safe!