Big Data Assets series – Internet Interaction Data
After talking about the Man-2-Machine and the Machine-2-Machine Big Data assets, it’s time to take up one of the richest and most sophisticated sources of information you could productize: the internet interaction traces.
Based on your browser history, one could re-create your psychological profile… discovering facets of your personality that not even your closest relatives or your partner would ever know about. Nothing reveals so much about a particular individual: the history of all your purchases, all your searches, the places you have been, the health issues you have had, the football team you support, what you paid for your house, almost everything… It is effectively the logbook of modern life.
Of course, and thank God, this information is not available in full, but many of the traces we leave when we surf the web are. And this topic is so sensitive that the controversial debate about the use and abuse of cookies pops up over and over.
What was conceived to support some degree of customization and improve the user experience (like remembering fields already entered in a form or choices made in previous visits to a site) has become the cornerstone of gathering visitors’ information: items added to or removed from the shopping cart, the referring keyword, exposure and reaction to certain ads, the level of newsletter engagement, and each and every interaction, from mouse position tracking to dwell time and visit duration, in a long and ever-evolving etcetera… Sometimes this relies on questionable methods like the evercookie (an almost never-expiring, hard-to-delete cookie mechanism), and it is always at the center of the data protection debate.
Actually, there’s nothing bad in the initial idea, which ties back to filtering out information that is not relevant for the user based on what is known about that particular user… but when it goes too far, reluctance can only grow. When you search for a Christmas present for your wife and the day after she gets bombarded with ads showing what you searched for, it is truly too much.
The advent of social networks added a new dimension to the debate on personal information sharing: the disclosure of those who are digitally connected to you. The first social network ever recorded, the social graph based on phone calls and later text messages, is certainly much older than the internet itself, but its possibilities were limited: only how often, how long and how recently people communicated could be used to quantify the social relationships between them… and unlike nowadays, this information was only in the hands of mobile phone operators. In the social networks of the Web 2.0 era, the content available to determine the nature of the relationship between two given people is enormously rich and likely to be openly accessible (public tweets, or FB applications granted permission to leverage the breadth of the people connected to you). Natural Language Processing and Semantic Analysis techniques can exploit the content originating in the social interaction of two users to take social network analysis (SNA) to the next level.
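To make the "how often, how long and how recently" idea concrete, here is a minimal Python sketch of scoring relationships from call records. Everything in it is an assumption for illustration: the `Call` record, the `tie_strengths` function, the 30-day half-life and the way the three factors are multiplied together are hypothetical choices, not a formula any operator is known to use.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
import math


@dataclass(frozen=True)
class Call:
    """Hypothetical call detail record: who called whom, when, for how long."""
    caller: str
    callee: str
    start: datetime
    duration_s: int


def tie_strengths(calls, now, half_life_days=30.0):
    """Score each undirected pair of people by frequency, duration and recency.

    The score is log(1 + calls) * log(1 + minutes) * recency decay.
    This weighting is an illustrative assumption, not an established model.
    """
    stats = defaultdict(lambda: {"count": 0, "seconds": 0, "last": None})
    for call in calls:
        # Treat A->B and B->A as the same relationship.
        pair = tuple(sorted((call.caller, call.callee)))
        entry = stats[pair]
        entry["count"] += 1                      # how often
        entry["seconds"] += call.duration_s      # how long
        if entry["last"] is None or call.start > entry["last"]:
            entry["last"] = call.start           # how recently

    scores = {}
    for pair, entry in stats.items():
        days_since = (now - entry["last"]).total_seconds() / 86400
        recency = 0.5 ** (days_since / half_life_days)  # halves every half-life
        scores[pair] = (
            math.log1p(entry["count"])
            * math.log1p(entry["seconds"] / 60)
            * recency
        )
    return scores


# Made-up sample data: two recent, long calls versus one old, short call.
calls = [
    Call("ana", "ben", datetime(2024, 1, 10), 600),
    Call("ben", "ana", datetime(2024, 1, 20), 300),
    Call("ana", "cy", datetime(2023, 10, 1), 60),
]
scores = tie_strengths(calls, now=datetime(2024, 1, 30))
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

With this made-up data the ana/ben pair ranks above ana/cy, since it has more calls, more minutes and more recent contact. The logarithms are there only to stop one very long call from dominating the score; any other damping would serve the same purpose.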
There are several sites where aggregated information about users’ interaction with the World Wide Web can be queried at different levels of complexity. Google offers, at a very aggregated yet indicative level, the so-called Trends, where anyone can get information about how the volume of searches for a particular term evolves over time, even for a particular state or region. Additionally, the same tool offers related keywords and the possibility of comparing trends.
More sophisticated and maybe more powerful is the AdWords API, where you can query the estimated volume you can generate by booking a keyword at a given cost per click, for a particular city and, depending on the place, even at district level. Over time, you could create models to understand and predict the demand for a given keyword and thus model how the preferences of the people in a given area evolve.
Obviously nobody is in a better position to access and monetize the insights on individuals’ internet usage than Google: to the already monopolistic search, which after the Gmail introduction can be legally traced at user level (the so-called personal search), they added the information generated by their AdServer DoubleClick (with a market penetration of 80% in the States), the display content network (also bookable via AdWords), AdSense (one of the most popular ways of monetizing user-generated content sites, like blogs), Google Analytics (one of the best examples of a great Big Data product, which tracks the way people move through the sites that implement it), YouTube (generating precious statistics on people’s watching preferences), their G+ network and of course everything happening in your Gmail account. It shows how seriously Google is taking this topic.
But of course, they are not alone in this market. The display advertising industry is already evolving towards data-driven decision making, not only about which ad shall be served next (the traditional AdServer job), but also about which placement shall be purchased next and at which price (the so-called Real Time Bidding). To make this kind of decision, profiling users based on their surfing behavior is critical. Companies like BlueKai are being really successful in this area.
The only way of getting even more and better information would be going down a level in the OSI network model and applying deep packet inspection (DPI) techniques. And as scary as it sounds when we think of our internet connection at home or at work, it gets even more powerful as the internet goes mobile: you could attach a geo-location tag and a timestamp to any internet interaction performed by any user.
For example, this level of insight would change shops as we know them today: if a retailer knew that before going to her store you had been browsing the competitors’ sites, if a retailer knew that you were comparing prices right after leaving the store, if a retailer could get any information in advance about your preferences and your purchasing drivers… you’d certainly be offered a completely new buying experience, completely tailored to you…
And in spite of the terrifying data protection caveat, it would be a win-win situation: you just get targeted offers based on your commercial decision drivers and your circumstances, so you save time and effort, and so does the retailer!
(This post is part of the Big Data Assets series, discover them all!)