Category Archives: Collection

When the State sells your personal data…

…it is with the Privacy Authority's blessing!

This case is typical of the current French system, but it may echo concerns elsewhere, so I have summarized my original French post here, so as to trigger discussion…

Basically, all French politicians claim they want to fight privacy breaches, especially when North-American internet companies are involved. Still, the findings are clear: the French State and its state-owned companies are doing exactly the same!

First, the rule: any form collecting your personal data must offer users the ability to give or refuse consent regarding any further use of their data. In France, this has been clearly stated in law since 1978 (the “Data Processing and Freedom” Act, a full concept in itself…).

There are two possible questions (“opt-in”, i.e. agree, and “opt-out”, i.e. disagree) and two ways of offering the answer (“active” and “passive”); this implies four different ways of collecting consent (see table below), and of course generates confusion.


The “passive” response mode is neither common nor recommended in France (even if it is not forbidden, as far as I know), but it is more often found on US websites. In this case, the check-box is already ticked, and the user’s answer is registered by default. To register another choice, one has to un-tick the check-box. Clearly, this option can only be used online.

For paper forms, one may focus on the “active” mode, where a check-box has to be ticked. The two remaining options are:

  1. Active opt-in = the user agrees, by ticking one or several check-boxes, that his/her personal data may be stored, reused, transferred or sold to third parties. This is the most respectful mode for the user, as only active opt-in guarantees that the user has chosen to give away his/her data. But this is not the norm…
  2. Active opt-out = the user disagrees, still by ticking a check-box, that his/her personal data be used. This mode is the most commonly used in France, and the Privacy Authority (CNIL) implicitly endorses this behavior. In the Q&A section of its website (only available in French), the CNIL mentions that the user may “oppose” personal data transfer to third parties or “refuse” that such data be used for commercial purposes. In other words, it endorses the opt-out mode.
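The two active modes boil down to what happens when the user does nothing. Here is a minimal sketch (the function name and the boolean modelling are mine, purely for illustration):

```python
# Minimal sketch of the two "active" consent modes described above.
# The check-box starts unticked in both cases; only its meaning differs.

def consent_given(mode: str, box_ticked: bool) -> bool:
    """True if the collector may reuse/sell the data."""
    if mode == "opt_in":    # ticking the box means "I agree"
        return box_ticked
    if mode == "opt_out":   # ticking the box means "I refuse"
        return not box_ticked
    raise ValueError(mode)

# A user who forgets (or fails to find) the check-box:
forgot_the_box = False
print(consent_given("opt_in", forgot_the_box))   # False: no consent by default
print(consent_given("opt_out", forgot_the_box))  # True: included by default
```

The asymmetry is the whole point: under opt-out, forgetting the box is treated as consent.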

Of course, many users simply forget to tick the check-boxes (or worse, do not find them), and hence are included by default in files sold to third parties, notably for business purposes. This may be understandable for private companies, but when it comes to the Government or to State-owned companies, it is far more disputable!

I covered two examples, taken from recent experiences, in the original French version of this post; I believe they may not be of interest to non-French speakers. My comments are backed up by solid material.

Hence, the French State collects – and resells – personal data from its citizens, while Google, Amazon or Facebook are blamed for doing the same… You say “contradiction”? I say “opportunism”.

For true personal data protection, alternative targeting tools have to be developed! “Fingerprinting” and “unique identifiers” are often mentioned, but there is also a non-intrusive option, based on the user’s online behavior. I am working on it… Willing to know more? Stay tuned and come back next week on this blog!

[This post is a summary of a longer original version written in the French-speaking section of; the original version notably includes pics and explanations of two opt-out examples]

Data Elicitation in three steps (2/3): Data Enrichment

The second step of elicitation is enrichment. Once your data have been patterned, you have to design their look and feel. This is what data enrichment is about.

Again, the key questions in this area:

  • How do I qualify my data? → a categorization scheme is key to facilitating relevant data extraction for future analysis
  • What may be missing or, on the contrary, uselessly kept? → choosing one’s data set is necessary for correct data acquisition
  • What type of addition is relevant, and what is really useful? → existing information may not be sufficient for carrying out marketing studies
  • How do I acquire additional attributes at an optimum price? → collect, derive or generate additional data at a proper cost
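As a toy illustration of the first question (qualifying data through a categorization scheme), here is a minimal sketch; the categories and vocabulary are invented for the example, not taken from any real scheme:

```python
# Hypothetical categorization scheme: each record is tagged against a
# fixed category dictionary so that relevant subsets can be extracted
# later for analysis.

CATEGORIES = {  # illustrative scheme only
    "sports": {"football", "tennis", "running"},
    "news": {"politics", "economy", "weather"},
}

def qualify(record_keywords):
    """Return the set of categories a record belongs to."""
    tags = set()
    for category, vocabulary in CATEGORIES.items():
        if vocabulary & set(record_keywords):
            tags.add(category)
    return tags

print(qualify(["tennis", "economy"]))  # {'sports', 'news'}, in either order
```

Once every record carries such tags, “extract all sports-related records” becomes a one-line filter instead of a manual review.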

No question, your data are rich, especially if an appropriate patterning lets you use them easily. But they can certainly be richer. Much richer. There are hundreds of ways to enrich data, but only two dimensions to consider: quantity and quality.

You may have tons of data that still do not fit your purposes. Or, on the contrary, scarce resources with a very high (and maybe hidden) value. Market Research companies used to call this data enrichment process “coding the dictionary”, a phrase that reflects the richness of the process, both on the quantity side (the number of words) and on the quality side (the clarity of the definitions). Getting the relevance out of the data is definitely a precious skill, and one of my own key proficiencies.

I shall definitely develop both aspects of data enrichment in future posts, but I wanted to cover them briefly in this introduction.

1. Quantity

One always seems to be missing data. More e-mails for more direct marketing contacts, more socio-demographics for a better segmentation, more inputs from the sales force for a more precise CRM, more, more, more…

As usual, this may be true. Or not! Is Facebook the best source for reaching a specific population? Surely not. For instance, should you want to reach people affected by albinism in North America, you would probably rather get in touch with NOAH. So, it depends on the purpose. And on your means to leverage a large amount of data.

Of course, I shall not dispute that a large database gives you more opportunities to reach your targets. But you had better do it with the maximum level of quality. I shall cover such topics as coverage, census vs. sample, and the long tail later on, as dealing with large databases is mostly a question of finding the right data at the right time.

2. Quality

Good quality is the heir of proper patterning. And quality is always the key to an efficient database. The specificity of quality improvement is that it involves all records, old and new. Unlike quantity (adding new records is an ongoing task; you seldom look back at past data), quality always requires looking ahead AND behind. Adding a new feature, adjusting existing attributes to new constraints, redesigning existing concepts: all this implies a full database review.

I shall also cover methods and tips for improving one’s database in the future. Still, the best piece of advice I can give is simple: think twice before starting. I have added below a simple example about the long tail of the internet.

Long Tail

This chart shows the top 1,000 websites, ranked by visits over a given time period, using their share of visits. The metric itself is not interesting; the data distribution is.

The top 50 websites (5% of the total records), all very well known, give fair coverage of the activity, i.e. roughly 60% of the visits. So, with a small set of data with a high level of recognition, we could get a good understanding of the activity. Good for global strategy and high-level analysis.

On the other hand, the bottom 500 websites account for roughly 1.5% of the visits. Too costly to reach for if you are working at the global strategy level, but of the highest interest if you are searching for a niche or a specific type of target audience.
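This kind of concentration arithmetic is easy to reproduce on synthetic data. The sketch below uses an illustrative Zipf-like distribution for 1,000 sites; the real figures behind the chart will of course differ:

```python
# Synthetic long tail: 1,000 sites whose visit counts follow a rough
# power law (Zipf-like decay). Shares are then cumulated by rank.

visits = [1000 / rank for rank in range(1, 1001)]  # rank 1 gets the most
total = sum(visits)

top50_share = sum(visits[:50]) / total        # head of the distribution
bottom500_share = sum(visits[500:]) / total   # the long tail

print(f"top 50 sites:     {top50_share:.1%} of visits")
print(f"bottom 500 sites: {bottom500_share:.1%} of visits")
```

Even with this crude model, the head of the curve captures a clearly disproportionate share of the activity, which is the whole point of the chart.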

There is no point in hesitating between a small database of high quality and an extra-large one with high sparsity. Again, the point is to have the database calibrated for your needs. And then enrich it wisely. And you know it by now, I have already told you: this is exactly what Data Elicitation is about!

Analytics without cookies? My follow-up to #MeasureCamp IV

As mentioned in my previous post “Giving up cookies for a new internet… The third age of targeting is at your door.”, I attended the fourth Measure Camp in London on March 29th. And my (voluntarily controversial) topic was: “Web Analytics without cookies?”

I introduced the subject with the following three charts, a short introduction to what I expected to be a discussion, and a hot discussion it was!

Measure Camp IV (post)

Basically, the discussion revolved around three topics:

  • Are cookies really going to disappear, and if so, which ones and how?
  • Are cookies disapproved of by users because of their privacy implications, or rather because of some all-too-aggressive third-party cookie strategies?
  • Are there any solutions, and by when do we need them?

Topic number 1 is definitely the most controversial. It is already difficult to imagine doing without what has been the basis of collection, targeting and analysis. On top of this, some valid objections were raised, such as the necessity of keeping first-party cookies for a decent browsing experience, as well as the request from a fair share of users to keep ads, provided they are relevant to them. A very good follow-up was brought by James Sandoval (Twitter: @checkyourfuel) and the BrightTag team. Thanks to them for their inputs.

Clearly, the participants all agreed that a cookie ban would only impact third-party cookies, and would occur for political reasons (maybe not for another 3 to 5 years), unless a huge privacy scandal ignites an accelerated decision process. Still, a fair amount of internet revenue would then be imperiled.

At this stage, there still remains the question of cookie acceptance by users. There is a wide consensus within the digital community that people browsing the internet accept a reasonable amount of cookie intrusion into their lives, provided it generates relevant ads. Actually, I think this view is biased, as nobody has ever asked whether people would rather browse with or without ads… The question has always been between “wild” and “reasoned” ad targeting… It reminds me of an oil company asking whether car drivers would rather fill up with diesel or unleaded, without allowing “electricity” as a valid answer…

So the question of cookie acceptance remains open in my eyes, and this may be a key driver to designing alternative solutions.

What options do we have at hand then?

The first and blatant one is better regulation of third-party cookies, especially the user’s ability to master how, when and with whom their first-party cookies could and should be shared, in an opt-in mode. EU law theoretically rules this (see EU rules about cookie consent here), through a warning to the user about cookies when he or she opens a new website. Still, national transpositions and the various ways web pages are developed have made this law hard to understand, and mostly not actionable on a global basis.

A first step would then be to abide by the user’s choice, and give them the ability to manage their own cookies, sharing some, all or none of them with third parties, as they wish. A difficult task, especially when nearly 30 government bodies are to be involved… So why not investigate non-cookie options?

In London, I have introduced two possible ways:

  1. Create a unique Id for each user, somewhat like Google’s unique Id, but managed by an independent body. My suggestion is that such an Id should belong to the whole community, like HTML or HTTP… A huge task.
  2. The other idea is mine… It would consist of generating anonymized profiles based on browsing patterns. I shall develop this idea in more detail in future posts, but it is worth considering, especially when one imagines that today’s user mood may not be tomorrow’s, requiring a very dynamic targeting methodology…
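To make idea #2 a little more concrete, here is a toy sketch of what a pattern-based, anonymized profile could look like. Everything in it (the rotating salt, the category buckets, the truncated hash) is my own illustrative assumption, not a specification:

```python
import hashlib

# Toy sketch: derive a short-lived, anonymized profile from a user's
# *current* browsing pattern instead of a persistent cookie.

DAILY_SALT = "2014-04-05"  # rotated daily: yesterday's ids cannot be linked

def profile_id(visited_categories):
    """Hash the pattern (set of interest categories), never the identity."""
    pattern = ",".join(sorted(set(visited_categories)))
    return hashlib.sha256((DAILY_SALT + pattern).encode()).hexdigest()[:12]

# Two users with the same current interests share one anonymous profile:
a = profile_id(["sports", "finance"])
b = profile_id(["finance", "sports"])
print(a == b)  # True: targetable as a group, never as individuals
```

The dynamic aspect mentioned above falls out naturally: when the user’s mood (hence pattern) changes, so does the profile.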

So this hot discussion on cookies has at least initiated discussions among the digital community. It also proved that fresh (and sometimes idealistic) views such as mine are necessary to keep the digital community on the cutting edge of innovation. So stay tuned, I shall go on providing food for thought so as to “shake the tree” of Measurement…

Giving up cookies for a new internet… The third age of targeting is at your door.

While preparing next week’s Measure Camp in London, I had been wondering what would be the most interesting topic in my eyes. And my question is: “How would Web Analytics work without cookies?”

Actually, last year, in September, I read an interesting post by Laurie Sullivan, posted on the site: “Where The Next Ad-Targeting Technology Might Come From”. It has been at the core of my thoughts for the past months, so I wanted to elaborate on Laurie’s post so as to introduce my own ideas on this topic.

I personally believe that the means of collecting information from web users through cookies is fading, and soon to disappear. There are many reasons for this, including user privacy concerns, the cookie’s lack of contextuality, and the development of multiple access points and devices, all of which make such data collection highly unreliable.

The disappearance of cookies would have an impact on at least three areas: data collection, targeting and analytics.

  • Data collection is highly based on cookies, especially when dealing with ad exposure and browsing habits. High impact.
  • Targeting is also based on cookies, as most tools use history to handle their most likely customers. High impact.
  • Analytics are also using cookies, especially for site-centric analysis as well as various page-level analysis. High impact.

Considering these high impacts, the time has come for more contextual and more behavioral targeting. We are now entering the third age of targeting. The first age was based on sociodemographics, widely used by TV ads or direct mail. The second age has been based on using past behavior to predict potential future actions and, on the internet, widely uses cookies to pursue this goal. The third age will be the age of context, targeting anonymous users with current common interests.

How will it work? One possible way: use network log files (provided by ISPs or telcos) to collect data, organize these data with a categorization at various levels and across multiple dimensions so as to generate rich but heterogeneous user clusters, and hence allow the targeting of potential customers based on ad-hoc inputs. I shall elaborate in further posts, especially regarding the process, but the main advantage is the respect of privacy, especially thanks to cookie avoidance…
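A rough sketch of the first steps of such a pipeline, with made-up log lines and an invented mini-taxonomy (none of these names come from a real system):

```python
from collections import Counter, defaultdict

# Step 1: map domains to categories (illustrative taxonomy only).
URL_CATEGORIES = {
    "lemonde.fr": "news", "bbc.co.uk": "news",
    "amazon.com": "shopping", "nike.com": "sports",
}

# Step 2: hypothetical ISP log extract as (user_id, domain) pairs.
log_lines = [
    ("u1", "lemonde.fr"), ("u1", "bbc.co.uk"), ("u1", "amazon.com"),
    ("u2", "nike.com"), ("u2", "amazon.com"),
]

# Step 3: build per-user interest vectors, ready to feed a clustering step.
profiles = defaultdict(Counter)
for user, domain in log_lines:
    profiles[user][URL_CATEGORIES.get(domain, "other")] += 1

print(dict(profiles["u1"]))  # {'news': 2, 'shopping': 1}
```

The clustering itself (grouping users with similar category vectors) would come on top; the point here is only that no cookie is involved at any step.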


So, yes, giving up cookies may be difficult; this is why I believe we ought to prepare to go on a diet as of today…

And act for alternative methodologies instead of shouting “me want cookies!”

Data entry in 2014: the vivid learnings of card punching

Pondering data quality checks? Considering elaborate, automated (and expensive) schemes? Let me suggest card punching.

Punch cards? Depending on your age, you will either take me for a fool (meaning you know what a punch card is…), or simply ask “what is this?”. Let’s start with a short presentation. Basically, a punch card looks like this:


This type of card was used to record processing instructions starting in the early eighteenth century, especially in highly standardized industries such as textiles; from a more romantic standpoint, it is also the basic material for barrel organs… It was then used mostly for data storage from the middle of the twentieth century, and it was one of the key drivers of IBM’s success in the early ages of Data Processing.

But punch cards have not been used since the late eighties… So, in this era of highly digitalized working environments, what can we learn from such “analog” tools?

Actually, the cards themselves are not the important thing; what matters is how they were used, and what we may re-use nowadays. The punching system required at least three major conditions to work, each of them reinforcing the quality of the whole process. These were: standardization, precise instructions and double data entry. Let me explain each in a few words.

  • Standardization was a prerequisite to implementing punch cards, as they had to be readable from one machine to another… IBM built their worldwide success on their 80-column standard (80 characters per card, which was also the width of a terminal screen…). Standardization is still an asset for Data Entry, especially to provide a homogeneous frame to any operator (or operating system) performing data entry into your systems. A well-designed data entry system endorsed by the whole company is already a very good step towards data quality.
  • Precise instructions are needed to ensure your processing flows smoothly, as one cannot afford to have two people understand a process differently, even slightly. When given multiple choices, the operator has to know what to do in ALL the potential options, so that no human factor can undermine quality. This is the step where machines are better, provided they do not have to do too much interpretation. For instance, reading an image and entering the data as a numeric or alphanumeric code is still quite difficult nowadays, even though the best engineers are working on it (see this Google project about cracking house number Id’s in Google Street View).
  • Double data entry was the key quality control for punch cards. The puncher/checker duo (in French, we call this “perfo/vérif”) was the most efficient way to ensure correct data entry in the past, as the small holes in the card were not self-explanatory, and mis-punching was easy. So double data entry was the best way to guarantee satisfactory levels of data quality, at least in a standard environment with regular levels of investment (some automated systems, especially those using the latest neural network techniques, may be more efficient, but are definitely more costly to implement…).

Let me elaborate on Double Data Entry (DDE); DDE is still a very efficient way of improving quality. The chart below sums it up clearly:

DDE Error Rate

The percentage of recorded errors falls dramatically when two people run the same process in parallel and then compare their results. Similar rates are reached when running the process in sequence, i.e. when someone checks the outputs of another (both methods are valid, the latter requiring supervision, i.e. a different human relationship between the two data entry operators…).
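The principle is simple enough to sketch in a few lines (a minimal illustration of the parallel comparison, with made-up records, not an actual punching workshop):

```python
# Double data entry (DDE): two operators key the same records
# independently; mismatches are flagged for arbitration instead of
# silently entering the database.

def double_entry_check(entry_a, entry_b):
    """Return the indices of records the two operators disagree on."""
    return [i for i, (a, b) in enumerate(zip(entry_a, entry_b)) if a != b]

operator_1 = ["4721", "0038", "9914", "2206"]
operator_2 = ["4721", "0088", "9914", "2206"]  # one mis-punch in record 1

print(double_entry_check(operator_1, operator_2))  # [1] -> re-key record 1
```

An error survives only if both operators make the same mistake on the same record, which is far less likely than either one slipping alone; hence the dramatic drop in the error rate.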

I understand that these statistics were collected ages ago, in times when the machine was good enough to be a repository but not an operator itself, but I strongly believe that the simplest methods should still be taken into account, at least where quality (customer satisfaction) is preferred over quantity (lowest costs). And even if one is rather keen on processing quantitatively, one cannot keep one’s customers in the long term without a decent (i.e. high) level of quality…

So, on top of more recent data processing methods, there is a lot to learn from card punching working methods. And much of it may well apply to your business… Should you have quality issues, you would certainly want to look into implementing such sound and simple working methods on top of your existing QC. And I would be glad to help you assess them.

Finally, there is still a way to use punch cards, albeit in a humorous mode… Maybe some nostalgic geek will love this way of googling: Punch Card Google… Still, the request processing looked a bit fast compared to my recollection 😉

Jumping over the data privacy fence to land on the green grass of optimization? See the hurdle?

Back to school today. Back to work. But back to school also meant back to my desk and also to my usual good reads.

Today I read the post “Beyond Privacy Exploitation Lies Huge Opportunity For Personal Data Optimization” by Max Kalehoff on MediaPost, and I believe he moves a bit quickly from his own personal point of view to a more general standpoint. Nevertheless, it is a good starting point for a post about data collection management.

The general idea, i.e. that people could share more data for a more optimized benefit, is nice, but only the first part is convincing, i.e. that there is still a lot of unexploited data to share. As the post lacks a proposal on how one could “optimize” personal data, here are some thoughts I have had on this topic.

Max’s story is nice, but his reasoning by induction is not correct. Not everyone is ready to share really personal data, like weight or daily burnt calories… Actually, it is most probably the reverse… Such data would most probably be hidden as much as possible, and their distribution limited to the smallest number of people, i.e. to none but the data owner himself… A marketer may dream of openness in such a case, but unfortunately I think this is far from real life.

And still, should anyone be ready to share such very personal data, would the distribution list be very large? I doubt it… A few friends and relatives, some key helpers (a fitness coach, a Weight Watchers leader…) and maybe a doctor; that would be it. The “challenge” idea is typical of someone living well with his/her weight, and only willing to lose a few pounds for a healthier life. Not the majority of people dealing with overweight (or with any other personal problem, for that matter). And most probably a very narrow minority, I guess.

But the idea remains of the highest interest: passively collect data to create a huge database with maximum objectivity. This is the daily routine for Market Researchers operating in Retail Panel Tracking, but very seldom achieved when it comes to people. Why? Because people are aware that they are being tracked, and this fact alone introduces a bias (I remember the famous bias of Consumer Panels, where no woman would ever buy any feminine hygiene item, and no man any condoms…). Where there is uneasiness or even shame, there will be no freely shared data. So what are we to do?

In this area, what is the absolute must? Not only the passivity of the data collection, but also the ignorance of the tracked people. The less they know they are tracked, the more objective the data. No question about it. A marketer’s dream come true. Of course, this is very cynical, and dangerous for any company acting like this, as no panelist would be asked for their consent (or, blatantly, they would be aware they are being tracked…). To refer to this post’s title, this would mean reducing the privacy fence to its lowest height, so as to be able to walk over it, without even jumping…

But this will not happen. The hurdle is high and will remain so. It is built of codes of conduct, regulations, even laws, and can hardly be ignored. So tracked people ought to know their data are collected. And objectivity will be lost. How do we get over this? Find people ready to share their data, bring them on board a panel with an opt-in clause, and run some analysis. A classical methodology to ensure privacy is not breached. The risk? A panel composed only of clones of Max Kalehoff, introducing another bias: the will to be tracked…

So, is there no way to get over the fence? There are some, of course. Cookies on the web are probably the most famous of all. Every website uses some. And they are (very) intrusive. And some people have started to complain, and even to request that they be blocked or erased. A solution will be required here too, for user targeting and browsing behavior optimization.

So Max, I do not have the solution either. Not yet… But I’m working on it.