Category Archives: Categorization

Data Elicitation in three steps (1/3): Data Patterning

In my previous post, I outlined what Data Elicitation is about, and introduced the three areas required for a proper eliciting process, namely Data Patterning, Data Enrichment, and Data Analytics. This post deals with the first of these areas, “patterning“.

First, it is worth explaining why I chose this word. It is a typical concept from the textile industry (or, to be a bit more ambitious, from the “Haute Couture” world). There, a pattern is the intermediate stage between the designer’s sketches and the item’s production: a formalized plan enabling industrial planning. On the one hand, it is still a concept, like the sketches, as it exists only on paper. But on the other hand, it is already production, as it includes all the information necessary to implement a full production process. This is what patterning is all about: allowing people’s ideas to become physical shapes.

Patterning data is an essential step in data management, as it allows one to take stakeholder wishes and technical constraints into account, and to prepare an optimal project and development plan. No database is suitable if it is not driven by clients’ needs and requests. No data analysis is relevant if it is not aligned with the previously agreed pattern.

A few key questions in this respect:

  • What are my data made of and, even more important, made for? → one finds one’s way better when the ultimate goal is known…
  • How can data sets be best organized? → the proper content ought to be in the proper place
  • What content do I need to store to get the best out of my data? → since not every piece of information may be worth keeping
  • How may my data relations be best optimized? → data are more useful when they are properly linked and aligned

I have summarized a typical process in the table below, in three columns: a fashion analogy, some project management steps, and a basic (softened) example inspired by one of my previous experiences in the mobile world:

Patterning steps

The parallel with the creative flow in the Fashion industry is strong: it shows that the first half of the process (steps #1 to #4) is the true added value of the whole exercise, the second half being mostly execution. Clearly, botching the patterning phase will impede the proper completion of the project. In the case above, the proposed solution could be summarized in a small chart, as the information lay in two fields of the provided log files.

Patterning chart

The two attributes (fields) that were present in the log files are marked here in blue.

The TAC (the first part of the IMEI code) may be directly used, as it relates to one device model; the master database is maintained by the GSMA, and is delivered to its members (including Telecom Operators).

The User Agent is more complicated, as it contains entangled information; parsing the User Agent allows one to identify, notably, the browser, OS and type of connection that have been used. Still, it requires no additional information, only a good content analysis and a solid set of coding rules.

The combination of these four items creates a unique identifier, which is not specifically related to given users, but creates homogeneous groups, sharing similar technical conditions (hardware, software, network). Each group will then receive contents adapted to their specific conditions, thereby optimizing their browsing experience and consequently increasing engagement.
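The grouping logic above can be sketched in a few lines. This is a minimal illustration, not a production parser: the field values, the tiny TAC table (the real master database comes from the GSMA) and the naive User Agent rules are all invented for the example.

```python
# Hypothetical sketch: derive a technical-profile group key from two log fields.
# The TAC table and the User Agent rules are illustrative assumptions,
# not the actual GSMA database or a real rule set.

TAC_TO_MODEL = {            # normally sourced from the GSMA master database
    "35332108": "Nokia Lumia 920",
    "01326300": "Apple iPhone 5",
}

def tac_from_imei(imei: str) -> str:
    """The TAC is simply the first 8 digits of the IMEI."""
    return imei[:8]

def parse_user_agent(ua: str) -> tuple:
    """Very naive browser/OS extraction; real parsing needs solid coding rules."""
    browser = "Safari" if "Safari" in ua else "Other"
    os = "iOS" if "iPhone OS" in ua else "Other"
    return browser, os

def group_key(imei: str, ua: str, connection: str) -> str:
    """Combine device model, browser, OS and network into one group identifier."""
    model = TAC_TO_MODEL.get(tac_from_imei(imei), "Unknown device")
    browser, os = parse_user_agent(ua)
    # The combination defines a homogeneous technical group, not a user.
    return "|".join((model, browser, os, connection))

print(group_key("013263001234567",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0) Safari", "3G"))
# → Apple iPhone 5|Safari|iOS|3G
```

Sessions sharing the same key then receive content adapted to the same technical conditions, without any user-level identifier being involved.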

As this blog is aiming at a large public, I chose to keep a rather simple example. Of course, should the matter be more intricate, the skills I have built up over my years of experience in Data Management will even be more valuable. Feel free to ask more about patterning or other fields of Data Elicitation, I shall be glad to elaborate customized solutions for your business.

Data Elicitation in three steps (2/3): Data Enrichment

The second step of elicitation is enrichment. Once your data have been patterned, you have to design their look and feel. This is what data enrichment is about.

Again key questions in this area:

  • How do I qualify my data? → a categorization scheme is key to facilitate relevant data extraction for future analysis
  • What may be missing or on the contrary is uselessly kept? → choosing one’s data set is necessary for correct data acquisition
  • What type of addition is relevant and really useful? → existing information may not be sufficient for conducting marketing studies
  • How do I acquire additional attributes at an optimum price? → collect, derive or generate additional data at proper cost

No question, your data are rich, especially if you can use them easily thanks to an appropriate patterning. But they certainly can be richer. Much richer. And there are hundreds of ways to enrich data, but only two dimensions to consider, quantity and quality.

You may have tons of data, and still this may not fit your purposes. Or, on the contrary, have scarce resources, but with a very high (and maybe hidden) value. Market Research companies used to call this data enrichment process “coding the dictionary”, a phrase showing the richness of the process, both on the quantity side (the number of words) and on the quality side (the clarity of the definitions). Getting the relevance out of the data is definitely a precious skill, and one of my own key proficiencies.

I shall definitely develop both aspects of data enrichment in future posts, but I wanted to cover them briefly in this introduction.

1. Quantity

One always seems to be missing data. More e-mails for more direct marketing contacts, more socio-demographics for a better segmentation, more inputs from the sales force for a more precise CRM, more, more, more…

As usual, this may be true. Or not! Is Facebook the best source for reaching a specific population? Surely not. For instance, should you want to reach people affected by albinism in North America, you would probably rather get in touch with NOAH. So, it depends on the purpose. And on your means to leverage a large amount of data.

Of course, I shall not dispute that a large database will give you more opportunities for reaching your targets. But you had better do it with the maximum level of quality. I shall cover topics such as coverage, census vs. sample and the long tail later on, as dealing with large databases is mostly a question of finding the right data at the right time.

2. Quality

Good quality is the heir of proper patterning. And quality is always the key to an efficient database. The specificity of quality improvement is that it involves all records, old and new. Unlike quantity (adding new records is an ongoing task; you seldom look back on past data), quality always requires a look ahead AND behind. Adding a new feature, adjusting existing attributes to new constraints, redesigning existing concepts: all this implies a full database review.

I shall cover methods and tips for improving one’s database also in the future. Still the best piece of advice I can give is simple: think twice before starting. I have added below a simple example about the long tail of the internet.

Long Tail

This chart shows the top 1,000 websites, ranked by their visits over a given time period, using their share of visits. The metric itself is not what matters; the data distribution is.

The top 50 websites (5% of the total records), all very well known, allow a fair coverage of the activity, i.e. roughly 60% of the visits. So with a small, highly recognizable set of data, we could get a good understanding of the activity. Good for global strategy and high-level analysis.

Still, on the other hand, the bottom 500 websites account for roughly 1.5% of the visits. Too costly to reach for if you operate at a global strategy level, but of the highest interest if you are searching for a niche or a specific type of target audience.
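The shape of such a long tail is easy to reproduce with a toy distribution. The sketch below uses a Zipf-like weighting over 1,000 ranked sites; the exponent is an arbitrary assumption, so only the general shape (a dominant head, a thin tail) should be read into it, not the exact percentages.

```python
# Toy illustration of a long-tail distribution over 1,000 ranked websites.
# The Zipf weighting (s = 1) is an arbitrary modelling assumption; real
# traffic data would replace these synthetic weights.

N = 1000
weights = [1 / rank for rank in range(1, N + 1)]   # Zipf-like decay
total = sum(weights)
shares = [w / total for w in weights]              # share of visits per site

top50 = sum(shares[:50])        # head: the 50 best-known sites
bottom500 = sum(shares[500:])   # tail: the 500 least-visited sites

print(f"Top 50 sites:     {top50:.1%} of visits")
print(f"Bottom 500 sites: {bottom500:.1%} of visits")
```

With this particular exponent the top 50 happen to land close to the 60% share quoted above, while the synthetic tail is fatter than the observed 1.5%; the point is the asymmetry, not the exact figures.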

There is no point in choosing between a small database of high quality and an extra-large one with high sparsity. Again, the point is to have the database calibrated for your needs. And then enrich it wisely. And you know it by now, I have already told you: this is exactly what Data Elicitation is about!

Giving up cookies for a new internet… The third age of targeting is at your door.

While preparing next week’s Measure Camp in London, I have been wondering what would be the most interesting topic in my eyes. And my question is: “How would Web Analytics work without cookies?”

Actually, last September, I read an interesting post by Laurie Sullivan: “Where The Next Ad-Targeting Technology Might Come From“. It has been at the core of my thoughts for the past months, so I wanted to elaborate on Laurie’s post and introduce my own ideas on this topic.

I personally believe that collecting information from web users through cookies is fading and soon to disappear. There are many reasons for this, including user privacy concerns, the lack of contextuality of the cookie, and the proliferation of access points and devices, all of which make such data collection highly unreliable.

The disappearance of cookies would have an impact on at least three areas: data collection, targeting and analytics.

  • Data collection is highly based on cookies, especially when dealing with ad exposure and browsing habits. High impact.
  • Targeting is also based on cookies, as most tools use history to handle their most likely customers. High impact.
  • Analytics are also using cookies, especially for site-centric analysis as well as various page-level analysis. High impact.

Considering these high impacts, the time has come for more contextual and more behavioral targeting. We are now entering the third age of targeting. The first age was based on sociodemographics, widely used by TV ads or direct mail. The second age has been based on using past behavior to predict potential future actions and, on the internet, widely uses cookies to pursue this goal. The third age will be the age of context: targeting anonymous users with current common interests.

How will it work? One possible way: use network log files (provided by ISPs or telcos) to collect data, organize these data with a categorization at various levels and across multiple dimensions so as to generate rich, homogeneous user clusters, and hence allow targeting of potential customers based on ad-hoc inputs. I shall elaborate in further posts, especially regarding the process, but the main advantage is the respect of privacy, notably thanks to cookie avoidance…
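A minimal sketch of this cookie-free idea, with everything invented for illustration (the category table, the log format, the session identifiers): categorize visited hosts, then group anonymous sessions by their dominant interest.

```python
# Sketch: contextual clustering from network log lines, without cookies.
# The category map, log format and session ids are illustrative assumptions.

from collections import Counter, defaultdict

URL_CATEGORIES = {                 # assumption: a curated categorization table
    "news.example.com": "news",
    "kicker.example.com": "sport",
    "goal.example.com": "sport",
}

# (anonymous session id, visited host) pairs extracted from ISP log files
log = [
    ("s1", "news.example.com"), ("s1", "kicker.example.com"),
    ("s1", "goal.example.com"), ("s2", "news.example.com"),
]

# Build a per-session interest profile (a count of visited categories).
profiles = defaultdict(Counter)
for session, host in log:
    profiles[session][URL_CATEGORIES.get(host, "other")] += 1

# Cluster sessions by dominant category: no cookie, no personal identifier.
clusters = defaultdict(list)
for session, counts in profiles.items():
    clusters[counts.most_common(1)[0][0]].append(session)

print(dict(clusters))   # → {'sport': ['s1'], 'news': ['s2']}
```

A real implementation would of course use a far richer categorization, several dimensions at once, and a proper clustering algorithm; the privacy property, though, is already visible here: only anonymous session groups ever exist.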


So, yes, giving up cookies may be difficult; this is why I believe we ought to prepare to go on a diet as of today…

And act for alternative methodologies instead of shouting “me want cookies!”

Data entry in 2014: the vivid learnings of card punching

Pondering data quality checks? Considering elaborated, automated (and expensive) schemes? Let me suggest card punching.

Punch cards? Depending on your age, you will either take me for a fool (meaning you know what a punch card is…), or simply ask “what is this?”. Let’s start with a short presentation. Basically, a punch card looks like this:


This type of card was used to record processing instructions as early as the eighteenth century, especially in highly standardized industries such as textiles; from a more romantic standpoint, it is also the basic material for barrel organs… It was then used mostly for data storage from the middle of the twentieth century, and it was one of the key drivers of IBM’s success in the early days of Data Processing.

But punch cards have not been used since the late eighties… So, in this era of highly digitalized working environments, what can we learn from such “analog” tools?

Actually, the cards themselves are not the important thing; it is how they were used, and what we may re-use nowadays. The punching system required at least three major conditions to work, each reinforcing the quality of the whole process: standardization, precise instructions and double data entry. Let me explain each in a few words.

  • Standardization was a prerequisite to implementing punch cards, as they had to be readable from one machine to another… IBM built their worldwide success on their 80-character column standard (which was also the width of a terminal screen…). Standardization is still an asset for Data Entry, especially to provide a homogeneous frame to any operator (or operating system) performing data entry into your systems. A well-designed data entry system endorsed by the whole company is already a very good step towards data quality.
  • Precise instructions are needed to ensure your processing flows smoothly, as one cannot afford to have two people understand a process differently, even slightly. When given multiple choices, the operator has to know what to do in ALL the potential options, so that no human factor can compromise quality. This is the step where machines do better, provided they do not have to do too much interpretation. For instance, reading an image and entering the data as a numeric or alphanumeric code is still quite difficult nowadays, even though the best engineers are working on it (see this Google project about cracking House Number Id’s in Google Street View).
  • Double data entry is the key quality control when talking about punch cards. The puncher/checker duet (in French, we call this “perfo/vérif”) was the most efficient way to ensure correct data entry in the past, as those small holes in the card were not self-explanatory, and mis-punching was easy. So double data entry was the best way to guarantee satisfactory levels of data quality, at least in a standard environment, with regular levels of investment (some automated systems, especially those using the latest neural network techniques, may be more efficient, but are definitely more costly to implement…).
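The standardization point above can be sketched as a fixed-width record layout, the direct descendant of the 80-column card: every operator and every machine reads the same columns the same way. The field layout below is an invented example, not an actual IBM card format.

```python
# Sketch of 80-column standardization: a fixed-width record layout that any
# operator or machine reads identically. The fields are invented for the example.

LAYOUT = [("name", 0, 20), ("birth_year", 20, 24), ("country", 24, 26)]

def parse_record(line: str) -> dict:
    """Slice one 80-character record into named fields."""
    if len(line) != 80:
        raise ValueError("record must be exactly 80 characters wide")
    return {field: line[start:end].strip() for field, start, end in LAYOUT}

record = ("DUPONT JEAN".ljust(20) + "1971" + "FR").ljust(80)
print(parse_record(record))
# → {'name': 'DUPONT JEAN', 'birth_year': '1971', 'country': 'FR'}
```

Rejecting anything that is not exactly 80 characters wide is the whole point: the standard itself does part of the quality control.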

Let me elaborate on Double Data Entry (DDE); DDE is still a very efficient way of improving quality. The chart below sums it up clearly:

DDE Error Rate

The percentage of recorded errors falls dramatically when two people run the same process in parallel and then compare their results. Similar rates are reached when running the process in sequence, i.e. when someone checks the outputs of another (both methods are valid; the latter requires supervision, i.e. a different human relationship between the two data entry operators…).
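The comparison step at the heart of DDE can be sketched in a few lines: two operators key the same records independently, and only the records where both passes agree are accepted, the rest being flagged for a third check. The sample entries are invented for illustration.

```python
# Minimal sketch of Double Data Entry reconciliation: flag every record
# where the two independent passes disagree. Sample data are invented.

def dde_compare(pass_a: list, pass_b: list) -> list:
    """Return the indices of records where the two entries disagree."""
    return [i for i, (a, b) in enumerate(zip(pass_a, pass_b)) if a != b]

operator_1 = ["4711", "0815", "1234", "9999"]
operator_2 = ["4711", "0816", "1234", "9999"]   # one mis-keyed record

discrepancies = dde_compare(operator_1, operator_2)
print(discrepancies)   # → [1]
```

The same function covers both variants described above: run in parallel (compare at the end) or in sequence (the checker's pass is simply `pass_b`).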

I understand that these statistics were collected ages ago, in times when the machine was good enough to be a repository but not an operator itself. Still, I strongly believe that the simplest methods should still be taken into account, at least where quality (customer satisfaction) is preferred over quantity (lowest costs). And even if one is keen on processing quantitatively, one cannot keep one’s customers in the long term without a decent (i.e. high) level of quality…

So, on top of more recent data processing methods, there is a lot to learn from card-punching working methods. And this may certainly apply widely to your business… Should you have quality issues, you would certainly want to look into implementing such sound and simple working methods on top of your existing QC. And I would be glad to help you assess them.

Finally, there is still a way to use punch cards, albeit in a humorous mode… Maybe some schizophrenic geek will love this way of googling: Punch Card Google… Still, the request processing looked a bit fast compared to my recollection 😉

Data Elicitation : my professional new start in 2014

As you could read last week in Revival of a digital non-native, I am now more qualified than ever in Digital Analytics, ready to write the first pages of my professional new start.

It has been very nice to receive so much positive feedback and to see the concrete interest my last post has aroused. As announced last week, I shall now elaborate on what I am up to. This post defines the core business of Data Elicitation. Further posts (in a series of three) will give much more detail about specific contributions closely linked with my own proficiency, answering concerns expressed by marketers, notably through this study by StrongView (2014 Marketing Survey). The key areas are:

1. Data patterning

The original sin of Big Data is its formlessness. So as to be able to use these data, and get the best out of them, one must organize and structure them first. This is what patterning is about.

Of course, your engineers will claim they have built the best database ever, and that it should answer any question you have. This may be true. Or not. Actually, many databases are built under technical constraints, with very little regard for usage and user experience, let alone marketing and strategy needs. My own experience testifies that an efficient use of data is built first upon a correct understanding of the client’s requests, i.e. the initial step is not drawing the plan, but thinking about how it would best fit its goal. This has always been a key driver of my action, especially when building up various new services in the marketing information business. I am a resolute data patterner.

2. Data enrichment

Your data are rich, especially if you can use them easily thanks to an appropriate patterning. But they certainly can be richer. Much richer. And most certainly not at high costs. This is what enrichment is all about.

You may have tons of data, and still this may not fit your purposes. Or, on the contrary, have small databases, but with a very high (and maybe hidden) value. And enriching is not only adding external information; it is also deriving, translating and cross-checking existing sources. Market Research companies used to call this data enrichment process “coding the dictionary”, a phrase showing the vastness and complexity of the process. Getting the relevance out of the data is definitely a precious skill, and one of my own key proficiencies.

3. Data analytics

Now, your data are accessible and usable. Fine. And what next? Getting the best out of your data is not always easy, as the meaning may either be blurred or the solution to the problem lost as a needle in a haystack. This is what analytics are about.

And once your data are fit for use, you need to find the proper tactics and strategy to reach your goal, i.e. get them to talk and find the solution to your issues or validate your assumptions. This requires a fair analytic technique, but also a good flair for identifying where the gems are hidden… In this respect, as a seasoned market research expert with a solid digital background, I shall help you identify where to dig to get the best out of your data.

So in the end, this whole process of patterning, enriching and analyzing data may be summarized under one single word: elicitation. I have chosen Data Elicitation as an umbrella designation for running all these processes and bringing them together as a service.

On a practical level, my door remains open to any CEO who would require my exclusive working force to set up their data marketing corporate strategy (i.e. hire me). Still, current market conditions, notably in France, imply that flexibility is key, especially in the light of project-driven action. This is why I also offer my (invaluable) resource as a contractor. So? Drowning in data? Or searching for them desperately? And in need of elicitation? Let us keep in touch and let 2014 be the year of your ultimate data elicitation!

Generation C: being forty-something may be trendy again…

Generation C. Gen C. A new buzzword, announcing the downfall of the Generation Y (which itself outdated Generation X a few years ago…). And definitely, the sign that, in our digital world, the age barrier, along with other old-fashioned sociodemographic factors, is tumbling down.

What are we talking about? The C stands for “connected”, but also for “creative”, “communicative” and “collaborative”, and its activities are driven by two other c’s, “content” and “cloud”. What is interesting in this new concept is that it is not related to the age of the user. Generations X and Y used to refer to educational schemes: X was driven by the upsurge of TV in our lives, through numerous channels and dedicated programs, while Y was linked to the penetration of computers into our homes, along with the spread of ever easier internet access and digitalized exchanges. For a parallel comparison of Gen Y and C, a very good read is “Gen C, Gen Y, Gen who?” by Jake Pearce.

Gen C is different. It is driven by usage, beyond age, education and culture. This implies at least two major consequences for the matters I like to deal with.

In the first place, this is another sign of the lack of relevancy of sociodemographics. It is now clear that the target of a marketing campaign cannot be defined solely on the basis of age, gender, geographical location or any other predetermined personal attribute. A target ought to be understood as a group of common interest, sharing the same habits, the same interests, the same fads… This is Gen C vs. X or Y. How we may achieve satisfying targeting is another (huge) question. Basically, the issue may be summarized as follows: “cookie or not cookie?” I shall deal with this matter in future posts, showing how a long-lived clustering and targeting methodology can be performed without cookies, especially in this era of acute privacy concerns.

In the second place, I believe that the emerging Generation C is a sign of a new deal in Human Resources Management in the marketing area, especially where digital and new technologies are concerned. I attended a Club Digital conference in Paris last Tuesday (Nov 5th), whose subject was “Recruitment and Career in the Digital Business: key factors of success” (more info in French on Twitter with keyword #DigitalFR). One of the key lessons is that an efficient digital strategy must be a mix of design and technique, of guidance and execution, of marketing and IT. So of course Digital Natives in their early twenties are badly needed, but on top of their knowledge, some additional experience is required… Guidance, organization, management: in a few words, alignment with the company’s strategic resources at a high level (marketing, human resources, finance, sales…) is also at stake. Understanding business challenges, deciphering client needs and organizing data management may require more than academic expertise. And for this, one needs seniority and some business acumen: in a few words, people who have already lived through successful challenges, as well as experienced failures that taught them lessons…

Thus, according to the members of the conference panel, forty-something Generation C managers are a scarce resource, as the need for digital resources is immense, and the share of experienced senior managers in this area very small. The Geek world is now ready to welcome Oldies Goldies: this is really good news!

With my Windows phone always connected to the world, in-depth expertise in big data management, and my digital experience, I should be considered a trendy forty-something manager. Good start, isn’t it?


Jumping over the data privacy fence to land on the optimization green grass? See the hurdle?

Back to school today. Back to work. But back to school also meant back to my desk and also to my usual good reads.

I read this post today, “Beyond Privacy Exploitation Lies Huge Opportunity For Personal Data Optimization” by Max Kalehoff on MediaPost, and I believe he moves a bit quickly from his own personal point of view to a more general standpoint. Nevertheless, it is a good starting point for a post about data collection management.

The general idea, i.e. that people could share more data for a more optimized benefit, is nice, but only the first part is convincing, i.e. that there is still a lot of unexploited data to share. As the post lacks a proposal on how one could “optimize” personal data, here are some thoughts I have had on this topic.

Max’s story is nice, but his reasoning by induction is not correct. Not everyone is ready to share really personal data, like weight or daily burnt calories… Actually, it is most probably the reverse… Such data would most probably be hidden as much as possible, and their distribution limited to the smallest number of people, i.e. to none but the data owner himself… A marketer may dream of openness in such a case, but unfortunately I think this is far from real life.

And still, should anyone be ready to share such very personal data, would the distribution list be very large? I doubt it… A few friends and relatives, some key helpers (a fitness coach, a Weight Watchers leader…) and maybe a doctor; that would be it. The “challenge” idea is typical of someone living well with his or her weight, and only willing to lose a few pounds for a healthier life. Not the majority of people dealing with overweight (or with any other personal problem, for that matter). Most probably a very narrow minority, I would guess.

But the idea remains of the highest interest: use passively collected data to create a huge database with a maximum of objectivity. This is daily routine for Market Researchers operating in Retail Panel Tracking, but very seldom achieved when it comes to people. Why? Because people are aware that they are being tracked, and this fact alone introduces a bias (I remember the famous bias of Consumer Panels, where no woman would ever buy any feminine hygiene item, and no man any condoms…). Where there is uneasiness or even shame, there will be no freely shared data. So what are we to do?

In this area, what is the absolute must? Not only the passivity of the data collection, but also the ignorance of the tracked people. The less they know they are tracked, the more objective the data. No question about it: a marketer’s dream come true. Of course, this is very cynical, and dangerous for any company acting this way, as no panelist would be asked for their consent (otherwise, obviously, they would be aware they are being tracked…). To refer to this post’s title, this would mean reducing the privacy fence to its lowest height, so as to be able to walk over it, without even jumping at all…

But this will not happen. The hurdle is high and will remain so. It is built of codes of conduct, regulations, even laws, and can hardly be ignored. So tracked people ought to know their data are collected. And objectivity will be lost. How do we get over this? Find people ready to share their data, bring them on board a panel with an opt-in clause, and run some analysis. A classical methodology to ensure privacy is not breached. The risk? A panel made up only of clones of Max Kalehoff, introducing another bias: the will to be tracked…

So, is there no way to get over the fence? There are some, of course. Cookies on the web are probably the most famous of all. Every website uses some. And they are (very) intrusive. And some people have started to complain, and even to request that they be blocked or erased. A solution will be required here too, for user targeting and browsing behavior optimization.

So Max, I do not have the solution either. Not yet… But I’m working on it.