
Data entry in 2014: the vivid learnings of card punching

Pondering data quality checks? Considering elaborate, automated (and expensive) schemes? Let me suggest card punching.

Punch cards? Depending on your age, you will either take me for a fool (meaning you know what a punch card is…), or simply ask “what is this?”. Let’s start with a short presentation. Basically, a punch card looks like this:

[Image: a punch card]

This type of card has been used to record processing instructions since the early eighteenth century, especially in highly standardized industries such as textiles; from a more romantic standpoint, it is also the basic material for barrel organs… From the middle of the twentieth century it was used mainly for data storage, and it was one of the key drivers of IBM’s success in the early days of Data Processing.

But punch cards have not been used since the late eighties… So, in this era of highly digitalized working environments, what are the learnings of such “analog” tools?

Actually, the cards themselves are not the important thing; what matters is how they were used, and what we may re-use nowadays. The punching system required at least three major conditions to work, each of them reinforcing the quality of the whole process: standardization, precise instructions and double data entry. Let me explain each in a few words.

  • Standardization has been a prerequisite for implementing punch cards, as they had to be readable from one machine to another… IBM built their worldwide success on their 80-column standard, one character per column (which was also the width of a terminal screen…). Standardization is still an asset for Data Entry, especially to provide a homogeneous frame to any operator (or operating system) performing data entry into your systems. A well-designed data entry system endorsed by the whole company is already a very good step towards data quality.
  • Precise instructions are needed to ensure your process flows smoothly, as one cannot afford to have two people understand a process differently, even slightly. When given multiple choices, the operator has to know what to do in ALL the potential options, so that no human factor can compromise quality (see the sketch after this list). This is the step where machines are better, provided they do not have to do too much interpretation. For instance, reading an image and entering the data as a numeric or alphanumeric code is still quite difficult nowadays, even though the best engineers are working on it (see this Google project about cracking House Number IDs in Google Street View).
  • Double data entry is the key quality control when talking about punch cards. The puncher/checker duet (in French we call it “perfo/vérif”) was the most efficient way to ensure correct data entry in the past, as those small holes in the card were not self-explanatory and mis-punching was easy. Double data entry has therefore been the best way to guarantee satisfactory levels of data quality, at least in a standard environment with regular levels of investment (some automated systems, especially those using the latest neural network techniques, may be more efficient, but they are definitely more costly to implement…).
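
To make the standardization and precise-instruction points more concrete, here is a minimal Python sketch. The 80-column layout, field names and rules below are hypothetical, not taken from any actual punch-card system: a fixed-width record is cut into named fields, and validation rules define an explicit outcome for every possible case, so nothing is left to operator judgment.

```python
# Minimal sketch: a fixed-width, 80-character record layout with exhaustive
# validation rules. Layout and rules are illustrative only.

FIELD_LAYOUT = {             # field name -> (start, end) column positions
    "customer_id": (0, 8),
    "order_date":  (8, 16),  # YYYYMMDD
    "quantity":    (16, 22),
    "comment":     (22, 80),
}

def parse_record(line: str) -> dict:
    """Cut an 80-character line into named fields (standardization)."""
    if len(line) != 80:
        raise ValueError(f"record must be exactly 80 characters, got {len(line)}")
    return {name: line[start:end].strip() for name, (start, end) in FIELD_LAYOUT.items()}

def validate_record(fields: dict) -> list:
    """Apply rules covering every case (precise instructions); return the list of errors."""
    errors = []
    if not (fields["customer_id"].isdigit() and len(fields["customer_id"]) == 8):
        errors.append("customer_id must be exactly 8 digits")
    if not (fields["order_date"].isdigit() and len(fields["order_date"]) == 8):
        errors.append("order_date must be YYYYMMDD")
    if not fields["quantity"].isdigit():
        errors.append("quantity must be a non-negative integer")
    return errors  # empty list means the record is accepted

# Usage: every line is either accepted or rejected with explicit reasons.
line = "00012345" + "20140115" + "000003" + "first order".ljust(58)
fields = parse_record(line)
print(validate_record(fields) or "record accepted")
```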

Let me elaborate a little more on Double Data Entry (DDE); DDE still is a very efficient way of improving quality. The chart below sums it up clearly:

[Chart: DDE error rate comparison]

The percentage of recorded errors falls dramatically when two people run the same process in parallel and then compare their results. Similar rates are reached when running the process in sequence, i.e. when someone checks the output of another (both methods are valid, the latter requiring supervision, i.e. a different human relationship between the two data entry operators…).
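
As a small illustration of the parallel variant, here is a hedged Python sketch (record contents and field names are made up for the example): two independently keyed batches are compared field by field, and only the disagreements are sent back for arbitration.

```python
# Minimal sketch of parallel double data entry: two operators key the same
# records independently; mismatching fields are flagged for arbitration.
# Records and field names are illustrative only.

operator_a = [
    {"customer_id": "00012345", "quantity": "3"},
    {"customer_id": "00067890", "quantity": "12"},
]
operator_b = [
    {"customer_id": "00012345", "quantity": "3"},
    {"customer_id": "00067891", "quantity": "12"},  # one key-stroke differs
]

def compare_entries(batch_a, batch_b):
    """Return (record index, field, value_a, value_b) for every disagreement."""
    mismatches = []
    for idx, (rec_a, rec_b) in enumerate(zip(batch_a, batch_b)):
        for field in rec_a:
            if rec_a[field] != rec_b[field]:
                mismatches.append((idx, field, rec_a[field], rec_b[field]))
    return mismatches

mismatches = compare_entries(operator_a, operator_b)
total_fields = sum(len(rec) for rec in operator_a)
print(f"{len(mismatches)} disagreement(s) out of {total_fields} fields "
      f"({len(mismatches) / total_fields:.1%}) to arbitrate")
for idx, field, a, b in mismatches:
    print(f"record {idx}, field '{field}': operator A keyed {a!r}, operator B keyed {b!r}")
```

Only the disagreeing fields need a third look, which is what keeps the residual error rate low without doubling the correction workload.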

I understand that these statistics were collected ages ago, at a time when the machine was good enough to be a repository but not an operator itself. But I strongly believe that the simplest methods are still worth considering, at least where quality (customer satisfaction) is preferred over quantity (lowest costs). And even if one is rather keen on processing quantitatively, one cannot keep one’s customers over the long term without a decent (i.e. high) level of quality…

So, on top of more recent data processing methods, there is a lot to learn from the card punching working methods. And this may well apply widely to your business… Should you have quality issues, you would certainly want to look into implementing such sound and simple working methods on top of your existing QC. And I would be glad to help you assess them.

Finally, there is still a way to use punch cards, even if only in a humorous mode… Maybe some schizophrenic geek will love this way of googling: Punch Card Google… Still, the request processing looked a bit fast compared to my recollection 😉

“Big Data”: new frontier or black hole?

Stéphane Richard (CEO of Orange) at the Avignon Forum in 2012: “Big Data is the business of personal data, and it is frightening” (in French: “Le Big Data, c’est le commerce des données personnelles et c’est effrayant”)

In my eyes, there is much more than a threat in the future of Big Data. First, let us ask the relevant question: what is the definition of “Big Data”? A few hints, picked from around the web:

On Wikipedia, the definition says that “Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” Simple.

Too simple. As mentioned on the Mike2.0 wiki, “not all large datasets are big”. Certainly, some small datasets may be considered “big” because they are complex, and some large datasets are not “big” because their structure is well known and easy to handle. But still, complexity may look different from one point of view to another (for my part, I do consider mobile internet data to be “big”, whereas Mike2.0 only considers them “large”)… For reference, the link is here: http://mike2.openmethodology.org/wiki/Big_Data_Definition

More elaborate, Gartner’s definition (from their IT Glossary) says that “Big Data in general is defined as high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”. Similarly, IBM says that “Big Data spans four dimensions: Volume, Velocity, Variety and Veracity”. The 4 V’s. Somewhat of a marketing gimmick, but not so bad after all…

When looking into definitions that are more Digital-Analytics-oriented, I will stick with Avinash Kaushik’s definition: “Big Data is the collection of massive databases of structured and unstructured data”. So basically, a promise of some bright analytics that will be hard to find: the classic needle in a haystack, or more exactly, a jewel among tons of rock.

My own definition will then be a bit more provocative: “Big Data is a set of data that is too big for any current processing capacity”. Let me elaborate.

From the start, my career has mostly revolved around the use of Retail Tracking Data. In this area, bimonthly manual collection of purchases and inventories was the norm at the end of the eighties. And then came the first Big Data rupture, i.e. the introduction of weekly electronic data collection, using EAN/UPC-based datasets. 1,000 times more data points. Big data by early-nineties standards. Small beer twenty years later.

Similarly, when the same weekly electronic data collection – still based on samples – switched to a daily census at the end of the nineties, data volumes were multiplied again by a factor of more than 100. Big data again. Now common for any Retail Panel service.

Again, when the same data collections were made available as transactional data records, showing all possible disaggregated data points – especially thanks to the upsurge of retailer loyalty cards – data volumes were multiplied by a factor of 1,000 once more. Big data yet again. Now about to be handled more or less properly by Market Research companies. Awaiting the next big data frontier?
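
Putting these successive ruptures together gives a sense of scale. The short calculation below is a rough, illustrative order-of-magnitude exercise based only on the factors quoted above, not on actual panel figures:

```python
# Rough order-of-magnitude exercise: compound the growth factors quoted above.
# Baseline and factors are illustrative, not actual panel figures.

baseline_points = 1  # bimonthly manual collection, late eighties
ruptures = {
    "weekly electronic collection (EAN/UPC), early nineties": 1_000,
    "daily census instead of samples, late nineties":         100,
    "transactional / loyalty-card records, 2000s":            1_000,
}

volume = baseline_points
for step, factor in ruptures.items():
    volume *= factor
    print(f"{step}: x{factor:>5} -> {volume:,} x the original volume")
# Overall: roughly a hundred-million-fold increase in data points in about 25 years.
```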

So, definitely, data that are called “big” today sit at the edge of our current ability to handle such data sets. Tomorrow, other data sets will take over this “big” status, maybe with the addition of geo-location information or other causal data (the digital journey, for instance).

Ever more data for ever more information. Or the new frontier that leads to the black hole. Why? Because too much data may mean too many insights.

That is the drawback of big data. Too much data. Too many interesting things. Too many insights. The black hole of information, fully absorbing our capacity to catch new trends and key insights.

The bigger the data, the more complicated it is to extract the key information that will trigger new ideas, new business, new revenues. As mentioned in this blog post from Mediapost (http://www.mediapost.com/publications/article/191088/are-insights-cheap-commodities.html#axzz2IR4TeXCz), the key issue is no longer to find an insight, it is to find THE insight. We are no longer trying to break through the next frontier; we are trying to figure out in which direction we ought to search at all.

We have to do this. Quickly. Before the black hole of big data swallows what remains of the dimming light of key insights…

So, to close this blog post and start the discussion, here is a very interesting point of view from Jannis Kallinikos of the LSE: http://www.paristechreview.com/2012/11/16/allure-big-data/