The blind men of the parable went on debating whether the elephant was like a winnowing basket (ear), a plowshare (tusk), a plow (trunk), or a pillar (foot), each having touched only one part of the animal. Instead of fighting for their own beliefs, they could have tried to build an image of the animal by combining all their data. That’s data-linking, of course. We have become smarter: we now gather information from different sources to create a new and richer dataset, which is of tremendous help in policy-making.
Linkages among medical databases, such as electronic health records (EHRs) across multiple healthcare settings, health insurance claims and patient-generated data, are becoming increasingly important in delivering high-quality health care in different countries. Linking databases of Aadhaar, Permanent Account Number, Voter ID, bank accounts and mobile numbers, for example, has dominated our lives in the recent past. There has been a paradigm shift in census methodology over the last two decades or so, especially in Europe, where traditional questionnaire-based censuses have been replaced by register-based ones. In fact, data from population registers, municipalities, job files, social security administrations, pension and life insurance benefits, schools, different types of tax records, and health and hospital records are now assembled to prepare censuses in many places. In India, the home minister recently asked officials to frame a plan to combine databases such as Aadhaar and voter cards with the upcoming 2021 census database.
Linking big datasets from various sources and registers is a daunting task. However, large-scale data-linking is not new; it has been performed successfully elsewhere. For example, in 2010, Luxembourg-based Eurostat launched a micro data-linking project in European business statistics with 10 participating countries. The aim was to link data from the International sourcing survey with structural business statistics and international trade in goods statistics.
The objective of data integration is to identify and combine information from corresponding records in different source datasets. Standard software packages, such as the R RecordLinkage package, Merge ToolBox, Link Plus and the Stanford Entity Resolution Framework (SERF), are available for data linking. However, it’s certainly not easy to combine different registers concerning 1.35 billion people. Even the name of a person might have been recorded differently in different registers. Are Debdulal Banerjee, Deb Dulal Benerjee, Deb Dulal Bannerjee, Debu Bandyopadhyay and D Bandyopadhyay, all having the same date of birth, the same person? There might have been typographical errors in some registers, but these might also be different individuals. There might also be various kinds of missingness in the data, and the way missing data are handled may affect the performance and reliability of the linkage approach. These are issues that must be treated with utmost care.
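To see why exact matching on names breaks down, here is a rough sketch in Python; the similarity measure and the 0.8 cut-off are chosen purely for illustration, and real linkage software relies on more refined string comparators (Jaro-Winkler, edit distance, phonetic codes).

```python
from difflib import SequenceMatcher

# The name variants discussed above, recorded differently across registers.
names = [
    "Debdulal Banerjee",
    "Deb Dulal Benerjee",
    "Deb Dulal Bannerjee",
    "Debu Bandyopadhyay",
    "D Bandyopadhyay",
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

reference = names[0]
for candidate in names[1:]:
    score = similarity(reference, candidate)
    # Exact matching would demand score == 1.0 and reject every variant;
    # a fuzzy rule accepts anything above an (assumed) threshold of 0.8.
    verdict = "possible match" if score > 0.8 else "no match"
    print(f"{reference} vs {candidate}: {score:.2f} ({verdict})")
```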
Typically, while linking two datasets, a record linkage algorithm first fixes a set of linking variables, called identifiers (such as name and address), that are common to both datasets. A numerical weight is then calculated for each compared pair of records, reflecting the degree of confidence that the pair represents the same person or entity. Pairs with weights exceeding some pre-specified matching threshold are declared to be matches.
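A minimal sketch of this weight-and-threshold idea follows; the records, per-identifier weights and cut-off are invented purely for illustration.

```python
# Hypothetical records from two registers; field names and values are assumed.
record_a = {"name": "Debdulal Banerjee", "dob": "1970-05-12", "pin": "700108"}
record_b = {"name": "Deb Dulal Banerjee", "dob": "1970-05-12", "pin": "700108"}

# Illustrative per-identifier weights: date of birth counts for more than name.
WEIGHTS = {"name": 1.0, "dob": 3.0, "pin": 2.0}
THRESHOLD = 4.0  # assumed cut-off; in practice set from training data or review

def pair_weight(a: dict, b: dict) -> float:
    """Sum the weights of the identifiers on which the two records agree exactly."""
    return sum(w for field, w in WEIGHTS.items() if a.get(field) == b.get(field))

score = pair_weight(record_a, record_b)
print("match" if score >= THRESHOLD else "non-match", score)
```

Here the pair is declared a match because the date of birth and PIN code agree, even though the name has been recorded slightly differently.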
Several techniques are available for this purpose. For example, a “deterministic” algorithm checks whether records agree or disagree on a given set of identifiers in an “all-or-nothing” fashion, either in a single step or in multiple steps. An “exact deterministic” linkage requires an exact match on all identifiers, while an “approximate or iterative deterministic” linkage requires an exact match in one of several rounds of matching, but not on all possible identifiers. For example, a procedure may rely on matching two or more of the following identifiers: year of birth, day of birth, middle initial and gender.
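Such an approximate deterministic rule can be sketched as follows, using the two-out-of-four criterion mentioned above; the field names and sample values are assumed for illustration.

```python
# Declare a match if at least two of these identifiers agree exactly.
IDENTIFIERS = ["birth_year", "birth_day", "middle_initial", "gender"]

def approximate_deterministic_match(a: dict, b: dict, min_agreements: int = 2) -> bool:
    """Count identifiers present in both records that agree exactly."""
    agreements = sum(
        1 for f in IDENTIFIERS
        if a.get(f) is not None and a.get(f) == b.get(f)
    )
    return agreements >= min_agreements

a = {"birth_year": 1970, "birth_day": 12, "middle_initial": "D", "gender": "M"}
b = {"birth_year": 1970, "birth_day": 12, "middle_initial": None, "gender": "M"}
print(approximate_deterministic_match(a, b))  # True: three identifiers agree
```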
However, it is likely that certain identifiers (like date of birth) have more discriminatory power than others. More complicated “probabilistic” strategies assess the discriminatory power of each identifier, as well as the likelihood that two records belong to the same person, based on whether they agree or disagree on the different identifiers. Matching probabilities may be obtained exogenously, reflecting past experience or expert opinion (as in the Fellegi-Sunter approach), or calculated endogenously (for example, via the expectation-maximisation algorithm). More sophisticated methods incorporate a “deterministic” component into the “probabilistic” strategy; such a combined strategy may be characterised as a “hybrid probabilistic-deterministic” method.
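The Fellegi-Sunter idea can be sketched roughly as follows. For each identifier, m is the probability that it agrees when the two records truly belong to the same person, and u the probability that it agrees by coincidence; the values used here are invented for illustration.

```python
import math

# Assumed m- and u-probabilities for each identifier.
PROBS = {
    "date_of_birth": (0.98, 0.01),  # rarely agrees by accident: highly discriminating
    "gender":        (0.99, 0.50),  # agrees by chance half the time: weak evidence
    "surname":       (0.90, 0.05),
}

def fellegi_sunter_score(agreements: dict) -> float:
    """Sum log-likelihood-ratio weights; agreements maps identifier -> True/False."""
    score = 0.0
    for field, agrees in agreements.items():
        m, u = PROBS[field]
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score

# Agreement on date of birth and surname, disagreement on gender.
print(round(fellegi_sunter_score(
    {"date_of_birth": True, "gender": False, "surname": True}), 2))
```

Pairs whose total score exceeds an upper threshold are treated as links, those below a lower threshold as non-links, and the ones in between are typically sent for clerical review.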
In any case, the procedure to be applied should be “iterative” in nature. Moreover, the performance of the data-linking procedure should be properly verified by manually checking a random sample of the linked data. Matching errors will certainly occur. The errors made in pairing records from two registers are of two types: false positives (records linked although they belong to different people) and false negatives (records of the same person that fail to be linked). The data scientists involved need to devise effective algorithms, depending on the nature of the registers to be linked, to ensure that the data-linking is smooth, sufficiently accurate and, of course, hassle-free for the common people. There should also be a clear guideline for dealing with these errors subsequently.
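A back-of-the-envelope sketch of such a verification step, with invented review counts:

```python
# Invented counts from a hypothetical clerical review of random samples.
reviewed_links = 1000      # linked pairs drawn at random and checked manually
wrong_links = 23           # of these, pairs that turned out to be different people
reviewed_non_links = 1000  # unlinked pairs drawn at random and checked manually
missed_links = 41          # of these, pairs that should have been linked

false_positive_share = wrong_links / reviewed_links
false_negative_share = missed_links / reviewed_non_links
print(f"share of reviewed links that are wrong (false positives): {false_positive_share:.1%}")
print(f"share of reviewed non-links that were missed (false negatives): {false_negative_share:.1%}")
```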
Certainly, it’s a mammoth task to build up the “elephant” by correctly assembling the different “registers”. However, once the complete elephant is formed, it could open up opportunities for more ambitious and wide-ranging policy-making and research; there is no doubt about that.
The writer is a professor of Statistics at the Indian Statistical Institute, Kolkata