Big Data & Probability Sampling – Crossing the Chasm

In this article our Managing Director, Carsten Broich, outlines how traditional probability sampling can be combined with Big Data to achieve efficiency gains without introducing bias or increased coverage error.


With mobile phone penetration increasing worldwide, social researchers are turning towards telephone interviewing and away from traditional face-to-face interviews. Mobile phone samples provide large coverage of a target population within a country but in many cases lack information such as location, age and gender. In traditional landline-only telephone interviewing, it was possible to predetermine a location based on the area or exchange code, which allowed stratification by location.

In the sampling frames of many developing countries, mobile phones already make up the larger share of the total telephone sample, and this share will most likely increase further. To support fieldwork agencies in their data collection, it can be advantageous to add further parameters to the plain cell phone record, which currently contains only the number itself, the cell phone provider and in some cases an activity flag.

Figure 1: Data flow for enhancing RDD frames with Big Data


This section outlines some of the possible applications.

1. Gender Matching

For social research in a specific part of the world, interviews are conducted by interviewers of the same gender as the respondent. If the genders do not match, an appointment is scheduled and the interview is conducted later by an interviewer of the same gender. Currently, around 25-30% of the actual cell phone sample can be enriched with a gender indication, so that no further gender matching is required for these records during fieldwork. Especially when the sample is released in replicates, this should introduce no further bias, as pre-screened and non-screened sample receive the same priority during fieldwork: each record is attempted until the maximum number of tries is reached, and the next replicate is opened only once the current replicate is complete.
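The routing described above can be sketched roughly as follows. This is a minimal illustration, not our production logic: the SampleRecord fields, the pool labels ("F", "M", "ANY") and the function names are all hypothetical. The key point is that release order depends only on the replicate number, so enriched and non-enriched records are treated with the same priority.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleRecord:
    phone: str
    replicate: int
    gender: Optional[str] = None  # "F", "M", or None when no gender could be matched


def interviewer_pool(record: SampleRecord) -> str:
    """Route enriched records to a same-gender interviewer pool.

    Records without a gender indication go to any interviewer and may
    still need an appointment if the genders turn out not to match.
    """
    return record.gender if record.gender in ("F", "M") else "ANY"


def release_order(records: List[SampleRecord]) -> List[SampleRecord]:
    """Order records by replicate only, never by enrichment status.

    sorted() is stable, so the original (random) order is kept within
    each replicate.
    """
    return sorted(records, key=lambda r: r.replicate)


# Hypothetical example records
records = [
    SampleRecord("+0000000001", replicate=2, gender="F"),
    SampleRecord("+0000000002", replicate=1),
    SampleRecord("+0000000003", replicate=1, gender="M"),
]
ordered = release_order(records)
pools = [interviewer_pool(r) for r in ordered]
```

In this sketch the gender flag only changes who dials a record, not when the record is dialled.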

2. Removing non-eligible respondents

Some studies target only people within a specific area of a country (for example, only the Flemish-speaking population of Belgium), a particular age group (e.g. elderly people aged 65+) or people with specific ethnic backgrounds. However, a landline or mobile RDD frame includes the entire landline/mobile population, so that a large share of respondents would be screened out because they are not in the desired target population. This can lead to increased frustration among interviewers as well as an increase in cost.

By blending Big Data with phone numbers, it is possible to screen out respondents before the actual fieldwork period. These respondents could be in a region outside the target area, have a different socio-cultural background or not match the age criteria. Because only respondents known to be outside the target group are removed, no extra coverage error is introduced.
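A minimal sketch of this removal step is shown below. The function name, the enrichment dictionary and the age attribute are hypothetical, and the sketch assumes the matched attributes are correct. Numbers that cannot be matched stay in the frame, which is why coverage of the target population is not reduced.

```python
from typing import Callable, Dict, List


def remove_known_ineligible(
    frame: List[str],
    enrichment: Dict[str, dict],
    is_eligible: Callable[[dict], bool],
) -> List[str]:
    """Drop only numbers that are matched AND known to be out of target.

    Unmatched numbers are kept and screened during fieldwork as usual.
    """
    kept = []
    for phone in frame:
        attrs = enrichment.get(phone)
        if attrs is None or is_eligible(attrs):
            kept.append(phone)
    return kept


# Hypothetical example: a study targeting people aged 65+
frame = ["A", "B", "C", "D"]
enrichment = {"A": {"age": 70}, "B": {"age": 30}}  # "C" and "D" are unmatched
cleaned = remove_known_ineligible(frame, enrichment, lambda a: a["age"] >= 65)
```

Here only "B" is removed, because it is the only number known to fall outside the target group.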

3. Pre-Screening based on Big Data

Application 2 could also be reversed, so that only people who match specific criteria are considered. The advantage is that segmentation becomes very precise and fieldwork very efficient; however, this comes at the cost of coverage. Not all respondents (phone numbers) can be matched with Big Data, so a coverage error is introduced.
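The trade-off can be made visible with a small sketch. As before, the function name, the enrichment dictionary and the age attribute are hypothetical. In contrast to Application 2, unmatched numbers are dropped here, and the share of unmatched numbers gives a rough indication of how much of the frame is no longer covered.

```python
from typing import Callable, Dict, List, Tuple


def prescreen(
    frame: List[str],
    enrichment: Dict[str, dict],
    is_eligible: Callable[[dict], bool],
) -> Tuple[List[str], float]:
    """Keep only numbers that are matched AND eligible.

    Returns the targeted sample and the share of the frame that was
    excluded solely because no match was available.
    """
    targeted = [p for p in frame if p in enrichment and is_eligible(enrichment[p])]
    unmatched = [p for p in frame if p not in enrichment]
    unmatched_share = len(unmatched) / len(frame) if frame else 0.0
    return targeted, unmatched_share


# Hypothetical example: a study targeting respondents aged 18-34
frame = ["A", "B", "C", "D"]
enrichment = {"A": {"age": 25}, "B": {"age": 50}}  # "C" and "D" are unmatched
targeted, unmatched_share = prescreen(frame, enrichment, lambda a: 18 <= a["age"] <= 34)
```

In this example only "A" is dialled, while half of the frame is excluded without anything being known about it.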

Experience and recent projects

Past projects that we worked on involved targeting 18-34 year old respondents in addition to a standard Dual-Frame RDD sample, as completing interviews for this age group was a challenge for our client. We also targeted specific areas within the Middle East with mobile phone sample, as landline frames offer very low penetration levels while covering a very large area. We are currently working on merging our RDD frames with Big Data so that clients can decide whether they want to use true RDD sample, enhanced RDD sample or standard lifestyle sample (collected from other surveys).

With the increase in available data and machine learning, we aim to scale up this technology rapidly to allow global coverage for the enhanced RDD sample.

About Sample Solutions

Sample Solutions BV is the leading telephone sample provider for B2B, B2C and lifestyle sample with global coverage. On a yearly basis we provide more than 40 million consumer records and more than 10 million B2B records.

Feel free to get in contact with our client service managers to discuss how our telephone sample can improve your CATI research project.
