How to build a secure cell phone data pipeline

Data gaps and innovation

In times of crisis, traditional data sources such as censuses and surveys quickly become outdated. Call Detail Record (CDR) data is generated with every call and SMS and includes timestamp and location information. In the context of COVID-19 and other crises, the analysis of CDR data can inform policy. However, this requires an institutional framework, infrastructure and analytical capabilities to build a data pipeline that lets policy makers react quickly to a changing situation. In a recently published working paper, we outline how such a data pipeline was built in The Gambia.

With the outbreak of COVID-19 in The Gambia, the government declared a national health emergency and imposed restrictions on economic activity. It became clear that continued social distancing would impose high costs on households and businesses, and there was interest in creating an evidence base to track the impact of these restrictions on mobility patterns. The World Bank, Gambia Bureau of Statistics (GBoS), Public Utilities Regulatory Authority (PURA) and the University of Tokyo had previously formed a partnership to study the use of CDR data to create an evidence base for policy design.

Creating a CDR data pipeline

The partnership created the institutional basis for the regulator to collect aggregated, anonymized data from mobile network operators (MNOs). With the interest sparked by the COVID-19 emergency, the team set out to build a data pipeline that would allow for rapid analysis while adhering to best practices for data confidentiality and security. This included the following activities:

Data access

  • Strengthening Existing Data Collection Protocols: As part of its mandate to monitor the quality of mobile network services, PURA established a central data repository. After obtaining the necessary approvals, the team worked with the system administrator to include additional indicators as part of this routine monitoring for use in the analysis. This minimized the reporting burden for MNOs and made compliance easier.
  • Defining the processes needed to ensure data protection and privacy: Data protection and privacy are of the utmost importance for PURA and MNOs[1], which operate strictly in accordance with national laws. The data used for this initiative was anonymized to protect personal information. All data processing took place on PURA’s premises, and the team only had access to aggregated summary statistics. Open source tools for data de-identification and analysis were provided, along with hands-on training for PURA and MNO technicians.
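
The de-identification step can be illustrated with a minimal sketch. The working paper's actual tools are not described here, so everything below is an assumption made for illustration: the record layout (`msisdn`, `timestamp`, `cell_id`), the function names, and the choice of a keyed hash (HMAC-SHA256) as the pseudonymization technique. A keyed hash replaces the raw phone number with a stable pseudonym that cannot be reversed without the secret key. On its own this is pseudonymization rather than full anonymization, which is why aggregation before release still matters.

```python
import hashlib
import hmac


def pseudonymize(subscriber_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of the subscriber identifier."""
    digest = hmac.new(secret_key, subscriber_id.encode("utf-8"),
                      digestmod=hashlib.sha256)
    return digest.hexdigest()


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Copy a CDR record, replacing the raw identifier with its pseudonym."""
    # Drop the raw phone number; keep every other field unchanged.
    cleaned = {k: v for k, v in record.items() if k != "msisdn"}
    cleaned["subscriber"] = pseudonymize(record["msisdn"], secret_key)
    return cleaned


# Hypothetical key and record, for illustration only.
key = b"example-key-held-by-the-operator"
raw = {"msisdn": "2205550100",
       "timestamp": "2020-04-01T09:15:00",
       "cell_id": "C017"}
safe = deidentify(raw, key)
```

In a setup like this, the key would stay with the party that holds the raw data, so that analysts working with the pseudonymized records cannot recompute the mapping.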

A system on the regulator’s premises

  • System Requirements Specification: The system requirements were specified based on the data size and computing power required for the analysis using a Hadoop platform.[2] To ensure an extra level of security, data collected for analysis was protected by firewalls and stored on a separate on-site server, with remote access strictly limited to key researchers and system administrators.
  • Procurement of hardware and installation of a temporary machine: Once the system requirements were specified, hardware procurement was initiated. Faced with procurement delays, the team installed a makeshift machine on the regulator’s premises. It was a small machine capable of ingesting the data provided by MNOs and running analyses that are not computationally intensive, such as data quality assessments, while ensuring that the data stayed on-premises.
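
A data quality assessment of the kind a small machine can handle might look like the sketch below. It is not the team's actual procedure. The schema (`subscriber`, `timestamp`, `cell_id`), the function name `quality_report` and the three checks (missing fields, unparseable timestamps, exact duplicates) are assumptions chosen to show that such checks need only a single pass over the records and very little memory beyond the set of records already seen.

```python
from collections import Counter
from datetime import datetime

# Hypothetical schema: the fields every record is expected to carry.
REQUIRED_FIELDS = ("subscriber", "timestamp", "cell_id")


def quality_report(records):
    """Count common problems in an iterable of CDR records (dicts)."""
    issues = Counter()
    seen = set()
    for rec in records:
        issues["total"] += 1
        # A field counts as missing if it is absent or empty.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            issues["missing_field"] += 1
            continue
        try:
            # Timestamps are assumed to be ISO 8601 strings.
            datetime.fromisoformat(rec["timestamp"])
        except ValueError:
            issues["bad_timestamp"] += 1
            continue
        fingerprint = (rec["subscriber"], rec["timestamp"], rec["cell_id"])
        if fingerprint in seen:
            issues["duplicate"] += 1
        seen.add(fingerprint)
    return dict(issues)
```

A report like this could be shared with each operator after a delivery, so that problems are caught before the data reaches the main analysis platform.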

Capacity building

  • Workshops and relationship building: Prior to the outbreak of COVID-19, the team had organized a series of workshops and training sessions that helped all parties understand their roles and skills. These exercises built trust and provided an opportunity to discuss experiences from other countries. This groundwork helped sustain collaboration when all interaction had to move online due to COVID-19.

Deploying the COVID-19 Response Project

The data pipeline takes anonymized CDR data provided by MNOs and aggregates it on the premises of the regulator before the results are made available to decision makers. The team used methods developed for creating standardized indicators to analyze human mobility patterns during COVID-19.[3] Findings showed that the lockdown disproportionately affected urban areas by restricting economic activity, a result that should inform relief and recovery efforts.
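
To make the aggregation step concrete, here is a minimal sketch of one indicator that is commonly derived from CDR data: the number of distinct subscribers observed per district per day. It is not the code referenced in footnote [3]. The record layout, the cell-to-district lookup, the function names and the suppression threshold `MIN_COUNT` are all hypothetical. The sketch also drops small counts before release, one common way to reduce re-identification risk in aggregated outputs.

```python
from collections import defaultdict

MIN_COUNT = 15  # hypothetical suppression threshold, not from the paper


def unique_subscribers(records, cell_to_district):
    """Return {(date, district): number of distinct subscribers seen}."""
    seen = defaultdict(set)
    for rec in records:
        day = rec["timestamp"][:10]  # date part of an ISO 8601 timestamp
        district = cell_to_district[rec["cell_id"]]
        seen[(day, district)].add(rec["subscriber"])
    return {key: len(subs) for key, subs in seen.items()}


def suppress_small(counts, min_count=MIN_COUNT):
    """Drop entries whose count is below the threshold before release."""
    return {key: n for key, n in counts.items() if n >= min_count}


def percent_change(baseline, current):
    """Percentage change of `current` relative to a non-zero `baseline`."""
    return 100.0 * (current - baseline) / baseline
```

Comparing a district's daily count during restrictions with its pre-restriction baseline through `percent_change` gives a simple measure of how much activity fell, and only these aggregated figures would need to leave the regulator's premises.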

The results of the analysis were submitted to the Ministry of Finance and the Ministry of Health. The team argued that lessons from the COVID-19 use case could inform targeted testing initiatives by focusing efforts on high-mobility areas. Where a full lockdown is not possible, the analysis could also shed light on where social distancing measures should be enforced to reduce the risk of transmission.

Figure. Mobility restrictions disproportionately affect urban areas

Scaling up for greater impact

The paper demonstrates the potential of CDR data for decision making. This approach offers the opportunity to leapfrog existing limitations in developing countries’ data collection capacity by leveraging data that is available in real-time, highly localized and cost-effective. However, as exemplified by this experience, the use of CDR data requires investment in the institutional and organizational framework of national statistical systems, including the necessary IT infrastructure and technical capacity. Once established, a CDR data pipeline can become an indispensable tool for government planning and disaster response.

[1] The data used for this project has been anonymized and no individually identifiable information has been included. Personal data can only be processed on PURA premises.

[2] Hadoop is a suite of open-source software for data-intensive, distributed applications, designed to handle problems involving massive amounts of data and computation.

[3] These tools were proposed by the World Bank’s COVID-19 Mobility Task Force, and the code to calculate the indicators is maintained as open source. See also Flowminder COVID-19 resources: