By Sam Neufeld
As exhibited at the Government of India’s inaugural Artificial Intelligence (AI) conference, RAISE 2020, there is a growing recognition that AI is poised to significantly accelerate progress towards improved developmental and socioeconomic outcomes. In fact, studies published by Nature and the McKinsey Global Institute demonstrate that AI could enable improvements in 100+ targets across the United Nations’ 17 Sustainable Development Goals (SDGs). This progress is generally expected to come from AI-driven technological interventions.
However, the potential of AI-driven ‘public interest’ technologies in fields like agriculture, health, and education depends largely on the availability of suitable datasets. Refined datasets are required to train autonomous algorithms (what we refer to as AI) that in turn power enhanced analytics, perception, decision-making, and prediction technologies. The more extensive and high-quality the data fed to a model, the more accurate, effective, and impactful an AI-enabled technology can become.
Presently, the majority of well-known AI-enabled technologies rely on proprietary datasets owned by large technology corporations and their subsidiaries. These companies can leverage their enormous troves of user data or pay large sums to acquire datasets in order to fuel internal research and development, providing a significant competitive advantage. In contrast, cash-strapped startups often struggle to collect, procure, and refine a sufficient amount of data. As a result, a lack of adequate data is one of the biggest barriers experienced by Indian AI startups, as well as by public agencies and civic organisations attempting to harness AI for social good.
The scope and value of open (government) data
Each year, the Union and State governments combined spend hundreds of billions of dollars to provide services like healthcare, food, insurance, education, skill training, housing, and fuel to more than 500 million Indian citizens. Moreover, they play a pivotal role in shaping the nation’s conditions for energy, infrastructure, transportation, and environmental stewardship. In the course of these operations, they organically collect tremendous amounts of data. At this time, the government holds more data in sectors like agriculture, rural development, education, migration, and energy than any other stakeholder in India, by a wide margin.
In many countries with highly developed technology ecosystems, startups turn to government data repositories, also referred to as Open Data platforms, to source datasets in the aforementioned sectors. However, in India, efforts to provide the public with access to government data remain significantly constrained. Presently, the Union government’s primary Open Data initiative, the Open Government Data Platform, hosts hundreds of thousands of datasets, but it is compromised by issues of quality, inconsistent standardisation, and gaps in high-value data.
Given that data preparation and engineering tasks already comprise approximately 80 percent of time spent on AI projects, these circumstances make it exceedingly difficult for AI startups to leverage the Open Government Data Platform’s datasets. In some cases, Indian technologists opt to use data from nations with mature Open Data programmes. However, given the significant variation in demographic, socioeconomic, epidemiological, and climatic circumstances, data collected from these geographies has limited use in informing AI models deployed in India’s diverse locales.
While the Open Government Data Platform boasts a significant repository, the Union government’s National Strategy for AI (NSAI) asserts that the “Government of India has large amounts of data lying in silos across ministries.” If the vast majority of India’s Union and State government datasets in areas of social consequence were available to the public in a high-quality, machine-readable format, it would greatly enhance the prospect for domestic development of public interest AI. This strongly aligns with the government’s socially-oriented AI strategy. Captured by the mantra of ‘AI for All,’ India’s NSAI commits to leveraging the technology to drive inclusive growth and progress in fields like agriculture, health, and education.
As the Union government prepares to launch the forthcoming National Programme for AI (also known as the AI Mission), a nodal initiative informed by the NSAI to foster the nation’s AI ecosystem, it is critical that government, industry, and civil society address the fundamental bottlenecks and inefficiencies impeding the nation’s Open Data efforts. Furthermore, they should launch an initiative dedicated to fostering Open Data in a way that maximises its potential use by Indian startups, corporations, and other beneficiaries.
Building institutional capacity and improving the budget allocation for Open Data
The technical nature of the Open Data mandate entails the collection, processing, and publication of data at each government institution. These functions demand highly-skilled human resources, ample capacity, and sophisticated Standard Operating Procedures. However, according to Gaurav Godhwani, CEO of CivicDataLab, “most government departments are over-worked with their existing responsibilities, and Open Data work comes as an additional burden on already tired shoulders.”
Given the limited bandwidth of existing staff, the government should consider augmenting institutional capacity by providing a budget to onboard dedicated data management and Open Data personnel in each government agency. Just as IT systems require regular maintenance from specialised teams with sufficient manpower and time, institutional data requires personnel with the time and skills to effectively collect, refine, and curate datasets. The insertion of dedicated data cadres into government institutions would almost certainly expand the Indian bureaucracy’s ability to store data internally and release relevant datasets to the public.
Simultaneously, the government should invest in educational and training programmes aimed at enhancing the general capacity of existing Open Data personnel. Presently, training is largely limited to guidance for contributing datasets to and operating the Open Government Data Platform. Programmes should be expanded to include technical skills training as well as support for the creation or refining of institution-specific Standard Operating Procedures relevant to the collection, aggregation, and engineering of datasets. This effort would facilitate the publication of an increased number of high-quality and high-value datasets, as well as heighten awareness of emerging sector-specific privacy, security, and confidentiality requirements.
Codifying the open government data mandate into law
Under the 2012 National Data Sharing and Accessibility Policy (NDSAP), where India’s Open Data mandate originated, each government agency’s Chief Data Officer receives broad discretion in curating the datasets their institution will contribute to the Open Government Data Platform. However, this has led to non-transparent, inconsistent, and untargeted data sharing practices across institutions. Furthermore, NDSAP neglected to create any meaningful incentive or accountability infrastructure for Open Data personnel, leading to concerning gaps in motivation and performance.
To overcome these impediments, the Government of India should consider enacting comprehensive Open Data legislation. An Open Data Act should incorporate a robust framework of checks and balances for Open Data personnel, delineate extensive criteria for the identification and selection of datasets, and expand India’s Open Data mandate to State governments (presently, it is only applicable to Union government entities).
Consolidating disparate Open Data initiatives by enabling interoperability and promoting collaboration
Since the Union government launched the Open Government Data Platform in 2012, a range of governmental, academic, and civil society organisations have launched separate Open Data platforms. Some notable examples include the India Urban Data Exchange (The Ministry of Housing and Urban Affairs Smart City Mission), Pune DataStore (Municipality of Pune), The India Observatory (Foundation for Ecological Security), Open Budgets India (Centre for Budget and Governance Accountability), The India Data Portal (Indian School of Business), and the forthcoming National Data Analytics Platform (NITI Aayog), amongst others.
While these efforts are a testimony to the combined will and capacity of the broader Indian Open Data community, they also present critical gaps and inefficiencies. At present, the aforementioned platforms are not interoperable, meaning that they act as siloed data repositories and do not share information with each other. Even if a technologist is familiar with the Government of India’s flagship platform, it is probable that they will be unaware of many other existing Open Data initiatives in India. This means that the vast stores of data in other repositories remain by and large untapped by innovators in the AI community.
In the interest of amplifying the use of datasets hosted across various platforms, each institution hosting an Open Data platform should strive to retroactively incorporate broad data interoperability. This would allow datasets hosted on each platform to be automatically pushed to and be accessible via the others, allowing technologists and other beneficiaries to make better use of the collective pool of Open Data. The Union government could lead such an effort by creating and open-sourcing software communication infrastructure (e.g., APIs) that connects disparate Indian Open Data platforms. A precedent for such an effort can be drawn from the India Urban Data Exchange, which intends to facilitate the transfer of urban-focused data between various government-operated repositories. This model could be expanded to include non-governmental data platforms and a larger range of sectors.
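To make the idea of interoperability concrete, the following is a minimal sketch of how records from two platforms with different metadata layouts could be normalised into one shared catalogue. Everything in it is an assumption made for illustration: the platform names, field names, schema, and URLs are hypothetical and do not describe the actual interfaces of the Open Government Data Platform, the India Urban Data Exchange, or any other repository named above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical common catalogue entry shared by all platforms."""
    platform: str
    identifier: str
    title: str
    sector: str
    url: str


def adapt_platform_a(raw: dict) -> DatasetRecord:
    # Assumed layout: flat keys "id", "name", "category", "link".
    return DatasetRecord("platform_a", str(raw["id"]), raw["name"].strip(),
                         raw["category"].strip().lower(), raw["link"])


def adapt_platform_b(raw: dict) -> DatasetRecord:
    # Assumed layout: nested "metadata" block with different key names.
    meta = raw["metadata"]
    return DatasetRecord("platform_b", str(raw["uid"]), meta["title"].strip(),
                         meta["theme"].strip().lower(), raw["download"])


Source = tuple[Callable[[dict], DatasetRecord], Iterable[dict]]


def build_federated_index(sources: Iterable[Source]) -> dict:
    """Normalise every platform's records and merge them into one index."""
    index = {}
    for adapter, records in sources:
        for raw in records:
            record = adapter(raw)
            # Keying on (platform, identifier) avoids ID clashes across platforms.
            index[(record.platform, record.identifier)] = record
    return index


def find_by_sector(index: dict, sector: str) -> list:
    """Return all records for a sector, regardless of hosting platform."""
    wanted = sector.strip().lower()
    matches = (r for r in index.values() if r.sector == wanted)
    return sorted(matches, key=lambda r: (r.platform, r.identifier))


# Invented sample records standing in for two platforms' catalogue exports.
PLATFORM_A = [
    {"id": 1, "name": "District crop yields", "category": "Agriculture",
     "link": "https://example.org/a/1"},
]
PLATFORM_B = [
    {"uid": "x9", "metadata": {"title": "Soil health survey", "theme": "agriculture"},
     "download": "https://example.org/b/x9"},
    {"uid": "y2", "metadata": {"title": "Clinic locations", "theme": "Health"},
     "download": "https://example.org/b/y2"},
]

index = build_federated_index([(adapt_platform_a, PLATFORM_A),
                               (adapt_platform_b, PLATFORM_B)])
for record in find_by_sector(index, "Agriculture"):
    print(record.platform, record.title, record.url)
```

In this sketch, each platform keeps its own internal format and only needs a small adapter to join the shared index, which is one reason a common metadata schema is often a lighter first step than migrating every repository to a single system.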
In addition, as is demonstrated by a range of public-private co-creation initiatives, there is vast potential to leverage public-private-civic partnerships to improve the status of key governmental Open Data programmes. For example, CivicDataLab has been at the forefront of upskilling state-level officials to open up government data in sectors like finance and budgeting.
Streamlining focus on Open Data for AI
Alongside the nation’s forthcoming National Programme for AI, the government should constitute a committee drawn from industry, academia, and civil society dedicated to creating the conditions necessary to leverage the nation’s Open Data to fuel AI innovation. In the United States, generally considered the world’s leading AI ecosystem, the federal government likewise recently launched a task force to make more government data available to AI innovators and researchers. Echoing similar sentiments, the European Commission (the EU’s executive body) has identified Open Data as a “critical asset for the development of new technologies, such as artificial intelligence (AI), which require the processing of vast amounts of high-quality data.”
These sentiments underpin GAIA-X, a European initiative to develop infrastructure for a data ecosystem that promotes innovation while also meeting privacy, transparency, security, and rights standards.
In India, this effort should begin with a demand-side assessment of the Open Data ecosystem via consultations with AI innovators at startups, domestic corporates, and multinationals. After evaluating where stakeholders in AI perceive the specific gaps and limitations in Open Data, the government can accordingly develop a roadmap for prioritised improvements. Moreover, this information could help officials in the National Informatics Centre and/or the Department of Science and Technology to develop informed criteria for the identification and publication of high-value Open Data for use in AI. These criteria could then be translated into operational guidelines, which could be instituted in the Open Data practices of government personnel through incorporation in new NDSAP (Open Data Policy) Implementation Guidelines.
In the wake of the devastation wreaked by the COVID-19 pandemic, a unified campaign to leverage Open Data for AI, coupled with the aforementioned fundamental reforms required to bolster Open Data, could precipitate meaningful social and developmental impacts in India by unlocking the potential for disruptive technological innovation in sectors like health, agriculture, and education. Meanwhile, it is important that Personal Data Protection legislation be enacted by the Indian Parliament as soon as possible to institute measures that safeguard the privacy, security, and rights of individuals, in particular with respect to the release of (anonymised) sensitive data.
The views expressed above belong to the author(s).