We apply machine learning and artificial intelligence to combat infectious disease threats. During the COVID-19 pandemic, we demonstrated this with data-driven approaches that integrated diverse datasets for genomic surveillance, prediction of the spread of infection, and analysis of misinformation. These efforts aimed to inform effective policy interventions.

We developed Strainflow, a model of SARS-CoV-2 genome sequences that treats each sequence as a document and each codon as a word, and learns the codon context of 0.9 million spike genes with the skip-gram algorithm.
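
The "sequences as documents, codons as words" idea can be illustrated with a minimal sketch. It only shows how a nucleotide string might be split into codons and turned into the (target, context) pairs that a skip-gram model trains on. The function names, the window size, and the toy sequence are assumptions made for illustration, not the Strainflow implementation.

```python
from typing import List, Tuple


def codons(sequence: str) -> List[str]:
    """Split a nucleotide sequence into non-overlapping codons (the 'words')."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon, if any
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def skipgram_pairs(words: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs


# Toy fragment chosen for illustration; it is not a real spike gene.
toy_sequence = "ATGTTTGTTTTTCTT"
document = codons(toy_sequence)
pairs = skipgram_pairs(document, window=1)
print(document)
print(pairs[:3])
```

In a full pipeline, pairs like these would feed a skip-gram embedding model so that each codon receives a vector reflecting the codons that tend to surround it.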

Sargun Nagpal, Ridam Pal, Ashima, Ananya Tyagi, Sadhana Tripathi, Aditya Nagori, Saad Ahmad, Hara Prasad Mishra, Tavpritesh Sethi, Rintu Kutum

Characterizing the Emotion Carriers of COVID-19 Misinformation and Their Impact on Vaccination Outcomes in India and the United States

The COVID-19 infodemic had an unprecedented impact on health behaviors and outcomes at a global scale. While many studies have focused on a qualitative and quantitative understanding of misinformation, including sentiment analysis, there is a gap in understanding the emotion carriers of misinformation and their differences across geographies. In this study, we characterized emotion carriers and their impact on vaccination rates in India and the United States. A manually labelled dataset was created from 2.3 million tweets and collated with three publicly available datasets (CoAID, AntiVax, CMU) to train deep learning models for misinformation classification. Tweets labelled as misinformation were further analyzed for behavioral aspects by leveraging Plutchik Transformers to determine the emotion of each tweet. Time series analysis was conducted to study the impact of misinformation on spatial and temporal characteristics. Further, categorical classification was performed using transformer models to assign categories to the misinformation tweets. Word2Vec+BiLSTM was the best model for misinformation classification, with an F1-score of 0.92. The US had the highest proportion of misinformation tweets (58.02%), followed by the UK (10.38%) and India (7.33%). Disgust, anticipation, and anger were associated with an increased prevalence of misinformation tweets. Disgust was the predominant emotion associated with misinformation tweets in the US, while anticipation was the predominant emotion in India. For India, the misinformation rate exhibited a lead relationship with vaccination, while in the US it lagged behind vaccination. Our study showed that emotions acted as differential carriers of misinformation across geography and time. These carriers can be monitored to develop strategic interventions for countering misinformation, leading to improved public health.
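
A lead or lag relationship between two time series, such as a misinformation rate and a vaccination rate, can be explored with a lagged cross-correlation. The sketch below is one common way to do this and is not necessarily the time series analysis used in the study. The function names and the toy series are assumptions for illustration only.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def lagged_correlations(x: Sequence[float], y: Sequence[float],
                        max_lag: int) -> Dict[int, float]:
    """Correlate x[t] with y[t + lag]; a positive best lag means x leads y."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        result[lag] = pearson(a, b)
    return result


# Toy data: `follower` repeats `leader` two steps later (hypothetical values).
leader = [1, 3, 2, 5, 4, 6, 3, 7, 5, 8, 6, 9]
follower = [0, 0] + leader[:-2]
correlations = lagged_correlations(leader, follower, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag)
```

A positive best lag suggests the first series leads the second, and a negative one suggests it lags. Correlation at a lag is descriptive only and does not by itself establish causation.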

Ridam Pal, Sanjana S, Deepak Mahto, Kriti Agrawal, Gopal Mengi, Sargun Nagpal, Akshaya Devadiga, Tavpritesh Sethi

Mining Trends of COVID-19 Vaccine Beliefs on Twitter With Lexical Embeddings: Longitudinal Observational Study

Social media plays a pivotal role in disseminating news globally and acts as a platform for people to express their opinions on various topics. A wide variety of views accompany COVID-19 vaccination drives across the globe, often colored by emotions that change along with rising cases, approval of vaccines, and multiple factors discussed online.

Harshita Chopra, Aniket Vashishtha, Ridam Pal, Ashima Ashima, Ananya Tyagi, Tavpritesh Sethi

Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study

Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpuses, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient.

Ridam Pal, Harshita Chopra, Raghav Awasthi, Harsh Bandhey, Aditya Nagori, Amogh Gulati, Ponnurangam Kumaraguru, Tavpritesh Sethi

Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning

The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility, or pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised and causally predictive model using unsupervised latent space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences for the period December 2019 to June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases 2 months ahead of the Delta and Omicron surges in most countries, including the prediction of a surge in India as early as the beginning of November 2021. Entropy analysis of Strainflow's unsupervised embeddings clearly reveals the explore-exploit cycles in genomic feature space, thus adding interpretability to the deep learning based model. We also conducted codon-level analysis of our model to assess the interpretability and biological validity of our unsupervised features. The Strainflow application is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.
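
The abstract does not say how entropy was computed on the embeddings. As a rough illustration of the general idea, the sketch below computes the Shannon entropy of a histogram over one latent dimension: values spread across many bins give high entropy, and values concentrated in a few bins give low entropy. The function name, the binning scheme, and the toy values are assumptions, not the Strainflow method.

```python
from collections import Counter
from math import log2
from typing import Sequence


def histogram_entropy(values: Sequence[float], bins: int = 4) -> float:
    """Shannon entropy (in bits) of values discretized into equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: no uncertainty
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical values of one latent dimension in two different months.
spread_out = [0.0, 1.0, 2.0, 3.0]       # values scattered across the range
concentrated = [0.0, 0.1, 0.05, 3.0]    # most values clustered together
print(histogram_entropy(spread_out), histogram_entropy(concentrated))
```

Tracking a quantity like this over time is one way a rise and fall in diversity could be made visible, under the stated assumptions.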

Sargun Nagpal, Ridam Pal, Ashima, Ananya Tyagi, Sadhana Tripathi, Aditya Nagori, Saad Ahmad, Hara Prasad Mishra, Rintu Kutum, Tavpritesh Sethi

Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature with Dynamic Word Embedding Networks and Machine Learning

Ridam Pal, Harshita Chopra, Raghav Awasthi, Harsh Bandhey, Aditya Nagori, Amogh Gulati, Ponnurangam Kumaraguru, Tavpritesh Sethi

(Un)Masked COVID-19 Trends from Social Media

The adoption of non-pharmaceutical interventions and their surveillance is critical for detecting and stopping possible transmission routes of COVID-19. A study of the adoption of these interventions can help shape public health decisions. The efficacy of non-pharmaceutical interventions can also be affected by public behaviours at events such as election rallies, festivals, and protests, as captured on social media. Social media analytics can offer crucial public health insights. Here we examined mask use and mask fit in the United States, especially during the first large-scale public gathering after the start of the pandemic, the Black Lives Matter (BLM) protests. This study aimed to analyze the use and fit of face masks and social distancing in the USA through publicly available social media images from six cities and the BLM protests. We collected and analyzed 2.04 million publicly available social media images from the six cities between February 1, 2020, and May 31, 2020. We used correlation tests to examine the relationships between online mask usage trends and COVID-19 cases. We looked for significant changes in mask-wearing patterns and group posting before and after important policy decisions. For the BLM protests, we analyzed 195,452 posts from New York and Minneapolis from May 25, 2020, to July 15, 2020. We examined differences in the adoption of preventive measures in the BLM protests through the mask-fit score. The average percentage of group pictures dropped from 8.05% to 4.65% after the lockdown week. New York City, Dallas, Seattle, New Orleans, Boston, and Minneapolis observed increases of 5%, 7.4%, 7.4%, 6.5%, 5.6%, and 7.1%, respectively, in mask wearing online between February 2020 and May 2020. Boston and Minneapolis observed significant increases of 3% and 7.4% in mask wearing after the mask mandates. Differences of 6.2% and 8.3% were found in the share of group pictures between BLM and non-BLM posts for New York City and Minneapolis, respectively. In contrast, the difference between BLM and non-BLM posts in the percentage of masked faces in group pictures was 29% and 20.1% for New York City and Minneapolis, respectively. Of the masked faces in protests, 35% wore the mask with a fit score greater than 80%. The study finds a significant drop in group posting when the stay-at-home laws were applied and a significant increase in mask wearing for two of the three cities when the mask mandates were applied. Although a general positive trend towards mask wearing and social distancing is observed, a high percentage of posts did not adhere to the guidelines. BLM-related posts were found to capture a lack of seriousness towards safety measures through a high percentage of group pictures and low mask-fit scores. Thus, the methodology provides a directional indication of how government policies can be indirectly monitored through social media.
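
The abstract does not name the test used to check for significant changes before and after a policy decision. One standard option for comparing two percentages is a two-proportion z-test, sketched below. The sample sizes are hypothetical and were chosen only so that the proportions match the reported 8.05% and 4.65% shares of group pictures. The function name is also an assumption.

```python
from math import erf, sqrt
from typing import Tuple


def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference of two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical counts: 10,000 images before and after the lockdown week.
z_stat, p_val = two_proportion_z(hits_a=805, n_a=10_000, hits_b=465, n_b=10_000)
print(round(z_stat, 2), p_val < 0.001)
```

With real data, the counts would be the number of group pictures and the total number of images in each period.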

Asmit Kumar Singh, Paras Mehan, Divyanshu Sharma, Rohan Pandey, Tavpritesh Sethi, Ponnurangam Kumaraguru

Learning the Mental Health Impact of COVID-19 in the United States with Explainable Artificial Intelligence

The COVID-19 pandemic has deeply affected the health, economic, and social fabric of nations. Identification of individual-level susceptibility factors may help people identify and manage their emotional, psychological, and social well-being.

Indra Prakash Jha, Raghav Awasthi, Ajit Kumar, Vibhor Kumar, Tavpritesh Sethi

Temperature and Humidity Do Not Influence Global COVID-19 Incidence as Inferred from Causal Models

The relationship between meteorological factors such as temperature and humidity and COVID-19 incidence is still unclear six months after the beginning of the pandemic. Some studies report an association between temperature and disease transmission, while others dispute it. This work aims to determine whether there is a causal association between temperature, humidity, and COVID-19 cases. Three different causal models were used to capture stochastic, chaotic, and symbolic time-series data and to provide a robust and unbiased analysis by constructing networks of causal relationships between the variables. The Granger causality, transfer entropy, and Convergent Cross-Mapping (CCM) methods were applied to data from regions with different temperatures and more than 50,000 cases as of 13 May 2020. The Granger causality test showed that temperature and daily new infections were causally linked only in Canada and the United Kingdom. Convergent Cross-Mapping gave the same result for India. The Granger causality test also showed that relative humidity was causally linked to daily new cases only in Russia. A Generalized Additive Model with a smoothing spline function was therefore fitted for these countries to understand the directionality. Combining the results of these models, we concluded that there is no evidence of a causal association between temperature, humidity, and COVID-19 cases.
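
To make the Granger causality idea concrete, the sketch below tests whether adding the previous value of one series improves a one-step prediction of another. It compares a restricted regression (the target's own lag only) with a full regression (own lag plus the other series' lag) using an F statistic. It is a single-lag, pure-Python illustration on synthetic data. The lag order, the function names, and the data are assumptions, and the study's own settings and software are not stated in the abstract.

```python
import random
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small square linear system by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(design: List[List[float]], target: Sequence[float]) -> float:
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(k)]
    coef = solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(design, target))


def granger_f(x: Sequence[float], y: Sequence[float]) -> float:
    """F statistic for 'x Granger-causes y' using one lag of each series."""
    target = y[1:]
    restricted = [[1.0, y[t]] for t in range(len(y) - 1)]    # intercept + own lag
    full = [[1.0, y[t], x[t]] for t in range(len(y) - 1)]    # plus lag of x
    rss_r, rss_u = rss(restricted, target), rss(full, target)
    dof = len(target) - 3  # observations minus parameters in the full model
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data in which y depends on the previous value of x.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f_x_to_y > f_y_to_x)
```

A large F statistic in one direction and a small one in the other is the pattern that suggests a directional predictive link. In practice the statistic would be compared against an F distribution, and the lag order would be chosen from the data.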

Raghav Awasthi, Aditya Nagori, Pradeep Singh, Ridam Pal, Vineet Joshi, Tavpritesh Sethi

A Counterfactual Graphical Model Reveals Economic and Sociodemographic Variables as Key Determinants of Country-Wise COVID-19 Burden

The novel coronavirus SARS-CoV-2, which originated in China a few months earlier, has dramatically enveloped the global population, crossing all boundaries and borders, infecting more than 5 million people and causing more than 300,000 deaths as of 21 May 2020. The distinct difference in disease burden (including infectivity and mortality) between regions across the globe is an enigma. Despite harboring 60% of the global population, Asia accounts for only ~18% of global cases and <10% of global mortality due to COVID-19. Western Europe (Italy, France, Spain, United Kingdom) and the USA account for about 50% and 70% of global cases and mortality, respectively, despite the fact that China continued to contribute >80% of global cases until the end of the first week of March. Currently, the two most populous countries in the world, India and China (accounting for 35% of the global population), together account for less than 4% and 3% of global cases and mortality, respectively. These observations have displayed temporal consistency, with an almost unchanged country-wise distribution of cases over the 1-month period, highlighting the impact of consistent factors that govern these epidemiologic associations.

Tavpritesh Sethi, Saurabh Kedia, Raghav Awasthi, Rakesh Lodha, Vineet Ahuja

Less Wrong COVID-19 Projections With Interactive Assumptions

The COVID-19 pandemic is an enigma, with uncertainty caused by biological and health-system factors. Although many models have been developed around the world, transparent models that allow interaction with their assumptions will become more important as we test various strategies for lockdown, testing, and social interventions and seek to enable effective policy decisions. In this paper we developed a suite of models to guide the development of policies under different scenarios when the lockdown opens. These were deployed as an interactive dashboard called COVision, which includes agent-based models (ABM) and classical compartmental models, i.e. the Susceptible-Infected-Recovered (SIR) and Susceptible-Exposed-Infected-Recovered (SEIR) approaches. Our tool allows simulation of scenarios by changing the strength of lockdown, basic reproduction number (R0), asymptomatic spread, testing rate, contact rate (Beta), recovery rate (Gamma), incubation period, and starting number of cases. We optimized the ABMs and classical compartmental models to fit the actual data; both performed well in terms of R-squared, root mean squared error (RMSE), and mean absolute percentage error (MAPE). Of the three models in our suite, the ABM captured the data better than SIR and SEIR and achieved an R-squared of 92.3% for India and 89% for Maharashtra for the next 30 days. We also computed R0 using the SIR and SEIR models; it was found to decrease over the different periods of lockdown, indicating the effectiveness of policies and interventions. Finally, we formulated ICU bed requirements using our best models. Our evaluation suggests that the ABMs captured the dynamic nature of the epidemic over a longer duration, while the classical SIR and SEIR models performed poorly over longer terms. The visual interactivity and the ability to simulate outcomes under different parameters will allow policymakers to make informed decisions when estimating the strength of lockdown to be implemented and the testing rates. Further, our models highlighted state-level differences in parameters such as R0 and contact rates and hence can be applied to state-specific decision making. An interactive dashboard has been hosted as a web server in the public domain for war-level monitoring of the COVID-19 pandemic in India.
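
For readers unfamiliar with compartmental models, the sketch below simulates a basic SIR model with simple Euler steps. It shows how the contact rate (beta) and recovery rate (gamma) drive the epidemic curve, and that in this basic model R0 equals beta divided by gamma. The parameter values and the function name are assumptions for illustration. This is not the COVision code and it does not include lockdown strength, testing, or asymptomatic spread.

```python
from typing import List, Tuple


def simulate_sir(population: float, initial_infected: float, beta: float,
                 gamma: float, days: int,
                 steps_per_day: int = 10) -> List[Tuple[float, float, float]]:
    """Return one (S, I, R) state per day, integrating the SIR equations with Euler steps."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters: R0 = beta / gamma = 2.5.
beta, gamma = 0.25, 0.1
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=beta, gamma=gamma, days=200)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("R0 =", beta / gamma, "peak day =", peak_day)
```

Lowering beta, for example to mimic a stricter lockdown, flattens and delays the peak. When beta falls below gamma, so that R0 is below 1, the number of infected people declines from the start.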

Aditya Nagori, Raghav Awasthi, Vineet Joshi, Suryatej Reddy Vyalla, Akhil Jarodia, Chandan Gupta, Amogh Gulati, Harsh Bandhey, Ponnurangam Kumaraguru, Tavpritesh Sethi

Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic

The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding the infodemic wave poses a major threat to people's health and governance systems. Since social media is the largest source of information, managing the infodemic requires not only mitigating misinformation but also an early understanding of the psychological patterns resulting from it. During the COVID-19 crisis, Twitter alone has seen a sharp 45% increase in the usage of its curated events page and a 30% increase in its direct messaging usage since March 6th, 2020. In this study, we analyze the psychometric impact and coupling of the COVID-19 infodemic with the official bulletins related to COVID-19 at the national and state level in India. We look at these two sources through a psycholinguistic lens of emotions and quantify the extent of, and the coupling between, the two. We modified Empath, a deep skip-gram based open-source lexicon builder, for effective capture of health-related emotions. We were then able to capture the time evolution of health-related emotions in social media and official bulletins. An analysis of lead-lag relationships between the time series of extracted emotions from official bulletins and social media using Granger causality showed that state bulletins were leading social media for some emotions, such as Medical Emergency. Further insights that are potentially relevant for policymakers and communicators actively engaged in mitigating misinformation are also discussed. Our paper also introduces CoronaIndiaDataset, the first social media based COVID-19 dataset at national and state levels from India, with over 5.6 million national and 2.6 million state-level tweets. Finally, we present our findings as COVibes, an interactive web application presenting psychometric insights drawn from the CoronaIndiaDataset at both the national and state levels.
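
A lexicon-based emotion score of the kind described here can be sketched as the fraction of words in a text that belong to each emotion category. The categories, the seed words, and the function name below are hypothetical and serve only as an illustration. They are not the lexicon built in the study or the API of any particular lexicon tool.

```python
import re
from typing import Dict, Set

# Hypothetical categories and seed words, for illustration only.
LEXICON: Dict[str, Set[str]] = {
    "medical_emergency": {"hospital", "ambulance", "oxygen", "icu"},
    "fear": {"afraid", "panic", "scared"},
}


def emotion_scores(text: str, lexicon: Dict[str, Set[str]] = LEXICON) -> Dict[str, float]:
    """Return, per category, the share of tokens in `text` found in that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in lexicon}
    return {
        category: sum(token in words for token in tokens) / len(tokens)
        for category, words in lexicon.items()
    }


scores = emotion_scores("Oxygen shortage at the hospital")
print(scores)
```

Averaging such scores over all tweets or bulletins from one day would give one point per category per day, which is the kind of series a lead-lag analysis can then compare.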

Baani Leen Kaur Jolly, Palash Aggrawal, Amogh Gulati, Amarjit Singh Sethi, Ponnurangam Kumaraguru, Tavpritesh Sethi

CovidNLP: A Web Application for Distilling Systemic Implications of COVID-19 Pandemic with Natural Language Processing

The flood of conflicting COVID-19 research has revealed that COVID-19 continues to be an enigma. Although more than 14,000 research articles on COVID-19 have been published as the disease has taken on pandemic proportions, clinicians and researchers are struggling to distill knowledge for furthering clinical management and research. In this study, we address this gap for a targeted user group, i.e. clinicians, researchers, and policymakers, by applying natural language processing to develop the CovidNLP dashboard in order to speed up knowledge discovery. The WHO has created a repository of more than 5,000 peer-reviewed and curated research articles on varied aspects including epidemiology, clinical features, diagnosis, treatment, social factors, and economics. We summarised all the articles in the WHO database with an extractive summarizer, then explored the feature space using word embeddings, which were used to visualize the summarized associations of COVID-19 found in the text. Clinicians, researchers, and policymakers will discover not only the direct effects of COVID-19 but also its systemic implications, such as the anticipated rise in TB and cancer mortality due to the non-availability of drugs during the export lockdown, as highlighted by our models. These findings demonstrate the utility of mining massive literature with natural language processing for rapid distillation and knowledge updates. This can help users understand, synthesize, and take pre-emptive action with the available peer-reviewed evidence on COVID-19. Our models will be continuously updated with new literature, and we have made our resource CovidNLP publicly available in a user-friendly fashion at
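
An extractive summarizer selects existing sentences rather than generating new text. The abstract does not say which summarizer was used, so the sketch below shows one simple frequency-based variant: each sentence is scored by the average corpus frequency of its non-stopword terms, and the top-scoring sentences are returned in their original order. The stopword list, the function name, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "and", "in", "was", "a", "of", "to", "is"}  # tiny illustrative list


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9-]+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(tokenize(text))

    def score(sentence: str) -> float:
        tokens = tokenize(sentence)
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


sample = ("Masks reduce transmission. "
          "Masks and distancing reduce transmission in crowded settings. "
          "The weather was pleasant.")
summary = summarize(sample, max_sentences=2)
print(summary)
```

Sentences built from terms that recur across the text score highest, so the off-topic sentence in the sample is dropped.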

Raghav Awasthi, Ridam Pal, Harshita Chopra, Harsh Bandhey, Pradeep Singh, Aditya Nagori, Suryatej Reddy, Amogh Gulati, Ponnurangam Kumaraguru, Tavpritesh Sethi