AI for Genomics

Genome sequence data can be used to predict new caseloads of COVID-19 strains and other related disorders. Sequencing data are routinely made available through national and global consortia for genomic surveillance of SARS-CoV-2.



XpresssionSuite, by IIIT-Delhi, is a user-friendly web platform for analyzing gene expression data. Its advanced tools help researchers find differentially expressed genes and pathways. Interactive visualizations aid data exploration, leading to biological discoveries.

Alok Anand, Manas Pratiti

Visit Project
Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning

The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility or pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised and causally predictive model using unsupervised latent space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences for the period December 2019 to June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases 2 months ahead of the Delta and Omicron surges in most countries, including the prediction of a surge in India as early as the beginning of November 2021. Entropy analysis of Strainflow unsupervised embeddings clearly reveals the explore-exploit cycles in genomic feature-space, thus adding interpretability to the deep learning based model. We also conducted codon-level analysis of our model for interpretability and biological validity of our unsupervised features. The Strainflow application is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.
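The entropy analysis mentioned above can be illustrated with a minimal sketch. It assumes Shannon entropy computed over a histogram of one embedding dimension's values; the function name `shannon_entropy`, the bin count, and the sample values are hypothetical and are not taken from Strainflow.

```python
import math
from collections import Counter

def shannon_entropy(values, bins=4):
    """Shannon entropy (in bits) of values discretized into equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: no uncertainty
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    idx = [min(int((v - lo) / width), bins - 1) for v in values]
    counts = Counter(idx)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical values of one latent dimension in two time windows.
concentrated = [0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.10, 0.90]
spread = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
print(shannon_entropy(concentrated))  # lower entropy
print(shannon_entropy(spread))        # higher entropy
```

Under this assumed definition, a window whose values cluster in one region gives low entropy and a window whose values spread across the range gives high entropy.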

Read more

Sargun Nagpal, Ridam Pal, Ashima, Ananya Tyagi, Sadhana Tripathi, Aditya Nagori, Saad Ahmad, Hara Prasad Mishra, Rintu Kutum, Tavpritesh Sethi

Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning

The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we developed Strainflow to learn the latent dimensions of 0.9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming caseloads of SARS-CoV-2. In our Strainflow model, SARS-CoV-2 genome sequences were treated as documents, and codons as words, to learn unsupervised codon embeddings (latent dimensions). We discovered that codon-level changes lead to a change in the entropy of the latent dimensions. We used a machine learning algorithm to find the most relevant latent dimensions, called Dimensions of Concern (DoCs), of SARS-CoV-2 spike genes, and demonstrated their potential to provide a lead time for predicting new caseloads in several countries. The DoCs capture codons associated with global Variants of Concern (VOCs) and Variants of Interest (VOIs), and may be surveilled to predict country-specific emergence and spread of SARS-CoV-2 variants.
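The "sequences as documents, codons as words" idea can be sketched as a tokenization step. This is only an illustration: the function name `codon_tokens`, the `frame` parameter, and the example string are assumptions, and the abstract does not say how Strainflow handles reading frames or ambiguous bases.

```python
def codon_tokens(sequence, frame=0):
    """Split a nucleotide sequence into non-overlapping codons ("words")."""
    seq = sequence.upper()
    # Drop any trailing bases that do not fill a complete codon.
    end = len(seq) - ((len(seq) - frame) % 3)
    return [seq[i:i + 3] for i in range(frame, end, 3)]

# An illustrative nucleotide string; the last two bases are discarded.
document = codon_tokens("ATGTTTGTTTTTCTTGT")
print(document)  # ['ATG', 'TTT', 'GTT', 'TTT', 'CTT']
```

A list of such token lists, one per genome, is the kind of corpus that a word-embedding model takes as input.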

Read more

Sargun Nagpal, Ridam Pal, Ashima, Ananya Tyagi, Sadhana Tripathi, Aditya Nagori, Saad Ahmad, Hara Prasad Mishra, Rintu Kutum, Tavpritesh Sethi

Evaluating Sample Augmentation in Microarray Datasets with Generative Models: A Comparative Pipeline and Insights in Tuberculosis

Read more

Ayushi Gupta, Saad Ahmad, Atharva Sune, Chandan Gupta, Harleen Kaur, Rintu Kutum, Tavpritesh Sethi

Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature with Dynamic Word Embedding Networks and Machine Learning

COVID-19 knowledge has been changing rapidly with the fast pace of information that accompanied the pandemic. Since peer-reviewed research is a trusted source of evidence, capturing and predicting the emerging themes in COVID-19 literature are crucial for guiding research and policy. Machine learning, natural language processing and dynamical networks have the potential to enable rapid distillation and prediction of actionable insights for ending the pandemic.

Read more

Ridam Pal, Harshita Chopra, Raghav Awasthi, Harsh Bandhey, Aditya Nagori, Amogh Gulati, Ponnurangam Kumaraguru, Tavpritesh Sethi

lncRNA Mediated Hijacking of T-cell Hypoxia Response Pathway by Mycobacterium Tuberculosis Predicts Latent to Active Progression in Humans

Cytosolic functions of long non-coding RNAs (lncRNAs), including mRNA translation masking and sponging, are major regulators of biological pathways. Formation of T cell-bounded hypoxic granuloma is a host immune defense for containing infected Mtb-macrophages. Our study exploits the mechanistic pathway of Mtb-induced HIF1A silencing by the antisense lncRNA-HIF1A-AS2 in T cells. Computational analysis of in-vitro T-cell stimulation assays in progressors (n=119) versus latent (n=221) tuberculosis patients revealed the role of lncRNA-mediated disruption of hypoxia adaptation pathways in progressors. We found 291 upregulated and 227 downregulated lncRNAs that were correlated at mRNA level with HIF1A and HILPDA, which are major players in hypoxia response. We also report novel lncRNA-AC010655 (AC010655.4 and AC010655.2) in cis with HILPDA, both of which contain binding sites for the BARX2 transcription factor, thus indicating a mechanistic role. Detailed comparison of infection with antigenic stimulation showed a non-random enrichment of lncRNAs in the cytoplasmic fraction of the cell in progressors. The lack of this pattern in non-progressors indicates the hijacking of the lncRNA dynamics by Mtb. The in-vitro manifestation of this response in the absence of granuloma indicates pre-programmed host-pathogen interaction between T-cells and Mtb regulated through lncRNAs, thus tipping this balance towards progression or containment of Mtb. Finally, we trained multiple machine learning classifiers for reliable prediction of latent-to-active progression in patients, yielding a model to guide aggressive treatment.

Read more

Jyotsana Mehra, Vikram Kumar, Priyansh Srivastava, Tavpritesh Sethi

RNA secondary structure profiling in zebrafish reveals unique regulatory features

RNA is known to play diverse roles in gene regulation. The clues for this regulatory function of RNA are embedded in its ability to fold into intricate secondary and tertiary structures. Results: We report the transcriptome-wide RNA secondary structure in zebrafish at single nucleotide resolution using Parallel Analysis of RNA Structure (PARS). This study provides the secondary structure map of zebrafish coding and non-coding RNAs. The single nucleotide pairing probabilities of 54,083 distinct transcripts in the zebrafish genome were documented. We identified RNA secondary structural features embedded in functional units of zebrafish mRNAs. Translation start and stop sites were demarcated by weak structural signals. The coding regions were characterized by the three-nucleotide periodicity of secondary structure and displayed a codon base-specific structural constraint. The splice sites of transcripts were also delineated by distinct signature signals. Relatively higher structural signals were observed at 3' Untranslated Regions (UTRs) compared to Coding DNA Sequence (CDS) and 5' UTRs. The 3' ends of transcripts were also marked by unique structure signals. Secondary structural signals in long non-coding RNAs were also explored to better understand their molecular function. Conclusions: Our study presents the first PARS-enabled transcriptome-wide secondary structure map of zebrafish, which documents pairing probability of RNA at single nucleotide precision. Our findings open avenues for exploring structural features in zebrafish RNAs and their influence on gene expression.
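As a rough illustration of how PARS data yield a per-nucleotide structure signal: PARS compares reads from RNase V1 digestion (which cleaves paired bases) with reads from S1 nuclease digestion (which cleaves unpaired bases), and the score is commonly taken as the log2 ratio of the two. The sketch below assumes that definition. The pseudocount, the function name `pars_scores`, and the read counts are hypothetical, and this is not the pipeline used in the study.

```python
import math

def pars_scores(v1_reads, s1_reads, pseudocount=1.0):
    """Per-nucleotide log2 ratio of V1 (paired) to S1 (unpaired) read counts."""
    if len(v1_reads) != len(s1_reads):
        raise ValueError("read-count tracks must have the same length")
    # The pseudocount (an assumption here) avoids division by zero.
    return [math.log2((v + pseudocount) / (s + pseudocount))
            for v, s in zip(v1_reads, s1_reads)]

# Hypothetical read counts for four consecutive nucleotides.
v1 = [7, 3, 0, 1]
s1 = [0, 3, 7, 3]
print(pars_scores(v1, s1))  # [3.0, 0.0, -3.0, -1.0]
```

Positive scores suggest a paired nucleotide and negative scores an unpaired one. Averaging such scores by codon position is one way a three-nucleotide periodicity could be examined.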

Read more

Kriti Kaushik, Ambily Sivadas, Shamsudheen K Vellarikkal, Ankit Verma, Rijith Jayarajan, Satyaprakash Pandey S, Tavpritesh Sethi, Souvik Maiti, Vinod Scaria, Sridhar Sivasubbu

Exhaled breath condensate metabolome clusters for endotype discovery in asthma

The simplest definition of a disease is based on symptoms and the best definition of a disease is based on cause. Asthma is variously defined as a disorder of recurrent breathlessness and wheezing [1] and as a complex chronic inflammatory airway disease [2]. It is now mostly agreed upon that asthma is a heterogeneous clinical syndrome, which lacks a singular pathophysiological explanation. Discovery of asthma endotypes (specific disease phenotype clusters with a specific biological mechanism [3]) is a critical step towards personalized therapy. The discovery of such endotypes may proceed top down, from clinical phenotype to molecular signatures, or bottom up, from molecular signatures to clinical phenotypes. Studies reflecting the airway milieu, such as exhaled breath condensate (EBC) composition, appear to be a good place to start for a bottom-up approach. A problem with EBC is that it is an unknown dilution of the airway lining fluid and while various suggestions have been made for normalization, none are reliable [4]. We previously reported the presence of a characteristic trident peak signature in nuclear magnetic resonance (NMR) of EBC found through visual inspection of the spectra. This peak signature at 7 parts per million (ppm), which was shown to be attributable to the concentration of ammonium ions in the airway milieu, was absent in a majority of asthmatics while being present in healthy controls [5]. Many other informative patterns may exist in the NMR spectra but these are not obvious to the naked eye. Here, we considered the possibility that NMR signatures of EBC, taken together as a whole, rather than broken down into individual metabolites, could serve as a fingerprint for endotypes of asthma. While there have been initial studies about the local metabolomic patterns in the airway that could reflect the disease pathogenesis, these have been focused on identifying metabolites and comparing them [6,7,8].
Given the difficulties in compensating for variable dilutions and the limitations in accurately identifying all metabolites from mixed spectral signatures, we considered the possibility of directly using the global spectral pattern. This has the advantage of internal relative referencing of all peaks within a single spectrum, minimizing the impact of dilution. However, this has the disadvantage of creating a high-dimension dataset with likely strong internal correlations, requiring newer forms of statistical and computational analyses. Here, we show for the first time how global spectral signatures can be used to yield not only classifiers between asthma and healthy subjects, but also to discover clinically relevant metabolome clusters within asthma.
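The point about internal relative referencing can be made concrete with a small sketch. It assumes total-intensity normalization, which is one common way to express each peak relative to the whole spectrum. The source does not state which normalization was used, and the function name and intensities below are hypothetical.

```python
def normalize_spectrum(intensities):
    """Scale a spectrum so its intensities sum to 1 (relative referencing)."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("spectrum must have positive total intensity")
    return [x / total for x in intensities]

# Hypothetical binned intensities, and the same sample diluted 4-fold.
spectrum = [2.0, 6.0, 12.0]
diluted = [x * 0.25 for x in spectrum]

a = normalize_spectrum(spectrum)
b = normalize_spectrum(diluted)
# The normalized profiles agree, so the unknown dilution factor cancels out.
print(all(abs(p - q) < 1e-12 for p, q in zip(a, b)))
```

Because every peak is divided by the same total, a uniform dilution leaves the normalized pattern unchanged. The resulting features sum to a constant and are therefore correlated with each other, consistent with the strong internal correlations noted above.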

Read more

Anirban Sinha, Koundinya Desiraju, Kunal Aggarwal, Rintu Kutum, Siddhartha Roy, Rakesh Lodha, S. K. Kabra, Balaram Ghosh, Tavpritesh Sethi, Anurag Agrawal

Multifaceted remodeling by vitamin C boosts sensitivity of Mycobacterium tuberculosis subpopulations to combination treatment by anti-tubercular drugs

Bacterial dormancy is a major impediment to the eradication of tuberculosis (TB), because currently used drugs primarily target actively replicating bacteria. Therefore, decoding of the critical survival pathways in dormant tubercle bacilli is a research priority to formulate new approaches for killing these bacteria. Employing a network-based gene expression analysis approach, we demonstrate that redox active vitamin C (vit C) triggers a multifaceted and robust adaptation response in Mycobacterium tuberculosis (Mtb) involving ~67% of the genome. Vit C-adapted bacteria display well-described features of dormancy, including growth stasis and progression to a viable but non-culturable (VBNC) state, loss of acid-fastness and reduction in length, dissipation of reductive stress through triglyceride (TAG) accumulation, protective response to oxidative stress, and tolerance to first line TB drugs. VBNC bacteria are reactivatable upon removal of vit C and they recover drug susceptibility properties. Vit C synergizes with pyrazinamide, a unique TB drug with sterilizing activity, to kill dormant and replicating bacteria, negating any tolerance to rifampicin and isoniazid in combination treatment in both in-vitro and intracellular infection models. Finally, the vit C multi-stress redox models described here also offer a unique opportunity for concurrent screening of compounds/combinations active against heterogeneous subpopulations of Mtb. These findings suggest a novel strategy of vit C adjunctive therapy by modulating bacterial physiology for enhanced efficacy of combination chemotherapy with existing drugs, and also possible synergies to guide new therapeutic combinations towards accelerating TB treatment.

Read more

Kriti Sikri, Priyanka Duggal, Chanchal Kumar, Sakshi Dhingra Batra, Atul Vashist, Ashima Bhaskar, Kritika Tripathi, Tavpritesh Sethi, Amit Singh, Jaya Sivaswami Tyagi

Establishment of Elevated Serum Levels of IL-10, IL-8 and TNF-β as Potential Peripheral Blood Biomarkers in Tubercular Lymphadenitis: A Prospective Observational Cohort Study

Tubercular lymphadenitis (TL) is the most common form of extra-pulmonary tuberculosis (TB), constituting about 15–20% of all TB cases. The currently available diagnostic modalities for TL are invasive, require a high index of suspicion, and have limited accuracy. We hypothesized that TL would have a distinct cytokine signature that would distinguish it from pulmonary TB (PTB), peripheral tubercular lymphadenopathy (LNTB), healthy controls (HC), other lymphadenopathies (LAP) and cancerous LAP. To assess this, twelve cytokines (Tumor Necrosis Factor (TNF)-α, Interferon (IFN)-γ, Interleukin (IL)-2, IL-12, IL-18, IL-1β, IL-10, IL-6, IL-4, IL-1 receptor antagonist (IL-1Ra), IL-8 and TNF-β), which have a role in the pathogenesis of tuberculosis, were tested as potential peripheral blood biomarkers to aid the diagnosis of TL when routine investigations prove to be of limited value.

Read more

Abhimanyu, Mridula Bose, Mandira Varma-Basil, Ashima Jain, Tavpritesh Sethi, Pradeep Kumar Tiwari, Anurag Agrawal, Jayant Nagesh Banavaliker, Kumar Tapas Bhowmick

Fractional Exhaled Nitric Oxide (FENO) in Children with Acute Exacerbation of Asthma

Objective: To determine whether fractional exhaled nitric oxide (FENO) has utility as a diagnostic or predictive marker in acute exacerbations of asthma in children. Design: Analysis of data collected in a pediatric asthma cohort. Setting: Pediatric Chest Clinic of a tertiary care hospital. Methods: A cohort of children with asthma was followed up every 3 months in addition to any acute exacerbation visits. Pulmonary function tests (PFT) and FENO were obtained at all visits. We compared the FENO values during acute exacerbations with those at baseline and those during the follow up. Results: 243 asthmatic children were enrolled from August 2009 to December 2011 [mean (SD) follow up: 434 (227) days]. FENO during acute exacerbations was not different from FENO during follow up; however, it was significantly higher than personal best FENO during follow up (P < 0.0001). FENO during acute exacerbation did not correlate with the severity of acute exacerbation (P=0.29). The receiver operating characteristics curve for FENO as a marker for acute exacerbation had an area under the curve of 0.59. A cut-off of 20 ppb had poor sensitivity (44%) and specificity (68.7%) for acute exacerbation. Conclusions: FENO levels during acute exacerbation increase from their personal best levels. However, no particular cut-off could be identified that could help in either diagnosing acute exacerbation or predicting its severity.
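For readers unfamiliar with how a cut-off translates into sensitivity and specificity, the following sketch shows the arithmetic. The FENO values and labels are invented for illustration and are not study data. Treating a value at or above the cut-off as a positive call is also an assumption.

```python
def sensitivity_specificity(values, is_exacerbation, cutoff):
    """Sensitivity and specificity when value >= cutoff is called positive."""
    pairs = list(zip(values, is_exacerbation))
    tp = sum(1 for v, y in pairs if y and v >= cutoff)       # true positives
    fn = sum(1 for v, y in pairs if y and v < cutoff)        # false negatives
    tn = sum(1 for v, y in pairs if not y and v < cutoff)    # true negatives
    fp = sum(1 for v, y in pairs if not y and v >= cutoff)   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Invented FENO readings (ppb) and whether the visit was an exacerbation.
feno = [12, 25, 30, 18, 15, 22, 10, 35]
exacerbation = [True, True, True, True, False, False, False, False]

sens, spec = sensitivity_specificity(feno, exacerbation, cutoff=20)
print(sens, spec)  # 0.5 0.5
```

Sweeping the cut-off over all observed values and plotting sensitivity against 1 minus specificity traces the receiver operating characteristic curve that the abstract summarizes by its area.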

Read more

Dinesh Raj, Rakesh Lodha, Aparna Mukherjee, Tavpritesh Sethi, Anurag Agrawal, Sushil Kumar Kabra

Exosome Enclosed MicroRNAs In Exhaled Breath Hold Potential For Biomarker Discovery in Pulmonary Diseases

MicroRNAs (miRNAs) are small noncoding RNAs of 22 to 25 nucleotides in length that act through an RNA-induced silencing complex to posttranscriptionally regulate mRNAs that contain complementary sequences. Highly stable circulating miRNAs are found in biological fluids and are potential biomarkers [1, 2]. These are often enclosed in small secretory membrane vesicles called exosomes, which permit transfer of miRNA between cells [3]. Exhaled breath condensate (EBC), which can be obtained noninvasively and conveniently and is representative of the airway lining fluid, could be an ideal substrate for discovery of pulmonary disease biomarkers [4, 5, 6]. For the first time, we report that miRNAs can be reliably detected in EBC by using quantitative PCR analysis and are suitable as biomarkers.

Read more

Anirban Sinha, Amit Kumar Yadav, Samarpana Chakraborty, S. K. Kabra, R. Lodha, Manish Kumar, Ankur Kulshreshtra, Tavpritesh Sethi, Rajesh Pandey, Gaurav Malik, Saurabh Laddha, Arijit Mukhopadhyay, Debasis Dash, Balaram Ghosh, Anurag Agrawal

Computational classification of mitochondrial shapes reflects stress and redox state

Dynamic variations in mitochondrial shape have been related to function. However, tools to automatically classify and enumerate mitochondrial shapes are lacking, as are systematic studies exploring the relationship of such shapes to mitochondrial stress. Here we show that during increased generation of mitochondrial reactive oxygen species (mtROS), mitochondria change their shape from tubular to donut or blob forms, which can be computationally quantified. Imaging of cells treated with rotenone or antimycin showed time- and dose-dependent conversion of tubular forms to donut-shaped mitochondria followed by appearance of blob forms. Time-lapse images showed reversible transitions from tubular to donut shapes and unidirectional transitions between donut and blob shapes. Blobs were the predominant sources of mtROS and appeared to be related to mitochondrial-calcium influx. Mitochondrial shape change could be prevented by either pretreatment with antioxidants like N-acetyl cysteine or inhibition of the mitochondrial calcium uniporter. This work represents a novel approach towards relating mitochondrial shape to function, through integration of cellular markers and a novel shape classification algorithm.

Read more

Tanveer Ahmad, Kunal Aggarwal, Bijay Ranjan Pattnaik, Shravani Mukherjee, Tavpritesh Sethi

Presence of strong association of the major histocompatibility complex (MHC) class I allele HLA-A*26:01 with idiopathic hypoparathyroidism

Read more

Samir K Brahmachari, Tavpritesh Sethi, Amit Kumar Mandal, Arijit Mukhopadhyay, Rajni Rani

Metabolomic signatures in nuclear magnetic resonance spectra of exhaled breath condensate identify asthma

Exhaled breath condensate (EBC) holds promise as a noninvasive method of collecting airway-lining fluid, although at an unknown dilution [1]. While metabolomic studies of EBC using nuclear magnetic resonance (NMR) spectroscopy have previously shown promise in asthma diagnosis and subtyping [2, 3], a study recently published in the European Respiratory Journal failed to find a usable NMR signature in EBC collections from disposable systems [4]. This led the authors to conclude that NMR metabolomics lacks sufficient sensitivity for metabolic fingerprinting of EBC. Interestingly, they were able to obtain high quality results from the same samples with mass spectroscopy, which they recommended for future use. As this is a nascent and technically complex field, we present our very different early experiences, which suggest that reproducible, valid and useful NMR metabolomic fingerprinting of EBC is indeed possible. Specifically, we found that the presence or absence of a trident peak at 7 ppm during NMR spectroscopy reliably distinguished between EBC samples collected from normal and asthmatic subjects, respectively. This peak probably represents ammonium ion, loss of which in asthma is consistent with reduced expression of glutaminase, an enzyme that converts glutamine to glutamate and ammonia [5].

Read more

Anirban Sinha, Veda Krishnan, Tavpritesh Sethi, Siddhanta Roy, Balaram Ghosh, Rakesh Lodha, Sushil Kabra, Anurag Agrawal

Ayurgenomics: A New Way of Threading Molecular Variability for Stratified Medicine

Read more

Tav Pritesh Sethi, Bhavana Prasher, Mitali Mukerji

Structure and function of the tuberculous lung: Considerations for inhaled therapies

Inhaled therapies for pulmonary tuberculosis are in development and appear promising at first look. A fundamental premise of such therapy is efficient delivery of drug at high concentrations to the active disease site, while minimizing systemic delivery. This assumes that inhaled drug will actually reach the diseased lung, which while intuitive for healthy lungs, may be untrue for diseased lungs with abnormal structure or function. This review discusses the structural and functional aspects of respiratory physiology that are likely to impact local drug delivery and presents the available evidence on how this pertains to tuberculous lungs.

Read more

Tavpritesh Sethi, Anurag Agrawal

EGLN1 involvement in high-altitude adaptation revealed through genetic analysis of extreme constitution types defined in Ayurveda

It is being realized that identification of subgroups within normal controls corresponding to contrasting disease susceptibility is likely to lead to more effective predictive marker discovery. We have previously used the Ayurvedic concept of Prakriti, which relates to phenotypic differences in normal individuals, including response to external environment as well as susceptibility to diseases, to explore molecular differences between three contrasting Prakriti types: Vata, Pitta, and Kapha. EGLN1 was one among 251 differentially expressed genes between the Prakriti types. In the present study, we report a link between high-altitude adaptation and common variations rs479200 (C/T) and rs480902 (T/C) in the EGLN1 gene. Furthermore, the TT genotype of rs479200, which was more frequent in Kapha types and correlated with higher expression of EGLN1, was associated with patients suffering from high-altitude pulmonary edema, whereas it was present at a significantly lower frequency in Pitta and nearly absent in natives of high altitude. Analysis of Human Genome Diversity Panel-Centre d’Etude du Polymorphisme Humain (HGDP-CEPH) and Indian Genome Variation Consortium panels showed that disparate genetic lineages at high altitudes share the same ancestral allele (T) of rs480902 that is overrepresented in Pitta and positively correlated with altitude globally (P < 0.001), including in India. Thus, EGLN1 polymorphisms are associated with high-altitude adaptation, and a genotype rare in highlanders but overrepresented in a subgroup of normal lowlanders discernable by Ayurveda may confer increased risk for high-altitude pulmonary edema.
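Comparisons of genotype and allele frequencies between groups, as described above, rest on simple counting. The sketch below shows that arithmetic for a biallelic C/T variant. The function name and the genotype counts are hypothetical and do not come from the study.

```python
def allele_frequency(n_cc, n_ct, n_tt):
    """Frequency of the T allele from biallelic C/T genotype counts."""
    total_alleles = 2 * (n_cc + n_ct + n_tt)  # two alleles per individual
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    # Heterozygotes carry one T allele, TT homozygotes carry two.
    return (n_ct + 2 * n_tt) / total_alleles

# Hypothetical genotype counts in one group of 100 individuals.
print(allele_frequency(n_cc=20, n_ct=50, n_tt=30))  # 0.55
```

Computing this per group (for example per constitution type or per altitude band) gives the frequencies that an association test would then compare.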

Read more

Shilpi Aggarwal, Tav P. Sethi, A K Mandal, A Mukhopadhyay

Whole genome expression and biochemical correlates of extreme constitutional types defined in Ayurveda

Ayurveda is an ancient system of personalized medicine documented and practiced in India since 1500 B.C. According to this system, an individual's basic constitution to a large extent determines predisposition and prognosis to diseases as well as therapy and life-style regime. Ayurveda describes seven broad constitution types (Prakritis), each with a varying degree of predisposition to different diseases. Amongst these, the three most contrasting types, Vata, Pitta and Kapha, are the most vulnerable to diseases. In the realm of modern predictive medicine, efforts are being directed towards capturing disease phenotypes with greater precision for successful identification of markers for prospective disease conditions. In this study, we explore whether the different constitution types as described in Ayurveda have molecular correlates.

Read more

Bhavana Prasher, Sapna Negi, Shilpi Aggarwal, Amit K Mandal, Tavpritesh Sethi, Shailaja R Deshmukh, Sudha G Purohit, Shantanu Sengupta, Sangeeta Khanna, Farhan Mohammad, Gaurav Garg, Samir K Brahmachari, Indian Genome Variation Consortium, Mitali Mukerji

Genetic landscape of the people of India: a canvas for disease gene exploration

Analyses of frequency profiles of markers on disease or drug-response related genes in diverse populations are important for the dissection of common diseases. We report the results of analyses of data on 405 SNPs from 75 such genes and a 5.2 Mb chromosome 22 genomic region in 1871 individuals from 55 diverse endogamous Indian populations. These include 32 large (>10 million individuals) and 23 isolated populations, representing a large fraction of the people of India. We observe high levels of genetic divergence between groups of populations that cluster largely on the basis of ethnicity and language. Indian populations not only overlap with the diversity of HapMap populations, but also contain population groups that are genetically distinct. These data and results are useful for addressing stratification and study design issues in complex traits, especially for heterogeneous populations.
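One widely used way to quantify genetic divergence between populations from SNP allele frequencies is Wright's fixation index, FST = (HT - HS) / HT. The abstract does not state which divergence measure was used, so the sketch below is only an illustration for a single biallelic SNP and two equally weighted populations. The function name and the frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Wright's F_ST for one biallelic SNP and two equally weighted populations."""
    # Mean expected heterozygosity within the two populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # SNP is fixed for the same allele in both populations
    return (h_t - h_s) / h_t

# Hypothetical allele frequencies of one SNP in two populations.
print(round(fst_two_populations(0.2, 0.8), 2))  # 0.36
print(fst_two_populations(0.5, 0.5))            # 0.0
```

Values near 0 indicate similar allele frequencies, and larger values indicate stronger divergence. In practice such a statistic would be combined across many SNPs and weighted by sample size.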

Read more

Samir K Brahmachari, Tavpritesh Sethi, Amit Kumar Mandal, Arijit Mukhopadhyay