Extract potential Corona drug information from literature
Effectiveness of drugs being developed and tried to treat COVID-19 patients.
- TOPIC: Effectiveness of drugs being developed and tried to treat COVID-19 patients.
- Introduction
- Code development for insights
- Load and Clean Data
- Pre-filter by COVID-19
- Apply Scispacy Model
- Match relevant tokens, e.g. COVID-19, trial and usage indicators
- Example Articles that talk about COVID-19
- Extract all drugs and therapeutics from abstracts
- Simple Concordance Visualiser
- Organise matches by Drugs/Therapeutics
- Molecular Structure
- Conclusion
TOPIC: Effectiveness of drugs being developed and tried to treat COVID-19 patients.
Introduction
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
Dataset Description
The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This gives the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide.
References:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13. Retrieved from https://pages.semanticscholar.org/coronavirus-research. doi:10.5281/zenodo.3715506
In this work we will make use of NLP, text mining, dataframe processing and visualization resources.
All document IDs are unique, so there is nothing to tidy up there. However, some documents appear to be missing titles, abstracts and possibly body text.
We will be working with abstracts. They provide an appropriate level of detail for the question at hand. Thus, we will drop all documents that do not have an abstract.
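As a minimal sketch of the checks and the filtering step described above, assuming the papers are held in a pandas DataFrame: the column names `paper_id`, `title` and `abstract` and the toy rows are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Toy stand-in for the loaded papers; column names and rows are assumptions.
papers = pd.DataFrame({
    "paper_id": ["a1", "b2", "c3", "d4"],
    "title": ["Title A", None, "Title C", "Title D"],
    "abstract": [
        "Chloroquine inhibits SARS-CoV-2 in vitro.",
        None,
        "",
        "Heparin binds the spike protein.",
    ],
})

# Confirm that document IDs are unique and count missing values per column.
ids_unique = papers["paper_id"].is_unique
missing_counts = papers.isna().sum()

# Treat empty or whitespace-only abstracts as missing, then keep the rest.
abstracts = papers["abstract"].fillna("").str.strip()
with_abstract = papers[abstracts != ""].reset_index(drop=True)

print("IDs unique:", ids_unique)
print(missing_counts)
print("Documents with an abstract:", len(with_abstract))
```

Filtering on the stripped text rather than calling `dropna` alone also removes abstracts that are present but empty, which a plain null check would keep.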
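import en_ner_bc5cdr_md
from spacy import displacy

# Load the scispaCy BC5CDR model, which tags CHEMICAL and DISEASE entities
nlp = en_ner_bc5cdr_md.load()
# nlp = spacy.load('../input/scispacy-model/en_ner_bc5cdr_md-0.2.4/en_ner_bc5cdr_md/en_ner_bc5cdr_md-0.2.4')
# example_text is expected to hold an abstract defined earlier in the notebook
doc = nlp(example_text)
# Highlight colours per entity label (values must be valid CSS colours)
colors = {
    'CHEMICAL': 'lightpink',
    'DISEASE': 'orange',
}
displacy.render(doc, style='ent', options={
    'colors': colors
})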
Organise matches by Drugs/Therapeutics
Above, we compiled a list of drugs/therapeutics that are relevant in the context of COVID-19. Now, we can dive deeper into the contexts these drugs appear in.
To this end, we match words that indicate the context of the drug mention:
- drug is in an idea stage (e.g. 'darunavir could be useful against COVID-19')
- drug is in a trial stage (e.g. 'lopinavir is currently being trialled')
- drug is in usage stage (e.g. 'patients are being treated with ritonavir')
These 'indicator' words are marked as additional entities in context.
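The three stages listed above can be sketched with plain regular expressions standing in for the entity matching. The indicator word lists and the helper `tag_stages` are hypothetical and chosen only for illustration; the notebook's actual vocabularies are not reproduced here.

```python
import re

# Hypothetical indicator vocabularies, one pattern per stage (assumptions).
STAGE_PATTERNS = {
    "IDEA": re.compile(r"\b(could|might|may|potential|candidate)\b", re.IGNORECASE),
    "TRIAL": re.compile(r"\b(trial|trials|trialled|randomi[sz]ed)\b", re.IGNORECASE),
    "USAGE": re.compile(r"\b(treated|received|administered|administration)\b", re.IGNORECASE),
}


def tag_stages(sentence, drug):
    """Return the sorted stage labels whose indicator words co-occur with the drug."""
    if drug.lower() not in sentence.lower():
        return []
    return sorted(
        label for label, pattern in STAGE_PATTERNS.items() if pattern.search(sentence)
    )


examples = [
    ("darunavir could be useful against COVID-19", "darunavir"),
    ("lopinavir is currently being trialled", "lopinavir"),
    ("patients are being treated with ritonavir", "ritonavir"),
]
for sentence, drug in examples:
    print(drug, tag_stages(sentence, drug))
```

A sentence can carry more than one label, which is why the helper returns a list rather than a single stage.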
We can clearly see the compounds separate into groups. The most important groups are antiretroviral protease inhibitors, corticosteroids, polyene antibiotics, glycosaminoglycans (heparan sulfate), proteases and aminoquinoline derivatives. Note: the class assignments were checked against the PubChem database.
Public references (if in doubt about which chemical class a compound belongs to, consult the public database PubChem):
https://pubchem.ncbi.nlm.nih.gov
https://pubchem.ncbi.nlm.nih.gov/compound/392622
https://pubchem.ncbi.nlm.nih.gov/compound/5755
https://pubchem.ncbi.nlm.nih.gov/compound/213039
https://pubchem.ncbi.nlm.nih.gov/compound/70678539
https://pubchem.ncbi.nlm.nih.gov/compound/5479537
https://pubchem.ncbi.nlm.nih.gov/compound/Amphotericin%20B
Printing and comparing molecular structures
Now we will print the compounds mentioned above and compare them with other drugs with similar molecular structures in the public ChEMBL database. You can find more information about the database at https://www.ebi.ac.uk/chembl/
Note: To zoom in or rotate the molecule, click, scroll and move the mouse inside the molecule picture. Note 2: We will select only a few compounds.
As a result we have a group of molecules structurally related to the researched compound.
Such a result may be useful in further research in the search for potential new drugs.
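To make the idea of "structurally related" concrete: similarity searches of this kind are commonly ranked with the Tanimoto coefficient over structural fingerprints. The sketch below assumes that measure; whether the ChEMBL query used in this notebook applies exactly this scoring is an assumption. The fingerprints are toy sets of bit positions, not real molecular fingerprints, and the names `query`, `library` and `most_similar` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy fingerprints (illustrative bit positions only, not real chemistry).
query = {1, 4, 7, 9, 12}
library = {
    "candidate_a": {1, 4, 7, 9, 13},
    "candidate_b": {1, 4, 20, 21},
    "candidate_c": {1, 4, 7, 9, 12},
}

for name, score in most_similar(query, library):
    print(f"{name}: {score:.2f}")
```

Raising `threshold` narrows the result to closer analogues, while lowering it returns a broader and noisier set of candidates.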
Conclusion
In this notebook we present a technique to analyse the documents provided in search of relevant information about drugs being developed.
A method was developed to find the relevant files among those provided in the challenge.
Subsequently, a routine was developed to find the words of interest, highlight them in the text and evaluate the context in which they appear.
All of this allows the user to quickly and efficiently search various files of interest.
The structural correlation among the most frequently mentioned compounds was also evaluated through clustering.
Finally, the algorithm searches the public ChEMBL database for the chemical structure of the studied compound, as well as for similar structures available in the database, for further research by the user.
This work led to the following findings:
- Several articles cite therapeutics with the use of drugs from different classes, such as antiretroviral protease inhibitors, corticosteroids, polyene antibiotics, glycosaminoglycans heparan sulfate, proteases and aminoquinoline derivatives.
- There are 23 ongoing clinical trials in China. Chloroquine seems to be effective in limiting the replication of SARS-CoV-2 (the virus causing COVID-19) in vitro. (https://www.sciencedirect.com/science/article/pii/S0883944120303907?via%3Dihub)
- Chloroquine was a highly effective treatment for falciparum malaria in The Gambia. High-grade resistance will soon preclude the use of chloroquine in severe malaria. (https://www.thelancet.com/journals/lancet/article/PII0140-6736(92)91645-O/fulltext)
- (Chymo)trypsin-like serine fold proteases belong to the serine/cysteine proteases found in eukaryotes, prokaryotes, and viruses. Their catalytic activity is carried out by a triad of amino acids: a nucleophile, a base, and an acid. For this superfamily of proteases, the authors propose the existence of a universal 3D structure comprising 11 amino acids near the catalytic nucleophile and base, the Nucleophile-Base Catalytic Zone (NBCZone). (https://www.sciencedirect.com/science/article/pii/S0141813019386854?via%3Dihub)
- The structure models of two severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) proteases, coronavirus endopeptidase C30 (CEP_C30) and papain like viral protease (PLVP), were built by homology modeling. Ritonavir, lopinavir and darunavir were then docked to the models, respectively, followed by energy minimization of the protease-drug complexes. In the simulations, ritonavir can bind to CEP_C30 most suitably, and induce significant conformation changes of CEP_C30; lopinavir can also bind to CEP_C30 suitably, and induce significant conformation changes of CEP_C30; darunavir can bind to PLVP suitably with slight conformation changes of PLVP. It is suggested that the therapeutic effect of ritonavir and lopinavir on COVID-19 may be mainly due to their inhibitory effect on CEP_C30, while ritonavir may have stronger efficacy; the inhibitory effect of darunavir on SARS-CoV-2 and its potential therapeutic effect may be mainly due to its inhibitory effect on PLVP. (https://www.biorxiv.org/content/10.1101/2020.01.31.929695v2)
- A total of 26 patients received intravenous methylprednisolone at a dosage of 1-2 mg/kg/d for 5-7 days, while the remaining patients did not. The average number of days for body temperature to return to the normal range was significantly shorter in patients given methylprednisolone than in those who were not (2.06±0.28 vs. 5.29±0.70, P=0.010). The patients given methylprednisolone had a faster improvement of SpO2, while patients without methylprednisolone had a significantly longer interval of using supplemental oxygen therapy (8.2 days [IQR 7.0-10.3] vs. 13.5 days [IQR 10.3-16]; P<0.001). In terms of chest CT, the absorption degree of the focus was significantly better in patients given methylprednisolone. The authors' data indicate that in patients with severe COVID-19 pneumonia, early, low-dose and short-term application of corticosteroid was associated with a faster improvement of clinical symptoms and absorption of lung focus. (https://www.medrxiv.org/content/10.1101/2020.03.06.20032342v1)
- The interaction between the SARS-CoV-2 Spike S1 protein receptor binding domain (SARS-CoV-2 S1 RBD) and heparin was studied. The data demonstrate an interaction between the recombinant surface receptor binding domain and the polysaccharide. This has implications for the rapid development of a first-line therapeutic by repurposing heparin and for next-generation, tailor-made, GAG-based antivirals. (https://www.biorxiv.org/content/10.1101/2020.02.29.971093v1)
The results show that the technique can be used to gain important insights into drugs and therapeutics related to the coronavirus pandemic quickly, without having to read thousands of full papers.
Pros and cons
The main advantage of the technique is the ease and speed with which the required information is obtained.
The main drawback is that, depending on the number of files to be evaluated, the algorithm can take a while to run.