On 12 and 13 September 2019, AI3SD hosted a training workshop and hackathon intended to help scientists build skills in Data Science, AI and Machine Learning techniques for Chemistry, and to provide some challenges on which to test those new skills.
Day 1 began with some informative presentations:
- Introduction & Information on Datasets and Challenges – Dr Samantha Kanza, Dr Nicola Knight, Professor Simon Coles, Dr Tim Rozday & Dr Nick Lynch
- Data Science Awareness – Steve Brewer
- Progressing from Basic to Advanced Machine Learning – Professor Mahesan Niranjan
The Challenges presented to the teams are detailed below. Full details of each challenge and the datasets provided can be found here.
Solubility Challenge: A model-building challenge to predict intrinsic aqueous solubilities using the available solubility datasets, enhanced with other datasets.
Data Mashup Challenge: A challenge to combine data from multiple different sources.
Chemical Safety Library Challenge: A challenge to work with the Pistoia Alliance’s Chemical Safety Library Dataset and enhance it with other data sources.
After this, teams were formed and challenges were chosen! We had four teams: three chose the solubility challenge and one chose the data mashup challenge.
The three solubility teams used a variety of methods. One team focused on using dimensionality reduction to select features that are good predictors of solubility. They used NCA to construct a pipeline and plan to evaluate a variety of methods from scikit-learn. The other two teams both used PCA. One began with decision tree analysis, then applied variance analysis and PCA for feature extraction, and followed this with AdaBoost regression pipelined with ANOVA and PCA feature analysis. The other used scikit-learn and RDKit to compute fingerprints for some initial machine learning, with random forest and PCA.
For the data mashup challenge, the fourth team worked on a Jupyter notebook that pulls spectra and physical data from a variety of data sources to describe a list of common impurities (such as solvents), with a view to letting users import their own dataset and view it alongside the provided spectra.
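To give a flavour of the kind of pipeline the solubility teams described, here is a minimal sketch that chains univariate feature selection, PCA and AdaBoost regression in scikit-learn. It is not any team's actual code: the data is synthetic (generated with `make_regression` as a stand-in for molecular descriptors and solubility values), the `build_pipeline` helper and all parameter values are illustrative assumptions, and scikit-learn's `f_regression` F-test is used here as a stand-in for the ANOVA step.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(k_best=20, n_components=5, random_state=0):
    """Hypothetical helper: scale -> F-test selection -> PCA -> AdaBoost."""
    return Pipeline([
        # Standardise features so PCA is not dominated by large-scale columns.
        ("scale", StandardScaler()),
        # Keep the k features with the highest univariate F-statistic
        # (assumed stand-in for the teams' ANOVA analysis).
        ("anova", SelectKBest(score_func=f_regression, k=k_best)),
        # Project the selected features onto a few principal components.
        ("pca", PCA(n_components=n_components)),
        # Boosted ensemble of shallow regression trees (scikit-learn default).
        ("model", AdaBoostRegressor(n_estimators=100, random_state=random_state)),
    ])


# Synthetic stand-in data: 300 "molecules" with 50 "descriptors" each.
# Real work would use descriptors or fingerprints from the challenge datasets.
X, y = make_regression(
    n_samples=300, n_features=50, n_informative=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = build_pipeline()
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
r2 = pipeline.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"Held-out R^2: {r2:.3f}")
```

Wrapping every step in a single `Pipeline` means the scaling, feature selection and PCA are fitted on the training split only, which avoids leaking information from the held-out molecules into the model.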
All of the teams were energetic and enthusiastic, and put together some very interesting work in a short space of time. We were very impressed with all of it. There had to be some winners, however, and two teams stood out for their innovative work. The results were:
- 1st Place: Team Underachievers who undertook the data mashup challenge!
- 2nd Place: The Insolubles who undertook the solubility challenge!