Justin Boylan-Toomey

Hello!

Currently I lead the Machine Learning team at the Wellcome Trust, where we develop machine learning models and metrics to support Wellcome in funding new discoveries in life, health and wellbeing. We also support the Wellcome Collection, a museum that explores the connections between medicine, life and art.

Areas I have worked on at Wellcome (working with many fantastic colleagues, teams and external collaborators) include:

  • Leadership: Since joining Wellcome in 2022 I've implemented development standards, a robust prioritisation process and a technical roadmap, increasing the delivery of high-impact machine learning products and raising the status of data science within the organisation. I've also improved our data and MLOps infrastructure while setting the direction of the team's technical work. The team's work includes automatic topic modelling of research publications using BERTopic and the Llama large language model, the development of WellcomeBertMesh (a transformer model for tagging texts with MeSH terms), using text content and network dynamics to predict translational potential, and the development of network- and citation-based metrics. You can follow our team's work on the Wellcome Data blog.

  • Wellcome Academic Graph: Designed, modelled and developed the Wellcome Academic Graph, a heterogeneous academic graph stored in Neo4j. It captures over 2 billion relationships between 200 million academic entities, enabling us to apply and develop network-based metrics and geometric machine learning.

  • Vector Database: Created a vector store, a foundational part of our new data infrastructure. I built a large-scale data pipeline with multi-GPU parallelisation, using SciBERT and NVIDIA RAPIDS to efficiently embed millions of publication and grant texts for storage in our Milvus vector database.

Other projects I have worked on are:

  • Document Classification: I got my start in data science at BP, where I developed a document clustering pipeline using Apache Tika, TF-IDF, K-Means and keyword extraction to organise and classify their unstructured data stores. This formed the original part of the Document Neighbourhood product, unlocking legacy data with an estimated value of $1 billion. Later I had the opportunity to lead the development of a project using custom embeddings, 1D CNN neural networks for document classification and autoencoders for novel document detection, deployed at scale on Azure. For my master's dissertation I also explored multimodal fusion methods for combining text content and page images to improve the classification of legacy documents.

  • Document Extraction: I have done extensive work developing and training named entity recognition (NER) models with domain-specific embeddings to extract industry-specific terms, as well as developing a logistic regression model to assess the quality of different options for the OCR of scanned documents, maps, logs and images.

  • Data Pipelines: I have developed a number of large-scale text extraction and ETL pipelines on AWS and Azure, including leading the migration of BP's pipeline for third-party hydrocarbon production data from a proprietary language to PySpark, using test-driven development to deliver a maintainable pipeline with improved functionality.

  • Data for Good: I regularly volunteer with DataKind, a charity that helps local government and social enterprises use data science to leverage and understand their data. This has included geospatial analysis to help Material Focus improve electrical recycling rates, statistical analysis to help improve inclusion in schools, and statistical modelling and comment analysis with Transformers, HDBSCAN and BERTopic to understand drivers of enjoyment and engagement with activities on an online platform.

  • Search & Rescue: I spent three years volunteering as a Search and Rescue Technician with Lowland Rescue, searching for and providing assistance to high-risk missing people. I also volunteered as the IT Lead for London Search and Rescue, rolling out Office 365 across the organisation, and later as their Data Protection Officer, putting in place GDPR strategy and training.

My first graduate job was providing software support for geoscientific software, having graduated with a BSc in geology & petroleum geology from the University of Aberdeen. Later I gained an MSc in data science with distinction from Birkbeck, University of London; my dissertation investigated multimodal fusion approaches to combining textual and visual features for multi-page document classification using deep convolutional-LSTM neural networks.

In my spare time I enjoy dabbling in AI art with generative diffusion models, building Raspberry Pi maker projects, cooking, eating and mountain climbing.