A Research Project by UCU Centre for Computational Biology: A study about employing phylogenetic tree shape statistics to resolve the underlying host population structure

A team of researchers from the Uganda Christian University Center for Computational Biology recently concluded and published a study on “employing phylogenetic tree shape statistics to resolve the underlying host population structure”.

The key researchers include the UCU Deputy Vice Chancellor for Academic Affairs, Rev. Dr. John Kitayimbwa, and the Executive Director of the Uganda Virus Research Institute, Dr. Pontiano Kaleebu, among others.

Below is an outline of the study:

Abstract
Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools
to reveal the population structure underlying an epidemic. Determining whether a
population is structured or not is useful in informing the type of phylogenetic methods
to be used in a given study. We employ tree statistics derived from phylogenetic trees
and machine learning classification techniques to reveal an underlying population
structure.


Results: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees, which
are: the number of cherries; Sackin, Colless and total cophenetic indices; ladder length;
maximum depth; maximum width; and width-to-depth ratio. Based on the estimated
tree statistics, we classify the simulated trees as from either a non-structured or a structured population using the decision tree (DT), K-nearest neighbor (KNN) and support
vector machine (SVM) classifiers. We incorporate the basic reproductive number (R0) in our tree
simulation procedure. Sensitivity analysis is done to investigate whether the classifiers
are robust to different choices of model parameters and to tree size. Cross-validated
results for area under the curve (AUC) for receiver operating characteristic (ROC) curves
yield mean values of over 0.9 for most of the classification models.
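
To make a few of the tree statistics named above concrete, here is a minimal Python sketch that computes the number of cherries, the Sackin index, the Colless index and the maximum depth of a small rooted binary tree. It is an illustration only and not the authors' code: the nested-tuple tree representation, the function name `tree_stats` and the two toy trees are assumptions made for this example, and depth is counted in edges from the root.

```python
def tree_stats(tree):
    """Shape statistics for a rooted binary tree.

    The tree is given as nested 2-tuples (an assumed toy representation):
    a leaf is any non-tuple value, an internal node is (left, right).
    """
    stats = {"cherries": 0, "sackin": 0, "colless": 0, "max_depth": 0}

    def visit(node, depth):
        # Returns the number of leaves below `node`.
        if not isinstance(node, tuple):
            # Sackin index: sum of leaf depths (edges from the root).
            stats["sackin"] += depth
            stats["max_depth"] = max(stats["max_depth"], depth)
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        # A cherry is an internal node whose two children are both leaves.
        if n_left == 1 and n_right == 1:
            stats["cherries"] += 1
        # Colless index: sum over internal nodes of |leaves left - leaves right|.
        stats["colless"] += abs(n_left - n_right)
        return n_left + n_right

    visit(tree, 0)
    return stats


# Two toy four-leaf trees: fully balanced and fully unbalanced ("caterpillar").
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(tree_stats(balanced))
print(tree_stats(caterpillar))
```

The balanced tree has two cherries and a Colless index of zero, while the caterpillar tree has one cherry and larger Sackin and Colless values. Differences of this kind are what make such statistics usable as features for a classifier. The recursive traversal is fine for small examples; very deep trees would need an iterative version.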


Conclusions: Our classification procedure distinguishes well between trees from
structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models
were more robust to changes in model parameters and tree size compared to KNN
and DT classifiers. Our classification procedure was applied to real-world data and the
structured population was revealed with high accuracy of 92.3% using the SVM-polynomial
classifier.
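
As an illustration of the two-sample Kolmogorov-Smirnov comparison mentioned in the conclusions, the sketch below computes the KS statistic, that is, the largest gap between the empirical distribution functions of two samples. It is a hedged, minimal example rather than the study's analysis: the function name `ks_statistic` is hypothetical, the input numbers are made-up toy values and not results from the paper, and the p-value computation is omitted.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: max |F_x(v) - F_y(v)| over all observed v."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both pointers past every observation <= v, so ties
        # across the two samples are handled together.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        # i / len(xs) and j / len(ys) are the empirical CDFs evaluated at v.
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


# Made-up toy values standing in for one tree statistic measured on trees
# from two kinds of populations (not data from the study).
sample_a = [1, 2, 3, 4]
sample_b = [3, 4, 5, 6]
print(ks_statistic(sample_a, sample_b))
```

A value near 0 means the two samples have similar distributions of the statistic, while a value near 1 means they barely overlap.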

Researchers: Hassan W. Kayondo, Alfred Ssekagiri, Grace Nabakooza, Nicholas Bbosa, Deogratius Ssemwanga, Pontiano Kaleebu, Samuel Mwalili, John M. Mango, Andrew J. Leigh Brown, Roberto A. Saenz, Ronald Galiwango and John M. Kitayimbwa