Equal Size kmeans

We were recently presented with a problem where the decision maker wanted to understand how their data would naturally group together. The classic technique of k-means clustering was a natural choice: it’s well known, computationally efficient, and implemented in base R via the kmeans() function. Our problem had a slight wrinkle: the decision maker wished to see the data grouped into clusters of (nearly) equal size. Now, a ‘true’ statistician would tell the client that the right thing to do from a theoretical perspective is to use the native k-means results, because some centers simply have more nearby points than others.

reticulate, virtualenv, and Python in Linux

Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn. reticulate is an R package that allows us to use Python modules from within RStudio. I recently found this functionality useful while trying to compare the results of different uplift models. Though I did have R’s uplift package producing Qini charts and metrics, I also wanted to see how things looked with Wayfair’s promising pylift package.

Introducing DeclareDesign, a Platform for Research Design

Research design consists of a set of choices about what research procedures to use: for example, how many subjects to interview, which questions to ask them, and what to do in the analysis phase with the resulting data. We do not have good tools for assessing whether the chosen procedures are good ones. DeclareDesign is an R package for learning about, implementing, and communicating research procedures, from data collection to data analysis.

April 2019: "Top 40" New CRAN Packages

One hundred eighty-seven new packages made it to CRAN in April. Here are my picks for the “Top 40”, organized into ten categories: Biotechnology, Data, Econometrics, Machine Learning, Medicine, Science, Statistics, Time Series, Utilities, and Visualization.

Biotechnology

genpwr v1.00: Provides functions for power and sample size calculations for genetic association studies allowing for mis-specification of the model of genetic susceptibility. The methods employed are extensions of Gauderman (2002) and Gauderman (2002).

Momentum Investing with R

After an extended hiatus, Reproducible Finance is back! We’ll celebrate by changing focus a bit and coding up an investment strategy called Momentum. Before we even tiptoe in that direction, please note that this is not intended as investment advice and it’s not intended to be a script that can be implemented for trading.

Analysing the HIV pandemic, Part 4: Classification of lab samples

This is part 4 of a four-part series about the HIV epidemic in Africa. In this final part, we discuss how genetic diversity can be used to classify laboratory samples into either inter-patient or intra-patient classes, using logistic regression. This helps with quality in the lab, since it makes it possible to match new samples with samples taken from the same patient years apart, while allowing for mutation of the HIV genomic sequence.

Analysing the HIV pandemic, Part 3: Genetic diversity

This is part 3 of a four-part series about the HIV epidemic in Africa. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. In this part, we discuss genetic diversity and how it can be analysed using Markov chains and heatmaps.

Virtual Morel Foraging with R

Enjoy a virtual mushroom hunt with R and RSelenium, which allows R to use a web browser as a human would, including clicking on buttons, etc.

Analysing the HIV pandemic, Part 2: Drug resistance testing

This is part 2 of a four-part series about the HIV epidemic in Africa. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. Part 2 discusses drug-resistance testing of HIV isolates in sub-Saharan Africa.

Analysing the HIV pandemic, Part 1: HIV in sub-Sahara Africa

The Human Immunodeficiency Virus (HIV) is the virus that causes acquired immunodeficiency syndrome (AIDS). The virus invades various immune cells, causing loss of immunity, and thus increases susceptibility to infections such as tuberculosis, and to cancer. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug-resistance testing facility. In this first of a series of four posts, we highlight the serious problem of HIV infection in sub-Saharan Africa, with special analysis of the situation in South Africa. The subsequent posts will describe the phylogenetic pipeline (running on a Raspberry Pi) and the analysis of viral sequences using R.
