SelectBoost was born out of a need: to select the most relevant variables possible from a large dataset. Frédéric Bertrand, a mathematics researcher, is working with Seiamak Bahram and Raphaël Carapito and their teams from the Centre de Recherche d’Immunologie et d’Hématologie (Inserm – University of Strasbourg) on the analysis of a gene network. “We are working on certain types of cancer and we want to find out which interactions between genes produce more or less aggressive cells”, explains the researcher.
There are thousands of possible interactions between genes, and the researchers must propose a small number of them, which are then tested in the laboratory. “Our results are tested by biologists using complex procedures. We must not make any mistakes.”
However, as the mathematician explains, the problem with the methods used to select variables is that they are very unstable. He uses the example of the study of a city’s atmospheric conditions: in order to determine the ozone level at midday in a specific geographical area, researchers have various data at their disposal. They can choose to observe the temperature, wind direction or traffic levels.
If the temperature varies from one day to the next, the selection method may select it as a relevant variable at 20.5 degrees, but not at 21. Classical methods are therefore very dependent on the values of variables at the time when they are observed.
In the case of the gene network, the expressions of the genes and their impacts on each other can vary. SelectBoost aims to determine which genes are the most likely to influence the malignancy of cancer cells.
Choosing the right variables in order to influence them
SelectBoost tests the variables several times, each time slightly modifying the data and taking their correlations into account, in order to make a more informed selection. “Once the most stable and pertinent variables have been selected, it is then possible to act on some of them. If, in the case of high ozone levels, temperature and road traffic are the causes, we can concentrate our efforts on regulating the number of vehicles on the road”, concludes the mathematician.
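The idea of re-running a selection on slightly modified data can be illustrated with a minimal Python sketch. This is not the SelectBoost algorithm itself (in particular, it ignores the correlations between variables that SelectBoost takes into account); it only shows the general principle of counting how often each variable survives repeated perturbation. The toy selector, the synthetic ozone data, and all names and parameters (`make_toy_data`, `selection_frequency`, `threshold`, `rel_noise`) are assumptions made for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def select(columns, response, threshold):
    """Toy selector (an assumption, not SelectBoost's): keep variables whose
    absolute correlation with the response exceeds the threshold."""
    return {name for name, col in columns.items()
            if abs(pearson(col, response)) > threshold}


def selection_frequency(columns, response, threshold=0.3,
                        n_perturb=50, rel_noise=0.1, seed=0):
    """Re-run the selector on n_perturb noisy copies of the data and return,
    for each variable, the fraction of runs in which it was selected."""
    rng = random.Random(seed)
    counts = {name: 0 for name in columns}
    for _ in range(n_perturb):
        perturbed = {}
        for name, col in columns.items():
            # Noise is scaled to each variable's own spread.
            scale = rel_noise * statistics.pstdev(col)
            perturbed[name] = [x + rng.gauss(0.0, scale) for x in col]
        for name in select(perturbed, response, threshold):
            counts[name] += 1
    return {name: c / n_perturb for name, c in counts.items()}


def make_toy_data(n=200, seed=0):
    """Synthetic data: ozone depends on temperature and traffic, not on wind."""
    rng = random.Random(seed)
    temperature = [rng.gauss(20.0, 3.0) for _ in range(n)]
    traffic = [rng.gauss(100.0, 20.0) for _ in range(n)]
    wind = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ozone = [0.5 * t + 0.05 * v + rng.gauss(0.0, 1.0)
             for t, v in zip(temperature, traffic)]
    columns = {"temperature": temperature, "traffic": traffic, "wind": wind}
    return columns, ozone


if __name__ == "__main__":
    cols, ozone = make_toy_data()
    for name, freq in selection_frequency(cols, ozone).items():
        print(f"{name}: selected in {freq:.0%} of perturbed runs")
```

A variable that is selected in nearly every perturbed run is a stable candidate, while one that appears only occasionally owes its selection to the particular values observed.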
From the time they obtained the first conclusive results, the team realised the potential of the method: it could help any analyst working on a dataset. “The only drawback with SelectBoost is that it has a certain computational cost. If it normally takes five minutes to test the model, that time has to be multiplied by the number of times that the data are modified to determine the correct variables.” However, there is a parallel version of the algorithm that greatly reduces that limitation.
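The cost described in the quote is simple arithmetic, sketched below in Python. The five-minute fit comes from the quote; the number of perturbations (50) and of parallel workers (8) are hypothetical figures chosen for illustration, and the parallel estimate assumes an idealised case in which the runs are independent and there is no scheduling overhead.

```python
import math


def sequential_minutes(fit_minutes, n_perturbations):
    """Total time when each perturbed dataset is fitted one after another."""
    return fit_minutes * n_perturbations


def parallel_minutes(fit_minutes, n_perturbations, n_workers):
    """Idealised total time when fits are spread evenly over n_workers."""
    return fit_minutes * math.ceil(n_perturbations / n_workers)


if __name__ == "__main__":
    # Hypothetical figures: 50 perturbations, 8 workers.
    print(sequential_minutes(5, 50))   # 5 * 50 = 250 minutes
    print(parallel_minutes(5, 50, 8))  # 5 * ceil(50 / 8) = 5 * 7 = 35 minutes
```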
To enable the entire research community to benefit from SelectBoost, Frédéric Bertrand has made his code freely available online. It runs in R, a software environment well known to statisticians.
- The SelectBoost code is available via this URL or this one, and a website dedicated to the package can be found there.
- The full research paper in English is available on open access here.