Plug your R model

RTask 🔗

R is a scripting language initially designed for statistics, but its range of applications is much broader today (for example GIS, operational research, linear algebra, web applications, etc.), thanks to its large community and the variety of its packages. It may be convenient to use specific R libraries within a workflow, and OpenMOLE therefore provides a dedicated RTask.

Preliminary remark 🔗

The RTask uses the Singularity container system. You must install Singularity on your system, otherwise you will not be able to use the RTask.

RTask syntax 🔗

The RTask relies on an underlying ContainerTask but is designed to be transparent and takes only R-related arguments. The current version of R used is 4.0.2. It takes the following arguments:
  • script String, mandatory. The R script to be executed, either R code directly or an R script file.
  • install Sequence of strings, optional (default = empty). The commands to be executed prior to any R package installation and R script execution (see the example below: some R libraries have system dependencies that have to be installed first).
  • libraries Sequence of strings, optional (default = empty). The names of the R libraries that will be used by the script and need to be installed beforehand (note: as detailed below, installations are only done during the first execution of the R script, and then stored in a cached Docker image).
  • clearContainerCache Boolean, optional (default = false). Should the R image and libraries be cleared and reinstalled (to ensure an update, for example)? If true, the task will perform the installation (and thus the update) even if the library was already installed.

The following properties must be defined using set:
  • input/output similar to any other task,
  • mapped input: the syntax inputs += om-variable mapped "r-variable" establishes a link between the workflow variable om-variable (Val) and the corresponding R variable named r-variable (as a String). If variables have the same name, you can use the short syntax inputs += my-variable.mapped,
  • mapped output: similar syntax as inputs to collect outputs of the model.
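
As a minimal sketch of how these arguments and properties fit together, the task below passes all four arguments explicitly and maps one input and one output. The variable names radius and area, the task name rTaskSketch and the R expression are illustrative assumptions; the empty install and libraries sequences only restate the documented defaults.

```scala
// Workflow variables (illustrative names)
val radius = Val[Double]
val area = Val[Double]

// All four RTask arguments written out explicitly
val rTaskSketch = RTask(
    script = """
        # r is fed by the workflow variable radius
        area <- pi * r^2
    """,
    install = Seq(),              // no system command needed here
    libraries = Seq(),            // no additional R package needed here
    clearContainerCache = false   // reuse the cached image when it exists
) set (
    inputs += radius mapped "r",  // long syntax: the names differ
    outputs += area.mapped,       // short syntax: the names are identical
    radius := 2.0
)

// Workflow
rTaskSketch hook display
```

In practice the last three arguments can be omitted whenever their default values are suitable.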

Below, we develop a detailed example of how to use an RTask, from a very simple use case to a more elaborate one involving system libraries and R libraries.

Execute R code 🔗

The toy R script for this first test case is the following:

# Define the function
f <- function(x) {
    x + 1
}

# Use the function
j <- f(2)

This script creates a function f that takes a parameter (a number) and adds 1 to it. It then applies the function to the number 2. We save this to a file named myRScript.R in our OpenMOLE workspace.

Write R code in the RTask 🔗

For our first example, we write the R script directly in the RTask.

// Declare variables
val result = Val[Int]

// Task
val rTask1 = RTask("""
    # Here you write your R code

    # Define the function
    f <- function(x) {
        x + 1
    }

    # Use the function
    j <- f(2)
""") set (
    outputs += result mapped "j"
)

// Workflow
rTask1 hook display

We provide the result variable to store the result of the function execution j, and we display its value in the standard output through hook display.

Running R code from a script 🔗

Instead of writing the R code in the RTask, we can call an external R script containing the code to be executed. We will use the file myRScript.R created earlier. It needs to be uploaded to the OpenMOLE workspace.
All the code is in the R script
If all the R code you need is written in your R script, you just need to provide the path to this script.

// Declare variables
val result = Val[Int]

// Task
val rTask2 = RTask(script = workDirectory / "myRScript.R") set (
    outputs += result mapped "j"
)

// Workflow
rTask2 hook display

This workflow should return the exact same result as the previous example.
Additional R code is needed
If you need additional R code besides what is included in your script, you need a mix of the first two examples. We will need to write R code and thus use the syntax from the first example, while also providing an external R script.
In order to use the R script, we need to use the resources field with the precise location of the file in our work directory. It will then be imported in the RTask by the R primitive source("myRScript.R").

// Declare variables
val result = Val[Int]

// Task
val rTask3 = RTask("""
    # Import the external R script
    source("myRScript.R")

    # Add some code
    k <- j + 2
""") set (
    resources += (workDirectory / "myRScript.R"),
    outputs += result mapped "k"
)

// Workflow
rTask3 hook display

This time, we modify the output of the R script (by adding 2 to the result) before returning a value to OpenMOLE.

Provide input values 🔗

We want to be able to define inputs to the RTask externally, and to store the output values.

Mapped values 🔗

It is possible to do so through the inputs and outputs parameters in the set part of the task.

// Declare variables
val myInput = Val[Int]
val myOutput = Val[Int]

// Task
val rTask4 = RTask("""
    # Define the function
    f <- function(x) {
        x + 1
    }

    # Use the function
    j <- f(i)
""") set (
    inputs += myInput mapped "i",
    outputs += myOutput mapped "j",

    // Default value for the input
    myInput := 3
)

// Workflow
rTask4 hook display

Here, i and j are R variables defined and used in the R code, while myInput and myOutput are OpenMOLE variables. The syntax om-variable mapped "r-variable" creates a link between the two, indicating that these should be considered the same in the workflow.

If your OpenMOLE variable and R variable have the same name (say my-variable for instance), you can use the following shortcut syntax: my-variable.mapped.

Combine mapped and classic inputs/outputs 🔗

If you have several outputs, you can combine mapped outputs with classic outputs that are not part of the RTask:

// Declare variables
val i = Val[Int]
val j = Val[Double]
val c = Val[Double] // c is not used in the RTask

// Task
val rTask5 =
RTask("""
    # Define the function
    f <- function(x) {
        x + 1
    }

    # Use the function
    j <- f(i)
""") set (
    inputs += i.mapped,
    inputs += c,

    outputs += i, // i doesn't need to be mapped again, it was done just above
    outputs += j.mapped,
    outputs += c,

    // Default values
    i := 3,
    c := 2
)

// Workflow
rTask5 hook display

This technique can be used when you have a chain of tasks and you want to use a hook. Indeed, the hook only captures outputs of the last executed task, thus we can add a variable of interest in the output of the task even if it does not appear in this task.
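
A minimal sketch of such a chain, assuming a hypothetical upstream ScalaTask named producer that computes c: the R code never reads c, but declaring it as a classic input and output of the RTask lets the hook on the last task display it. The task names and the values are illustrative only.

```scala
// Declare variables
val i = Val[Int]
val j = Val[Double]
val c = Val[Double] // c is never used in the R code

// Hypothetical upstream task producing i and c
val producer = ScalaTask("""
    val i = 3
    val c = 2.0
""") set (
    outputs += i,
    outputs += c
)

// The RTask maps i and j, and simply passes c through
val rTaskChained = RTask("""
    j <- i + 1
""") set (
    inputs += i.mapped,
    inputs += c,

    outputs += j.mapped,
    outputs += c
)

// Workflow: the hook is attached to the last task only
producer -- (rTaskChained hook display)
```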

Working with files 🔗

It is possible to use files as arguments of an RTask through the inputFiles keyword. Note that inputFiles is different from resources, which was used to import external R scripts: inputFiles provides OpenMOLE variables of type File that can be acted upon in a workflow.

In this example workflow, we first have a ScalaTask writing numbers in a file. The file is created through the OpenMOLE variable myFile of type java.io.File. In order to have access to this file in the RTask, we add myFile as an output of the ScalaTask and an input of the RTask.

// Declare variable
val myFile = Val[File]
val resR = Val[Array[Double]]

// ScalaTask creating the file myFile
val task1 = ScalaTask("""
    val myFile = newFile()
    myFile.content = "3 6 4"
""") set (
    outputs += myFile
)

// RTask using myFile as an input
val rTask5 = RTask("""
    myData <- read.table("fileForR.txt", sep = " ")
    myVector <- as.vector(myData, mode = "numeric")

    f <- function(x) {
        x + 1
    }

    k <- f(myVector)
""") set(
    inputFiles += (myFile, "fileForR.txt"),
    outputs += resR mapped "k"
)

// Workflow
task1 -- (rTask5 hook display)

The R script in the RTask reads a file named fileForR.txt (in the R script presented here, it is expected to contain numeric values separated by single spaces), and creates an R variable myVector, which is a vector containing the values of the file fileForR.txt. We then apply the function f to that vector.
The fileForR.txt file is set as an input file of the RTask following the syntax: inputFiles += (om-fileVariable, "filename-in-R-code"). For more information about file management in OpenMOLE, see this page.
The end of the workflow simply tells OpenMOLE to chain the two tasks and to display the outputs of the last task (here the OpenMOLE variable resR) in the standard output.

Using libraries 🔗

Here we give an example of how to use a library in an RTask. We use the function CHullArea of the library GeoRange to compute the area of the convex hull of a set of points.

We need to write the names of the libraries we need in the field libraries, as a sequence, and they will be installed from the CRAN repository. You might need to write or adapt the install field accordingly. It is a sequence of system commands which are executed prior to the installation of the R libraries on the machine which executes OpenMOLE. It can be used to install the system packages which are required by the R libraries.
The RTask is based on a Debian container, so you can use any Debian command here, including the apt installation tool.

Note: the first time you use R with libraries or packages, it takes some time to install them, but for the next uses those libraries will be stored, and the execution will be quicker.

// Declare variable
val area = Val[Double]

// Task
val rTask6 = RTask("""
    library(GeoRange)

    n <- 40
    x <- rexp(n, 5)
    y <- rexp(n, 5)

    # To have the convex envelop of the set of points we created
    liste <- chull(x, y)
    hull <- cbind(x, y) [liste,]

    # require GeoRange
    area <- CHullArea(hull[, 1], hull[, 2])
    """,
    install = Seq("fakeroot apt-get update", "fakeroot apt-get install -y libgdal-dev libproj-dev"),
    libraries = Seq("GeoRange")
) set(
    outputs += area.mapped
)

// Workflow
rTask6 hook display
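
Because installed libraries are cached after the first execution, a later run does not reinstall them. As a hedged sketch, assuming you want to force a fresh installation (for example to pick up a newer version of GeoRange), the same task can be declared with the clearContainerCache argument described in the syntax section set to true. The task name rTaskRefresh is illustrative.

```scala
// Declare variable
val area = Val[Double]

// Same task as above, but the cached image is cleared and rebuilt
val rTaskRefresh = RTask("""
    library(GeoRange)

    n <- 40
    x <- rexp(n, 5)
    y <- rexp(n, 5)

    liste <- chull(x, y)
    hull <- cbind(x, y) [liste,]

    area <- CHullArea(hull[, 1], hull[, 2])
    """,
    install = Seq("fakeroot apt-get update", "fakeroot apt-get install -y libgdal-dev libproj-dev"),
    libraries = Seq("GeoRange"),
    clearContainerCache = true // force the reinstallation of the image and libraries
) set(
    outputs += area.mapped
)

// Workflow
rTaskRefresh hook display
```

Once the update is done, switching the argument back to false (or removing it) restores the quicker cached behaviour.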

Advanced RTask usage 🔗

Use a library within Docker 🔗

If you are starting OpenMOLE within Docker, installing R packages in an RTask might require a slightly different parameter setting. In the example below, the commands of the install field are prefixed with fakeroot in order to get the permissions required to use the Debian apt tools for installation.

// Declare variable
val area = Val[Double]

// Task
val rTask7 = RTask("""
    library(GeoRange)

    n <- 40
    x <- rexp(n, 5)
    y <- rexp(n, 5)

    # To have the convex envelop of the set of points we created
    liste <- chull(x, y)
    hull <- cbind(x, y) [liste,]

    # require GeoRange
    area <- CHullArea(hull[, 1], hull[, 2])
    """,
    install = Seq("fakeroot apt-get update", "fakeroot apt-get install -y libgdal-dev libproj-dev"),
    libraries = Seq("GeoRange")
) set(
    outputs += area.mapped
)

// Workflow
rTask7 hook display

Use of HTTP proxy 🔗

If you start OpenMOLE behind an HTTP proxy, you are probably already familiar with the --proxy parameter of the OpenMOLE command line, which makes OpenMOLE use your proxy when downloading anything from the web. You can use it as follows: openmole --proxy http://myproxy:3128. OpenMOLE also uses this proxy to download any container, including the containers used behind the scenes to run an RTask, and the RTask uses it to download packages from the web.

Use alternative Debian repositories 🔗

The install parameter of an RTask lets you use Debian installation tools such as apt to install packages in the container running R. These tools download Debian packages from the default international Debian repositories (servers). In some cases, you might want to use alternative repositories.

A first reason might be speed: downloading and installing packages might require hundreds of megabytes of download, leading to significant data consumption and a slower construction of the container (only at the first execution, as the container is reused for further executions). If your institution is running a local Debian repository, you would save data and time by using this repository. You might also need packages which are not part of the default Debian repositories.

You can do so by making a smart use of the install parameter to define your own repositories as shown in the example below.

// Declare variable
val area = Val[Double]

// Task
val rTask8 = RTask("""
    library(ggplot2)
    library(gganimate)

    # your R script here
    # [...]
    """,
    install = Seq(
       // replace the initial Debian repositories by my repository
       "fakeroot sed -i 's/deb.debian.org/linux.myinstitute.org/g' /etc/apt/sources.list",
       // display the list on the console so I can double check what happens
       "fakeroot cat /etc/apt/sources.list",
       // update the list of available packages (here I disable HTTP proxy as this repository is in my network)
       "fakeroot apt-get -o Acquire::http::proxy=false update",
       // install required R packages in their binary version (quicker, much more stable!)
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y r-cran-ggplot2",
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y r-cran-gganimate",
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y r-cran-plotly",
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y r-cran-ggally",
       // install the libs required for the compilation of R packages
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y libssl-dev libcurl4-openssl-dev libudunits2-dev",
       // install ffmpeg to render videos
       "DEBIAN_FRONTEND=noninteractive fakeroot apt-get -o Acquire::http::proxy=false install -y ffmpeg"
       ),
    libraries = Seq("ggplot2", "gganimate", "plotly", "GGally")
) set(
    outputs += area.mapped
)

// Workflow
rTask8 hook display