Run your Package Native Code model

Suggest edits
Documentation > Run

Contents

When your model is written in a language for which a specific OpenMOLE task doesn't exist, or if it uses an assembly of tools, libraries, binaries, etc. you have to package it in a single piece of executable code, so that you can send it to any machine (with potentially varying OS installations). To make your code portable from machine to machine (Linux to Linux) you may either use containers or the CARE packaging tool.

This section intends to be general and seeks to explain how to embed any native code into OpenMOLE. More specific documentation pages on how to package some languages/platforms are available for:
If you have an example of a successful packaging for a language or software that may benefit other users, share it! For that, check how to contribute.

Packaging with CARE 🔗

In OpenMOLE, a generic task named CARETask allows to run external applications packaged with CARE. Documentation can be found here.

CARE makes it possible to package your application from any Linux computer, and then re-execute it on any other Linux computer. The CARE/OpenMOLE pair is a very efficient way to distribute your application on a large scale with very little effort.
Please note that this packaging step is only necessary if you plan to distribute your workflow on an heterogeneous computing environment such as the EGI grid. If you target local clusters running the same operating system and sharing a network file system, you can directly jump to the SystemExecTask.

Install CARE 🔗

  • download the CARE tool here
  • make it executable: chmod +x care
  • add the path to the executable to your PATH variable: export PATH=/path/to/the/care/folder:$PATH

Embed an application 🔗

The CARETask was designed to embed native binaries such as programs compiled from C, C++, Fortran, Python, R, Scilab, etc. Embedding an application in a CARETask happens in 2 steps: first, packaging it with CARE, and then importing it in OpenMOLE.
Packaging an application
First, you should package your application using the CARE binary you just installed, so that it executes on any Linux environment. This usually consists in prepending your command line with: care -o /path/to/myarchive.tgz.bin -r ~ -p /path/to/mydata1 -p /path/to/mydata2 mycommand myparam1 myparam2. Before going any further, here are a few notes about the options accepted by CARE:
  • -o indicates where to store the archive. At the moment, OpenMOLE prefers to work with archives stored in .tgz.bin, so please don't toy with the extension ;-)
  • -r ~ is not compulsory but it has proved mandatory in some cases. So as a rule of thumb, if you encounter problems when packaging your application, try adding/removing it.
  • -p /path asks CARE not to archive /path. This is particularly useful for input data that will change with your parameters. You probably do not want to embed this data in the archive, and we'll see further down how to inject the necessary input data in the archive from OpenMOLE.
Importing a packaged application in OpenMOLE
Second, just provide the resulting package along with some other information to OpenMOLE. Et voilà! If you encounter any problem to package your application, please refer to the corresponding entry in the FAQ.

One very important aspect of CARE is that you only need to package your application once. As long as the execution you use to package your application makes uses of all the dependencies (libraries, packages...), you should not have any problem re-executing this archive with other parameters.

Advanced Options 🔗

Return value
The CARETask can be customised to fit the needs of a specific application. For instance, some applications disregarding standards might not return the expected 0 value upon completion. The return value of the application is used by OpenMOLE to determine whether the task has been successfully executed, or needs to be re-executed. Setting the boolean flag errorOnReturnValue to false will prevent OpenMOLE from re-scheduling a CARETask that has reported a return code different from 0. You can also get the return code in a variable using the returnValue setting.
Standard and error outputs
Another default behaviour is to print the standard and error outputs of each task in the OpenMOLE console. Such raw prints might not be suitable when a very large number of tasks is involved, or when further processes are to be performed on the outputs. A CARETask's standard and error outputs can be assigned to OpenMOLE variables and thus injected in the data flow by summoning respectively the stdOut and stdErr actions on the task.
Environment variables
As any other process, the applications contained in OpenMOLE's native tasks accept environment variables to influence their behaviour. Variables from the data flow can be injected as environment variables using the environmentVariable += (variable, "variableName") field. If no name is specified, the environment variable is named after the OpenMOLE variable. Environment variables injected from the data flow are inserted in the pre-existing set of environment variables from the execution host. This shows particularly useful to preserve the behaviour of some toolkits when executed on local environments (ssh, clusters...) where users control their work environment.

The following snippet creates a task that employs the features described in this section:

// Declare the variable
val output = Val[String]
val error  = Val[String]
val value = Val[Int]

// Any task
val pythonTask =
  CARETask("hello.tgz.bin", "python hello.py") set (
    stdOut := output,
    stdErr := error,
    returnValue := value,
    environmentVariable += (value, "I_AM_AN_ENV_VAR")
  )

You will note that options holding a single value are set using the := operator. Also, the OpenMOLE variables containing the standard and error outputs are automatically marked as outputs of the task, and must not be added to the outputs list.

Native API 🔗

You can configure the execution of the CARETask using the set operator on a freshly defined task.

val out = Val[Int]

val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  customWorkDirectory := "/tmp",
  returnValue := out
)

The available options are described hereafter:
  • hostFiles: takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally you can provide a second argument to specify the path explicitly. Example: hostFiles += ("/etc/hosts") or with a specific path hostFiles += ("/etc/bash.bashrc", "/home/foo/.bashrc").
  • environmentVariables: is used to set the value of an environment variable within the context of the execution. Example: environmentVariables += ("VARIABLE1", "42"). Multiple hostFiles entries can be used within the same set block.
  • workDirectory: sets the directory within the archive where to start the execution from. Example: workDirectory := "/tmp".
  • returnValue: captures the return code of the execution in an OpenMOLE Val[Int] variable. Example: returnValue := out.
  • errorOnReturnValue: tells OpenMOLE to ignore a return code different from 0. The task won't be resubmitted. Example: errorOnReturnValue := false.
  • stdOut: captures the standard output of the execution in an OpenMOLE Val[String] variable. Example: stdOut := output.
  • stdErr: captures the error ouput of the execution in an OpenMOLE Val[String] variable. Example: stdErr := error.

Using local Resources 🔗

To access data present on the execution node (outside the CARE filesystem) you should use a dedicated option of the CARETask: hostFiles. This option takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally you can provide a second argument to specify the path explicitly. For instance:

val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  hostFiles += ("/path/to/another/file", "/virtual/path")
)

This CAREtask will thus have access to /path/to/my/file and /virtual/path.

Using local executable 🔗

The CARETask was designed to be portable from one machine to another. However, some use-cases require executing specific commands installed on a given cluster. To achieve that you should use another task called SystemExecTask. This task is made to launch native commands on the execution host. There is two modes for using this task:
  • Calling a command that is assumed to be available on any execution node of the environment. The command will be looked for in the system as it would from a traditional command line: searching in the default PATH or an absolute location.
  • Copying a local script not installed on the remote environment. Applications and scripts can be copied to the task's work directory using the resources field. Please note that contrary to the CARETask, there is no guarantee that an application passed as a resource to a SystemExecTask will re-execute successfully on a remote environment.

The SystemExecTask accepts an arbitrary number of commands. These commands will be executed sequentially on the same execution node where the task is instantiated. In other words, it is not possible to split the execution of multiple commands grouped in the same SystemExecTask.

The following example first copies and runs a bash script on the remote host, before calling the remote's host /bin/hostname. Both commands' standard and error outputs are gathered and concatenated to a single OpenMOLE variable, respectively stdOut and stdErr:

// Declare the variable
val output = Val[String]
val error  = Val[String]

// Any task
val scriptTask =
  SystemExecTask("bash script.sh", "hostname") set (
    resources += workDirectory / "script.sh",
    stdOut := output,
    stdErr := error
  )

 scriptTask hook ToStringHook()

In this case the bash script might depend on applications installed on the remote host. Similarly, we assume the presence of /bin/hostname on the execution node. Therefore this task cannot be considered as portable.
Note that each execution is isolated in a separate folder on the execution host and that the task execution is considered as failed if the script returns a value different from 0. If you need another behaviour you can use the same advanced options as the CARETask regarding the return code.

File management 🔗

To provide files as input of a CARETask or SystemExecTask and to get files produced by these task, you should use the inputFiles and outputFiles keywords. See the documentation on file management.

CARE Troubleshooting 🔗

You should always try to re-execute your application outside of OpenMOLE first. This allows you to ensure the packaging process with CARE was successful. If something goes wrong at this stage, you should check the official CARE documentation or the archives of the CARE mailing list.
If the packaged application re-executes as you'd expect, but you still struggle to embed it in OpenMOLE, then get in touch with our user community via the OpenMOLE user forum or the chat.

Research paper 🔗

Using CARE and OpenMOLE for reproducible science has been covered in the following paper :
Jonathan Passerat-Palmbach, Romain Reuillon, Mathieu Leclaire, Antonios Makropoulos, Emma C. Robinson, Sarah Parisot and Daniel Rueckert, Reproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System, published in Frontiers in Neuroinformatics Vol 11, 2017.
[online version] [bibteX]

Packaging with Docker 🔗

The ContainerTask runs docker container in OpenMOLE. It is based on a container engine running without administration privileges called udocker. Using this task, you can run containers from the docker hub, or containers exported from docker.

Containers from the docker hub 🔗

A simple task running a python container would look like:

val container =
  ContainerTask(
    "python:3.6-stretch",
    """python -c 'print("youpi")'"""
  )

You can run this task. At launch time, it downloads the python image from the docker hub in order to be able to run it afterwards.

Let's imagine a slightly more complete example: we will use the following python script, which uses the numpy library to multiply a matrix (stored in a csv file) by a scalar number.

import sys
import numpy
from numpy import *
from array import *
import csv

input = open(sys.argv[1],'r')
n = float(sys.argv[2])

print("reading the matrix")
data = csv.reader(input)
headers = next(data, None)

array = numpy.array(list(data)).astype(float)

print(array)
print(n)
mult = array * n

print("saving the matrix")
numpy.savetxt(sys.argv[3], mult, fmt='%g')

An example input file would look like this:

col1,col2,col3
31,82,80
4,48,7

For this example, we consider that we have a data directory, containing a set of csv files in the workDirectory. We want to compute the python script for each of this csv file and for a set of values for the second argument of the python script. The OpenMOLE workflow would then look like this:

val dataFile = Val[File]
val dataFileName = Val[String]
val i = Val[Int]
val resultFile = Val[File]

val container =
  ContainerTask(
    "python:3.6-stretch",
    """python matrix.py data.csv ${i} out.csv""",
    install = Seq("pip install numpy")
  ) set (
    resources += workDirectory / "matrix.py",
    inputFiles += (dataFile, "data.csv"),
    outputFiles += ("out.csv", resultFile),
    (inputs, outputs) += (i, dataFileName)
  )

DirectSampling(
  sampling =
    (dataFile in (workDirectory / "data") withName dataFileName) x
    (i in (0 to 3)),
  evaluation =
    container hook CopyFileHook(resultFile, workDirectory / "results/${dataFileName.dropRight(4)}_${i}.csv")
)


The install parameter contains a set of command used to install some components in the container once and for all, when the task is instantiated.