Create a pipeline

This guide explains how to create an example pipeline that’s closer to a typical use-case of a Nextflow bioinformatics pipeline.

Please review the VDSL3 principles section for the necessary background.

Get the template project

To get started with building a pipeline, we provide a template project which already contains a few components. First, create a new repository by clicking the “Use this template” button in the viash_project_template repository.

Then clone the repository using the following command.

git clone https://github.com/youruser/my_first_pipeline.git

The template project contains three components and uses two utility components from vsh_utils. With these, we will build the following pipeline:

graph TD
   A(file?.tsv) --> X[vsh_flatten] 
   X --file1.tsv--> B1[/remove_comments/] --> C1[/take_column/] --> Y
   X --file2.tsv--> B2[/remove_comments/] --> C2[/take_column/] --> Y
   Y[vsh_toList] --> D[/combine_columns/]
   D --> E(output)

  • vsh_flatten is a component to transform a Channel event containing multiple files (in this case using a glob ?) into multiple Channel events each containing one file to operate on. It is a Viash-compatible version of the Nextflow flatten operator.
  • remove_comments is a Bash script which removes all lines starting with a # from a file.
  • take_column is a Python script which extracts one of the columns in a TSV file.
  • vsh_toList is a component/module that does the opposite of vsh_flatten: it turns multiple Channel items into one Channel item containing a list.
  • combine_columns is an R script which combines multiple files into a TSV.
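
To make the data flow concrete, the following shell sketch imitates the three local components with standard Unix tools. It is not the template's actual code: the function bodies, the choice of column 2, and the sample data are assumptions made purely for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins for the three components (not the template's real scripts).
remove_comments() { sed '/^#/d' "$1"; }   # drop every line starting with '#'
take_column()     { cut -f "$1"; }        # keep one tab-separated column from stdin
combine_columns() { paste "$@"; }         # join the given files column-wise into a TSV

# Two small sample inputs (made up for this sketch)
workdir=$(mktemp -d)
printf '# header\none\t0.11\t123\ntwo\t0.23\t456\n' > "$workdir/file1.tsv"
printf '# header\neins\t0.111\t234\nzwei\t0.222\t234\n' > "$workdir/file2.tsv"

# "Flatten": handle each file matched by the glob separately
for f in "$workdir"/file?.tsv; do
  remove_comments "$f" | take_column 2 > "${f%.tsv}.col.tsv"
done

# "toList" + combine: gather the per-file results into one TSV
# prints two rows: 0.11<TAB>0.111 and 0.23<TAB>0.222
combine_columns "$workdir"/file?.col.tsv

rm -rf "$workdir"
```

In the real pipeline, each of these steps runs as a separate VDSL3 module and the loop is replaced by Channel events.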

Build the VDSL3 modules and workflow

First, we need to build the components into VDSL3 modules. Since Viash version 0.8.x, this also includes the workflows and subworkflows themselves, because they can be stored under src and built to target/.

viash ns build --setup cachedbuild --parallel
temporaryFolder: /tmp/viash_hub_repo18120050040228640857 uri: https://viash-hub.com/data-intuitive/vsh-pipeline-operators.git
Cloning into '.'...
checkout out: List(git, checkout, tags/v0.2.0, --, .) 0 
Exporting workflow (template) =nextflow=> /home/runner/work/website/website/guide/_viash_project_template/target/nextflow/template/workflow
Exporting take_column (template) =docker=> /home/runner/work/website/website/guide/_viash_project_template/target/docker/template/take_column
Exporting combine_columns (template) =docker=> /home/runner/work/website/website/guide/_viash_project_template/target/docker/template/combine_columns
Exporting combine_columns (template) =nextflow=> /home/runner/work/website/website/guide/_viash_project_template/target/nextflow/template/combine_columns
[notice] Building container 'ghcr.io/viash-io/viash_project_template/template/take_column:0.2.3' with Dockerfile
[notice] Building container 'ghcr.io/viash-io/viash_project_template/template/combine_columns:0.2.3' with Dockerfile
Exporting take_column (template) =nextflow=> /home/runner/work/website/website/guide/_viash_project_template/target/nextflow/template/take_column
Exporting remove_comments (template) =docker=> /home/runner/work/website/website/guide/_viash_project_template/target/docker/template/remove_comments
[notice] Building container 'ghcr.io/viash-io/viash_project_template/template/remove_comments:0.2.3' with Dockerfile
Exporting remove_comments (template) =nextflow=> /home/runner/work/website/website/guide/_viash_project_template/target/nextflow/template/remove_comments
All 7 configs built successfully

For more information about the --setup and --parallel arguments, please refer to the reference section.

The output of viash ns build tells us that

  1. two dependencies are fetched (from Viash Hub)
  2. the locally defined components are built into Nextflow modules
  3. the locally defined workflow template/workflow is built (see later)
  4. containers are built for the local modules

Once viash ns build is finished, a new target directory has been created containing the executables and modules grouped per platform:

tree target
target
├── dependencies
│   └── vsh
│       └── data-intuitive
│           └── vsh-pipeline-operators
│               └── v0.2.0
│                   └── nextflow
│                       └── join
│                           └── vsh_toList
│                               ├── main.nf
│                               └── nextflow.config
├── docker
│   └── template
│       ├── combine_columns
│       │   └── combine_columns
│       ├── remove_comments
│       │   └── remove_comments
│       └── take_column
│           └── take_column
└── nextflow
    └── template
        ├── combine_columns
        │   ├── main.nf
        │   └── nextflow.config
        ├── remove_comments
        │   ├── main.nf
        │   └── nextflow.config
        ├── take_column
        │   ├── main.nf
        │   └── nextflow.config
        └── workflow
            ├── main.nf
            └── nextflow.config

19 directories, 13 files
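
To check which Nextflow modules a build produced without reading the full tree, a small helper along the following lines can be used. The function name list_modules is hypothetical, and the sketch only assumes the target/nextflow/<namespace>/<name>/main.nf layout shown above.

```shell
#!/bin/sh

# List every built Nextflow module below a target directory (default: ./target).
list_modules() {
  target_dir=${1:-target}
  find "$target_dir/nextflow" -type f -name main.nf | sort | while read -r main_nf; do
    module_dir=$(dirname "$main_nf")
    # Print the path relative to <target>/nextflow, e.g. "template/take_column"
    echo "${module_dir#"$target_dir"/nextflow/}"
  done
}

# Only list modules if a build has already been run in this directory
if [ -d target/nextflow ]; then
  list_modules target
fi
```

For the build above, this would list the three components and the workflow under the template namespace.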

Import a VDSL3 module

Viash version 0.8 and beyond

Note

This functionality is available since Viash version 0.8.x and assumes the workflow code is encoded as a Viash component with a corresponding config.vsh.yaml config file.

In order to use a module or subworkflow one simply has to add the module (either local or remote) to the dependencies slot in the Viash config file, for example:

functionality:
  dependencies: 
    - name: template/combine_columns
      repository: local

  repositories:
    - name: local
      type: local

After that, the module will be included automatically during the Viash build stage. For more information, please refer to the reference.

All Viash versions

As illustrated by the tree output above, a module can be included by pointing to its location. This approach can be used for any Nextflow module (that exposes a compatible API):

include { remove_comments } from "./target/nextflow/template/remove_comments/main.nf"

Create a pipeline

All Viash versions

We can use a module in a conventional Nextflow pipeline which takes two input files (file1 and file2) and removes the lines that contain comments (lines starting with #) from those files:

include { remove_comments } from "./target/nextflow/template/remove_comments/main.nf"

workflow {

  // Create a channel with two events
  // Each event contains a string (an identifier) and a file (input)
  Channel.fromList([
      ["file1", [ input: file("resources_test/file1.tsv") ] ],
      ["file2", [ input: file("resources_test/file2.tsv") ] ]
    ])

    // View channel contents
    | view { tup -> "Input: $tup" }
    
    // Process the input file using the 'remove_comments' module.
    // This removes comment lines from the input TSV.
    | remove_comments.run(
      directives: [
        publishDir: "output/"
      ]
    )

    // View channel contents
    | view { tup -> "Output: $tup" }
}

In plain English, the workflow works as follows:

  1. Create a Channel with 2 items, corresponding to 2 input files.
  2. Specify the respective input files as corresponding to the --input argument: [ input: ... ].
  3. Add a view operation to inspect the contents of the Channel.
  4. Run the remove_comments step and publish the results to output/. No additional fromState and toState arguments are specified because the defaults suffice.
  5. One more view to show the resulting processed Channel items.

We point the reader to the VDSL3 principles section for more information about how data flow (aka state) is managed in a VDSL3 workflow.

Pipeline as a component

The run() function is available on every VDSL3 module and allows dynamically altering the behaviour of a module from within the pipeline. For example, we use it to set the publishDir directive to "output/" so that the output of that step is published to the output/ directory.

Note

This functionality is available since Viash version 0.8.x.

We can do the same, but this time encoding the pipeline as a Viash component itself:

workflow run_wf {
  take:
    input_ch

  main:

    output_ch = 

      // Create a channel with two events
      // Each event contains a string (an identifier) and a file (input)
      Channel.fromList([
          ["file1", [ input: file("resources_test/file1.tsv") ] ],
          ["file2", [ input: file("resources_test/file2.tsv") ] ]
        ])

        // View channel contents
        | view { tup -> "Input: $tup" }
        
        // Process the input file using the 'remove_comments' module.
        // This removes comment lines from the input TSV.
        | remove_comments

        // View channel contents
        | view { tup -> "Output: $tup" }

  emit:
    output_ch
      | map{ id, state -> [ "run", state ] }
}

Together with a config file like this one:

functionality:
  name: test
  namespace: template
  description: |
    An example pipeline and project template.

  arguments:
    - name: "--output"
      alternatives: [ "-o" ]
      type: file
      direction: output
      required: true
      description: Output TSV file
      example: output.tsv

  resources:
    - type: nextflow_script
      path: main.nf
      entrypoint: run_wf

  dependencies: 
    - name: template/remove_comments
      repository: local

  repositories:
    - name: local
      type: local

platforms:
  - type: nextflow

Run the pipeline

Now run the pipeline with Nextflow:

nextflow run . \
  -main-script main.nf
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [happy_boyd] DSL2 - revision: dc137fbfcf
[-        ] process > remove_comments:processWf:r... -
Input: [file1, [input:/home/runner/work/website/website/guide/_viash_project_template/resources_test/file1.tsv]]
Input: [file2, [input:/home/runner/work/website/website/guide/_viash_project_template/resources_test/file2.tsv]]

executor >  local (2)
[e4/fbfaba] process > remove_comments:processWf:r... [100%] 2 of 2 ✔
Input: [file1, [input:/home/runner/work/website/website/guide/_viash_project_template/resources_test/file1.tsv]]
Input: [file2, [input:/home/runner/work/website/website/guide/_viash_project_template/resources_test/file2.tsv]]
Output: [file1, [output:/home/runner/work/website/website/guide/_viash_project_template/work/7d/d419f8066b0a77f745cec0f58b246f/file1.remove_comments.output.tsv]]
Output: [file2, [output:/home/runner/work/website/website/guide/_viash_project_template/work/e4/fbfabae8b30ff82312dae82c9306f7/file2.remove_comments.output.tsv]]

The pipeline ran on the following example data:
cat resources_test/file?.tsv
# this is a header      
# this is also a header     
one 0.11    123
two 0.23    456
three   0.35    789
four    0.47    123
# this is not a header
# just kidding yes it is
eins    0.111   234
zwei    0.222   234
drei    0.333   123
vier    0.444   123

This results in the following output:

tree output
output
├── file1.remove_comments.output.tsv -> /home/runner/work/website/website/guide/_viash_project_template/work/7d/d419f8066b0a77f745cec0f58b246f/file1.remove_comments.output.tsv
└── file2.remove_comments.output.tsv -> /home/runner/work/website/website/guide/_viash_project_template/work/e4/fbfabae8b30ff82312dae82c9306f7/file2.remove_comments.output.tsv

0 directories, 2 files
cat output/*
one 0.11    123
two 0.23    456
three   0.35    789
four    0.47    123
eins    0.111   234
zwei    0.222   234
drei    0.333   123
vier    0.444   123
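
As a quick sanity check on the published files, the sketch below tests whether any comment lines remain. The has_comments helper is a hypothetical name introduced here, and the loop assumes the files were published to output/ as in the run above.

```shell
#!/bin/sh

# Succeeds (exit status 0) if any of the given files contains a line starting with '#'.
has_comments() {
  grep -q '^#' "$@"
}

# Report the status of every published TSV file
for f in output/*.tsv; do
  [ -e "$f" ] || continue   # skip the unexpanded glob if output/ is empty or missing
  if has_comments "$f"; then
    echo "comments left in: $f"
  else
    echo "ok: $f"
  fi
done
```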

Discussion

The above example pipeline serves as the backbone for creating real-life pipelines. However, for the sake of simplicity, it contains several hardcoded elements that should be avoided:

  • Input parameters should be provided as an argument to the pipeline or as part of the pipeline configuration
  • The output directory should be specified as an argument to the pipeline

As illustrated earlier, these come for free when encoding the workflow as a Viash component. One even gets parameter checks with it!