Creating an R Component

Developing a new Viash component using R.

In this tutorial, you’ll create a component that does the following:

  • Extract all hyperlinks from a markdown file
  • Check if every URL is reachable
  • Create a text report with the results

The component will be able to run both locally and as a Docker container. To create a component you need two files: a script that provides the functionality and a config file that describes the component.

The files used in this tutorial can be found here:

https://github.com/viash-io/viash_web/tree/main/static/examples/md_url_checker_r

Prerequisites

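To follow along with this tutorial, you need Viash, R and Docker installed on your machine. The script also relies on the tidyverse, httr, rvest and rmarkdown packages, and the unit test at the end uses testthat and processx.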

We recommend you take a look at the hello world example first to understand how components work.

Write a script in R

The first step in developing this component is writing its core functionality, in this case an R script.
Create a new folder named my_viash_component and open it. Then create a new file named script.R in that folder and add this code as its content:

options(tidyverse.quiet = TRUE)
library(tidyverse)
library(httr, quietly = TRUE)
library(rvest, quietly = TRUE)

## VIASH START
par <- list(
  inputfile = "Testfile.md",
  domain = "https://viash.io",
  output = "output.txt"
)

## VIASH END

temp_html <- tempfile(fileext = ".html")
on.exit(file.remove(temp_html)) # remove the tempfile when the script exits so it's always cleaned up

# Convert the markdown file to html
rmarkdown::render(
  input = par$inputfile,
  output_format = "html_document",
  output_file = temp_html,
  quiet = TRUE,
  runtime = "static"
)
html <- rvest::read_html(temp_html)

cat("Extracting URLs\n")
urls <- html %>% html_elements("a") %>%
  html_attr("href")
titles <- html %>% html_elements("a") %>%
  html_text()

cat("Checking", length(urls), "URLs\n")
outputs <- map_df(seq_along(urls), function(i) {
  url <- urls[i]
  title <- titles[i]

  # If a URL doesn't start with 'http://' or 'https://', add the domain before it
  if (!grepl("^https?://", url)) {
    url <- paste0(par$domain, url)
  }

  output <- tibble(url, title)

  # Do a web request and get the status code
  output$status <-
    tryCatch({
      code <- status_code(GET(url))
      if (code == 200) {
        "OK"
      } else {
        paste0("ERROR! URL cannot be reached. Status code: ", code)
      }
    },
    error = function(cond) {
      "ERROR! URL does not seem to exist!"
    },
    warning = function(cond) {
      "ERROR! URL caused a warning!"
    })

  output
})

print(outputs)

content <- paste0(
  "Link name: ", outputs$title, "\n",
  "URL: ", outputs$url, "\n",
  outputs$status, "\n",
  "---"
)

write_lines(content, par$output)

cat("")
cat("Input '", par$inputfile, "' has been checked and a report named '", par$output, "' has been generated.\n", sep = "")
cat(sum(outputs$status != "OK"), " out of ", nrow(outputs), " URLs could not be reached.\n", sep = "")

Here’s a breakdown of the code:

  1. The variables placed between ## VIASH START and ## VIASH END are there for debugging purposes. Their final values will be generated dynamically by Viash once the script is turned into a component. If you want to skip testing your script, you can leave these out and Viash will create the variables based on the configuration file. There are three variables inside a list named par:
    • inputfile: The markdown file that needs to be parsed.
    • domain: The domain URL that gets inserted before any relative URLs. For example, “/documentation/intro” could be replaced with “https://my-website/documentation/intro” to create a valid URL.
    • output: The path of the output text file that will contain the report.
  2. The script converts the markdown file to html and extracts the URLs and titles for later use.
  3. map_df() iterates over the hyperlinks and collects one row of results per link.
  4. Any relative URLs (or those that don’t start with “http” at least) will get the domain added before it.
  5. A GET request is used to check for a response from the URL. The resulting status code is stored and compared to the expected code. The results get written to the terminal and the report.
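
Steps 4 and 5 can be tried in isolation without touching the network. The sketch below is not part of script.R: resolve_url() and describe_status() are hypothetical helper names. Unlike the script, resolve_url() also trims a trailing slash from the domain, so a value such as https://viash.io/ doesn't produce a double slash.

```r
# Hypothetical helpers illustrating steps 4 and 5; not part of script.R.

# Step 4: prefix relative URLs with the domain.
resolve_url <- function(url, domain) {
  # Absolute URLs are returned unchanged
  if (grepl("^https?://", url)) {
    return(url)
  }
  domain <- sub("/+$", "", domain) # drop trailing slashes from the domain
  path <- sub("^/*", "/", url)     # make sure the path starts with exactly one slash
  paste0(domain, path)
}

# Step 5: turn an HTTP status code into the text used in the report.
describe_status <- function(code) {
  if (code == 200) {
    "OK"
  } else {
    paste0("ERROR! URL cannot be reached. Status code: ", code)
  }
}

resolved <- resolve_url("/guides/getting_started/installation", "https://viash.io/")
print(resolved)
print(describe_status(404L))
```

Links such as mailto: addresses or in-page anchors (#section) would still be treated as relative paths here, just as they are in the script.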

Test the script

Before turning the script into a component, it’s a good idea to test if it actually works as expected.
As the script expects a markdown file with hyperlinks, create a new file in the script folder named Testfile.md and paste in the following:

# Test File

This is a simple markdown file with some hyperlinks to test if the check_if_URLS_reachable component works correctly.
Some links to websites:

- [Google](https://www.google.com)
- [Reddit](https://www.reddit.com)
- [A broken link](http://microsoft.com/random-link)

Links that are relative to [viash.io](http://www.viash.io):

- You can [install viash here](/guides/getting_started/installation).
- It all starts with a script and a [config file](/api/config/config) for your components.

Now open a terminal in the folder and execute the following command to run the R script:

Rscript script.R

The script will now show the following output:

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Warning message:
In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1

Attaching package: ‘rvest’

The following object is masked from ‘package:readr’:

    guess_encoding

[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'Testfile'
Extracting URLs
Checking 6 URLs
# A tibble: 6 × 3
  url                                                  title              status
  <chr>                                                <chr>              <chr>
1 https://www.google.com                               Google             OK
2 https://www.reddit.com                               Reddit             OK
3 http://microsoft.com/random-link                     A broken link      ERROR! …
4 http://www.viash.io                                  viash.io           OK
5 https://viash.io/guides/getting_started/installation install viash here OK
6 https://viash.io/api/config/config                   config file        ERROR! …
Input 'Testfile.md' has been checked and a report named 'output.txt' has been generated.
2 out of 6 URLs could not be reached.

If you get this same output, the script is working as intended! Feel free to take a peek at the generated output.txt file as well. You might have noticed that you didn’t have to provide any arguments. That’s because the values are hard-coded into the script for debugging purposes.
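
To get a feel for what ends up in output.txt, the report-building step can be reproduced on a small hand-made table. This is a sketch in base R rather than the tidyverse calls the script uses, and the two rows below are made-up sample data, not real check results.

```r
# Made-up sample results standing in for the `outputs` table from script.R
outputs <- data.frame(
  title = c("Example OK link", "Example broken link"),
  url = c("https://example.com/ok", "https://example.com/missing"),
  status = c("OK", "ERROR! URL cannot be reached. Status code: 404"),
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so this builds one multi-line entry per row
content <- paste0(
  "Link name: ", outputs$title, "\n",
  "URL: ", outputs$url, "\n",
  outputs$status, "\n",
  "---"
)

# Write the entries to a temporary file instead of output.txt
report_path <- tempfile(fileext = ".txt")
writeLines(content, report_path)

report_lines <- readLines(report_path)
cat(report_lines, sep = "\n")

# Same kind of count as the summary line printed at the end of the script
n_failed <- sum(outputs$status != "OK")
```

Each link takes up four lines in the report: its name, its URL, its status and a "---" separator.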

Now that the script has been tested, it’s time to create a config file to describe the component based on it.

Describe the component using YAML

A Viash config file is a YAML file that describes the behavior and supported platforms of a Viash component. Create a new file named config.vsh.yaml and paste the following template into it:

functionality:
  name: NAME
  description: DESCRIPTION
  arguments:                     
  - type: string
    name: --input
    description: INPUT DESCRIPTION
  resources:
  - type: LANGUAGE_script
    path: SCRIPT
platforms:
  - type: native

Every config file requires these two dictionaries: functionality and platforms. This bare-bones config file makes it easy to “fill in the blanks” for this example. For more information about config files, take a look at the Config section of the API.

Let’s start off by defining the functionality of our component.

Defining the functionality

The functionality dictionary describes what the component does and the resources it needs to do so. The first key is name, which will be the name of the component once it’s built. Replace the NAME value with md_url_checker_r or any other name of your choosing.

Next up is the description key. Its value will be printed at the top when the --help option is used. Replace DESCRIPTION with “Check if URLs in a markdown are reachable and create a text report with the results.”. You can use multiple lines for a description by starting its value with a pipe (|) and a new line, like so:

functionality:
  name: md_url_checker_r
  description: |
    This is the first line of my description.
    Here's a second line!

The arguments dictionary contains all of the arguments that are accepted by the component. These arguments will be injected as variables in the script. In the case of the example script, these are the variables we’re working with:

  • inputfile
  • domain
  • output

To create good arguments, you need to ask yourself a few essential questions about each variable:

  • What is the most fitting data type?
  • Is it an input or an output?
  • Is it required?

Let’s take a closer look at inputfile for starters:

We know it’s a file, as the script needs the path to a markdown file as its input. It’s also definitely a required variable, as the script would be pointless without it.
With this in mind, modify the first argument as follows:

  • Change type’s value to file.
  • Set name’s value to --inputfile. The name of an argument has to match the variable name, as the argument will be injected into the final script. In the case of R scripts, the variables are added to a list named par.
  • Use “The input markdown file.” for the description value. This description will be included when the --help option is used.
  • Add a new key named required and set its value to true. This ensures that the component will not be run without a value for this argument.
  • Add another key, name it must_exist and set its value to true. This key is unique to file type arguments, it adds extra logic to the component to check if a file exists before running the component. This saves you from having to do this check yourself in the script.
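
For comparison, here is roughly the kind of check that required and must_exist spare you from writing by hand. It is only a sketch: check_input_file() is a hypothetical function, and the error messages are made up rather than being what a Viash component prints.

```r
# Hypothetical manual validation, shown only to illustrate what the
# `required` and `must_exist` keys make unnecessary in your own script.
check_input_file <- function(path) {
  # Equivalent in spirit to `required: true`
  if (is.null(path) || !nzchar(path)) {
    stop("--inputfile is required")
  }
  # Equivalent in spirit to `must_exist: true`
  if (!file.exists(path)) {
    stop("input file '", path, "' does not exist")
  }
  invisible(path)
}

# Try it on a file that exists...
existing <- tempfile(fileext = ".md")
writeLines("# Test File", existing)
ok <- check_input_file(existing)

# ...and on one that doesn't, capturing the error message
missing_msg <- tryCatch(
  check_input_file(file.path(tempdir(), "no_such_file.md")),
  error = function(e) conditionMessage(e)
)
print(missing_msg)
```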

That’s it for the first argument! The result should look like this:

  - type: file
    name: --inputfile
    description: The input markdown file.
    required: true
    must_exist: true

Now for domain, this is a simple optional string that gets added before relative URLs. Make room for a new argument by creating a new line below must_exist: true and press Shift + Tab to back up one tab so the cursor is aligned with the start of the first argument. Add the --domain argument here:

  - type: string                           
    name: --domain
    description: The domain URL that gets inserted before any relative URLs. For example, "/documentation/intro" could be replaced with "https://my-website/documentation/intro" to create a valid URL.

If an argument isn’t required, you can simply omit the required key. Here’s what the arguments dictionary looks like so far:

  arguments:                     
  - type: file
    name: --inputfile
    description: The input markdown file.
    required: true
    must_exist: true
  - type: string                           
    name: --domain
    description: The domain URL that gets inserted before any relative URLs. For example, "/documentation/intro" could be replaced with "https://my-website/documentation/intro" to create a valid URL.

The final variable to create an argument for is output. This is another file and clearly an output. Its value isn’t required as we can use a default path if no explicit value is given.
Add yet another new argument with the following keys and values:

  • Add a type key and set file as its value.
  • The next key is name, use --output as its value.
  • For the description, use “The path of the output text file that will contain the report.”.
  • Add a new key and name it default. This will act as the default value when not specified by the user of the component. Set its value to “output.txt”, including the quotation marks.
  • Finally, add the direction key and set its value to output. This specifies the direction of an argument as either input or output, with input being the default. Specifying that an argument is an output is important so the component can correctly handle the writing of files and the passing of values in a pipeline.

The finished argument should look like this:

  - type: file                           
    name: --output
    description: The path of the output text file that will contain the report.
    default: "output.txt"
    direction: output

With that, there’s just one more part of the functionality to fill in: the script itself!
Every Viash component has one or more resources, the most important of which is often the script. The template already contains a resources dictionary, so replace the following values to point to the script:

  • Set the value of type to r_script. The script used in this case was written in R, so the resource type is set accordingly so Viash knows what flavor of code to generate to create the final component. You can find a full overview of the different resource types on the Functionality page.
  • Change the value of path to script.R. This points to the resource and can be a relative path, an absolute path or even a URL. In this case we keep the script in the same directory as the config file to keep things simple.

That finishes up the functionality side of the component! All that’s left is defining the platforms with their dependencies and then running and building the component.

Defining the platforms

The platforms dictionary specifies the requirements to execute the component on zero or more platforms. The currently supported platforms are Native, Docker, and Nextflow. If no platforms are specified, a native platform is assumed. Here’s a quick overview of the platforms:

  • native: The platform for developers that know what they’re doing or for simple components without any dependencies. All dependencies need to be installed on the system the component is run on.
  • docker: This platform is recommended for most components. The dependencies are resolved by using docker containers, either from scratch or by pulling one from a docker repository. This has huge benefits as the end user doesn’t need to have any of the dependencies installed locally.
  • nextflow: This converts the component into a Nextflow module that can be imported into a pipeline.

In this tutorial, we’ll take a look at both the native and docker platforms. The platforms are also defined in the config.vsh.yaml file at the very bottom. The native platform is actually already defined in the template, that one type key with a value of native is enough! Now for adding the docker platform, add a new line below the last and add the following:

  - type: docker
    image: rocker/tidyverse:latest

This tells Viash that this component can be built to a Docker container with a Rocker image that includes tidyverse as its base. If your script doesn’t depend on any other packages, this would be all you’d have to add when using an R script. The script in our example, however, needs an extra package installed to work. Luckily, this isn’t a problem since Viash supports defining dependencies, which then get installed inside the Docker container before running the script. To add the dependencies that need to be installed, add these lines below image: rocker/tidyverse:latest:

    setup:
    - type: apt
      packages:
        - pandoc

This will prompt the apt package manager to download and install pandoc inside of the container. That’s it for the config! Be sure to save it and let’s move on to actually running the component you’ve created. For reference, you can take a look at the completed config.vsh.yaml file in our GitHub repository.

Run the component

Time to run the component! First off, let’s see what the output of --help is. To do that, open a terminal in the my_viash_component folder and execute the following command:

viash run config.vsh.yaml -- --help

This will show the following:

md_url_checker_r <not versioned>
Check if URLs in a markdown are reachable and create a text report with the results.

Options:
   --inputfile
        type: file, required parameter, file must exist
        The input markdown file.

   --domain
        type: string
        The domain URL that gets inserted before any relative URLs. For example, "/documentation/intro" could be replaced with "https://my-website/documentation/intro" to create a valid URL.

   --output
        type: file, output
        default: output.txt
        The path of the output text file that will contain the report.

As you can see, the values you entered into the config file are all here.
Next, let’s run the component natively with some arguments. You can use one of your own markdown files as the input if you desire. In that case, replace Testfile.md in the command with the path to your file.
Execute the following command to run the component with the default platform, in this case native as it’s the first in the platforms dictionary:

viash run config.vsh.yaml -- --inputfile=Testfile.md --domain=https://viash.io/ --output=my_report.txt

If all goes well, you’ll see something like this output in the terminal and a file named my_report.txt will have appeared:

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Warning message:
In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1

Attaching package: ‘rvest’

The following object is masked from ‘package:readr’:

    guess_encoding

[WARNING] This document format requires a nonempty <title> element.
  Please specify either 'title' or 'pagetitle' in the metadata,
  e.g. by using --metadata pagetitle="..." on the command line.
  Falling back to 'Testfile'
Extracting URLs
Checking 6 URLs
# A tibble: 6 × 3
  url                                                   title              status
  <chr>                                                 <chr>              <chr>
1 https://www.google.com                                Google             OK
2 https://www.reddit.com                                Reddit             OK
3 http://microsoft.com/random-link                      A broken link      ERROR!…
4 http://www.viash.io                                   viash.io           OK
5 https://viash.io//guides/getting_started/installation install viash here OK
6 https://viash.io//api/config/config                   config file        ERROR!…
Input 'Testfile.md' has been checked and a report named 'my_report.txt' has been generated.
2 out of 6 URLs could not be reached.

For more information on the run command, take a look at the Viash run command page. Great! With that working, the next step is building an executable.

Building an executable

You can generate an executable using either the native or the docker platform. The former will generate a file that can be run locally, but depends on your locally installed software packages to work. A docker executable on the other hand can build and start up a docker container that handles the dependencies for you.
To create a native build, execute the following command:

viash build config.vsh.yaml

A new folder named output will have been created with an executable inside named md_url_checker_r. To test it out, execute the following command:

output/md_url_checker_r --inputfile=Testfile.md --domain=https://viash.io/ --output=my_report.txt

The output is the same as when running the component with viash run, but the executable can be shared easily. It accepts the same arguments and has a built-in --help option. Not bad!
Next up is the docker executable. You can specify the platform with the -p argument and choose an output folder using -o, apart from that it’s the same as the previous build command:

viash build -p docker -o docker_output config.vsh.yaml 

You’ll now have a docker_output folder alongside the output one. This folder also contains a file named md_url_checker_r, but its inner workings are slightly different than before. Run md_url_checker_r with the full arguments list to test what happens:

docker_output/md_url_checker_r --inputfile=Testfile.md --domain=https://viash.io/ --output=my_report.txt

Here’s what just happened:

  • If the docker image wasn’t found, Viash will download it.
  • A check is made to see if a container named “md_url_checker_r” exists. If not, one will be built with the image defined in the config as its base.
  • All dependencies defined in the config are taken care of.
  • The script is run with the passed arguments and the output is passed to your shell. The my_report.txt file is written to your working directory.

For more information about the viash build command, take a look at its command page. That concludes the building of executables based on components using Viash!

Writing and running a unit test

To finish off this tutorial, it’s important to talk about unit tests. To ensure that your component works as expected during its development cycle, writing one or more tests is essential. Luckily, writing a unit test for a Viash component is straightforward.

You just need to add test parameters to the config file and write a script that runs the executable and verifies the output. When running tests, Viash will automatically build an executable and place it alongside the other defined resources in a temporary working directory. To get started, open up the config.vsh.yaml file again and add this at the end of the functionality dictionary, between the path: script.R and platforms: lines:

  tests:
  - type: r_script
    path: test.R
  - path: Testfile.md

This test dictionary contains a reference to the test script and all of the files that need to be copied over in order to complete a test. In the case of our example, test.R will be the test script and Testfile.md is necessary as an input markdown file is required for the script to function. Now create a new file named test.R in the my_viash_component folder and add this as its content:

library(testthat)
library(processx)

# check 1
cat(">>> Checking whether output is correct\n")
out <- processx::run("./md_url_checker_r", c("--inputfile", "Testfile.md", "--domain", "https://viash.io"))
expect_equal(out$status, 0)
expect_match(out$stdout, regexp = "https://www.google.com")
expect_match(out$stdout, regexp = "ERROR!")

# check 2
cat(">>> Checking whether output file is correct\n")
output_file <- paste(readLines("output.txt"), collapse="\n")
expect_match(output_file, regexp = "https://www.google.com")
expect_match(output_file, regexp = "ERROR! URL cannot be reached.")
expect_match(output_file, regexp = "Link name: install viash here")

cat(">>> Test finished successfully!\n")

This R script runs the component and performs several checks on its output using processx and testthat. A successful test runs all the way to the end and prints “Test finished successfully!”.

  • processx::run() runs the component and writes its output to a string.
  • All of the expect_match() calls check if a certain piece of text could be found using regex.
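
The matching checks can be tried on their own, without building the component. The sketch below assumes the testthat package is installed and uses a hard-coded string, fake_stdout, as a stand-in for what processx::run() would return in out$stdout. It also shows the fixed = TRUE option of expect_match(), which treats the pattern as literal text so that the dots in a URL aren't interpreted as regex wildcards.

```r
library(testthat)

# Stand-in for out$stdout; made up for this example
fake_stdout <- paste(
  "Checking 2 URLs",
  "1 https://www.google.com Google OK",
  "2 http://microsoft.com/random-link A broken link ERROR!",
  sep = "\n"
)

# Regex match, as in test.R: "." matches any character
expect_match(fake_stdout, regexp = "https://www.google.com")

# Literal match: the pattern must appear exactly as written
expect_match(fake_stdout, regexp = "https://www.google.com", fixed = TRUE)

# The difference shows up on a string where the dots are replaced
lookalike <- "https://wwwXgoogleXcom"
regex_hit <- grepl("https://www.google.com", lookalike)
literal_hit <- grepl("https://www.google.com", lookalike, fixed = TRUE)

# Outside of test_that(), a failing expectation raises an error,
# which is what stops a test script like test.R
failed <- inherits(
  try(expect_match(fake_stdout, "no such text"), silent = TRUE),
  "try-error"
)
```

Here regex_hit is TRUE while literal_hit is FALSE, which is why a literal match can be the safer choice when checking for URLs.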

Make sure both the config and test files are saved, then run a test by running this command:

viash test config.vsh.yaml 

The output will look like this:

Running tests in temporary directory: '/tmp/viash_test_md_url_checker_r1020558234470837119'
====================================================================
+/tmp/viash_test_md_url_checker_r1020558234470837119/test_test.R/test.R
>>> Checking whether output is correct
>>> Checking whether output file is correct
>>> Test finished successfully!
====================================================================
SUCCESS! All 1 out of 1 test scripts succeeded!
Cleaning up temporary directory

If the test succeeds, Viash simply writes the full output to the shell. If there are any issues, the script stops and an error message appears in red. For more information on tests, take a look at the viash test command page.

What’s next?

Now you’re ready to use Viash to create components from your own scripts. Check out the rest of our guides and the API section.