Creating a Scripted Data Pipeline

In this guide you’ll create a simple pipeline using a Viash namespace and Bash.

Creating a namespace

A namespace is a group of components that can be used in a pipeline.

Creating components

The first step is getting the folder structure right. Create a new folder named scripted_pipeline and create another folder in there named src. Now create these folders inside of the src folder:

  • 1_Text_Replace
  • 2_Add_Metadata
  • 3_Post_To_Pastebin

Next, add an empty config.vsh.yaml and a script.sh file to each of these folders. Here’s what your src folder structure should look like:

src
├── 1_Text_Replace
│   ├── config.vsh.yaml
│   └── script.sh
├── 2_Add_Metadata
│   ├── config.vsh.yaml
│   └── script.sh
└── 3_Post_To_Pastebin
    ├── config.vsh.yaml
    └── script.sh
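
If you prefer to create this layout from a shell, the loop below is one way to do it. It is a minimal sketch that assumes you run it from the directory that should contain the scripted_pipeline folder; the project_dir variable is only there for illustration.

```shell
#!/bin/bash
# Sketch: create the three component folders with their empty files.
project_dir="scripted_pipeline"

for component in 1_Text_Replace 2_Add_Metadata 3_Post_To_Pastebin; do
    mkdir -p "$project_dir/src/$component"
    # touch creates a file if it is missing and leaves existing content alone
    touch "$project_dir/src/$component/config.vsh.yaml" \
          "$project_dir/src/$component/script.sh"
done
```

Because mkdir -p and touch leave existing folders and file contents untouched, running the loop a second time is harmless.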

Each component will act as a modular part in the pipeline, doing a specific job:

  • The first component takes a text file as its input and replaces all occurrences of “Lorem Ipsum” with “Foo Bar”. The output is saved to another text file.
  • The next component also takes a text file as its input, copies it to an output file and adds some dummy metadata to the top.
  • The last component uploads the contents of a file to a pastebin website and writes the resulting URL to a text file.

1_Text_Replace

In the 1_Text_Replace folder, open script.sh and add the following:

#!/bin/bash

## VIASH START
par_input="input.txt"
par_search="Lorem Ipsum"
par_replace="Foo Bar"
par_output="output.txt"
## VIASH END

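# Work on a copy so the input file stays untouched
cp "$par_input" "$par_output"
# Only run sed when both a search and a replacement string are set
if [[ "$par_search" != "" && "$par_replace" != "" ]]; then
    sed -i -e "s/$par_search/$par_replace/g" "$par_output"
fi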

Finally, add this configuration yaml to config.vsh.yaml:

functionality:
  name: text_replace
  description: Replace all occurrences of a certain piece of text with another.
  arguments:
  - type: file
    name: input
    must_exist: true
    default: input.txt
  - type: string
    name: --search
    required: true
  - type: string
    name: --replace
    required: true
  - type: file
    name: --output
    direction: output
    default: output.txt
  resources:
  - type: bash_script
    path: script.sh
platforms:
  - type: native
  - type: docker
    image: bash:latest

The text replace component has the following arguments:

  • input: The input file that needs text replaced
  • --search: The string to search for
  • --replace: The string to replace the found string with
  • --output: Path where the output file gets saved to
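
To see what the replace step does before wrapping it in Viash, you can try the same cp and sed logic on throwaway data. This is a minimal sketch, assuming GNU sed (for the -i option) and using a temporary folder and sample text that are made up for illustration.

```shell
#!/bin/bash
# Sketch of the replace step on throwaway data (no Viash involved).
work_dir=$(mktemp -d)
par_input="$work_dir/input.txt"
par_output="$work_dir/output.txt"
par_search="Lorem Ipsum"
par_replace="Foo Bar"

# Sample input, invented for this example
printf 'Lorem Ipsum is placeholder text.\nMore Lorem Ipsum here.\n' > "$par_input"

cp "$par_input" "$par_output"
if [ -n "$par_search" ] && [ -n "$par_replace" ]; then
    # -i edits the copy in place; the g flag replaces every match on a line
    sed -i -e "s/$par_search/$par_replace/g" "$par_output"
fi

cat "$par_output"
```

Note that sed treats the search string as a regular expression, and a "/" in either string would clash with the delimiter of the s command, so this simple form is best suited to plain words.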

2_Add_Metadata

Next, in the 2_Add_Metadata folder, add the following to script.sh:

#!/bin/bash

## VIASH START
par_input="input.txt"
par_author="Me"
par_license="MIT"
par_description="This is a test file."
par_output="output.txt"
## VIASH END

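# Work on a copy so the input file stays untouched
cp "$par_input" "$par_output"

line1="# Author: $par_author"
line2="# License: $par_license"
line3="# Description: $par_description"

# The \n sequences are turned into newlines by sed (GNU sed behaviour)
metadata="$line1\n$line2\n$line3\n\n"

# Insert the metadata before the first line of the output file
sed -i "1s/^/$metadata/" "$par_output"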

Now add this to config.vsh.yaml:

functionality:
  name: add_metadata
  description: Add metadata to the top of a file.
  arguments:
  - type: file
    name: input
    must_exist: true
    default: input.txt
  - type: string
    name: --author
    required: true
  - type: string
    name: --license
    required: true
  - type: string
    name: --description
    required: true
  - type: file
    name: --output
    direction: output
    default: output.txt
  resources:
  - type: bash_script
    path: script.sh
platforms:
  - type: native
  - type: docker
    image: bash:latest

The metadata component has these arguments:

  • input: The input file to use as the base file
  • --author: An author string
  • --license: A license string
  • --description: A description string
  • --output: The path where the finished file, with the metadata added to the top, will be saved
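
The sed-based insert in this component depends on GNU sed and would break if a value contained a "/" or "&". As an alternative, and not the approach this guide's script uses, the same result can be sketched with printf and cat. The temporary folder and sample values below are invented for illustration.

```shell
#!/bin/bash
# Sketch: prepend metadata without sed, so "/" or "&" in the values is safe.
work_dir=$(mktemp -d)
par_input="$work_dir/input.txt"
par_output="$work_dir/output.txt"
par_author="Me"
par_license="MIT"
par_description="A test file with a / and an & in it."

# Sample input, invented for this example
printf 'First line of the body.\n' > "$par_input"

# Write the three metadata lines, a blank line, then the original content
{
    printf '# Author: %s\n' "$par_author"
    printf '# License: %s\n' "$par_license"
    printf '# Description: %s\n\n' "$par_description"
    cat "$par_input"
} > "$par_output"

cat "$par_output"
```

Grouping the commands in braces lets a single redirection write the whole output file, so no in-place editing is needed.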

3_Post_To_Pastebin

Navigate to the 3_Post_To_Pastebin folder and add this to script.sh:

#!/bin/bash

## VIASH START
par_input="input.txt"
par_pastebin_url="dpaste.com"
par_output="output.txt"
## VIASH END

# Upload the file and store the URL that pastebinit prints
pastebinit -i "$par_input" -b "$par_pastebin_url" > "$par_output"

Next, add this to config.vsh.yaml:

functionality:
  name: post_to_pastebin
  description: Upload the contents of a file to a pastebin using pastebinit.
  arguments:
  - type: file
    name: input
    must_exist: true
    default: input.txt
  - type: string
    name: --pastebin_url
    default: "dpaste.com"
  - type: file
    name: --output
    direction: output
    default: output.txt
  resources:
  - type: bash_script
    path: script.sh
platforms:
  - type: native
  - type: docker
    image: ubuntu:latest
    setup:
      - type: apt
        packages: [ ca-certificates, pastebinit ]
      

The pastebin upload component has the following arguments:

  • input: The file to upload
  • --pastebin_url: The URL to upload to. By default this is dpaste.com as it allows anonymous uploads without verification.
  • --output: Path to the file where the URL to the pastebin entry will be stored

Like the other components, this one supports the Docker platform, but because it needs pastebinit to work, it uses an Ubuntu image. Two packages need to be installed via apt for this: ca-certificates and pastebinit. For more information on the setup of the Docker platform, take a look at the Docker platform page.
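
Since the native platform relies on pastebinit being installed locally, it can help to check for it up front. The require_cmd function below is a hypothetical helper written for this sketch; it is not part of Viash or pastebinit.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the given command is on the PATH.
require_cmd() {
    if ! command -v "$1" > /dev/null 2>&1; then
        echo "Missing dependency: $1" >&2
        return 1
    fi
}

if require_cmd pastebinit; then
    echo "pastebinit found, the native platform should work"
else
    echo "pastebinit not found, consider the docker platform instead"
fi
```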

Building the namespace

With the components defined, the next step is building them all in bulk using a namespace. Open a shell in the scripted_pipeline folder (the parent of the src folder) and execute the following command:

viash ns build -l -p docker -f -t bin

Here’s what this ns build command does:

  • -l: Build all of the components found in the src folder in parallel.
  • -p docker: Only build for the Docker platform so we don’t need the dependencies on our local machine.
  • -f: Flatten the result into a single folder that contains all the executables.
  • -t bin: Output the results to a bin folder. The default output folder is target.

Here’s what the full folder structure should look like now:

scripted_pipeline/
├── bin
│   ├── add_metadata
│   ├── text_replace
│   └── post_to_pastebin
└── src
    ├── 1_Text_Replace
    │   ├── config.vsh.yaml
    │   └── script.sh
    ├── 2_Add_Metadata
    │   ├── config.vsh.yaml
    │   └── script.sh
    └── 3_Post_To_Pastebin
        ├── config.vsh.yaml
        └── script.sh
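
A quick way to confirm the build produced what you expect is to check that each executable exists in bin. This is a minimal sketch: check_built is a hypothetical helper, and the names in the usage line assume each executable is named after the name field in its config.vsh.yaml.

```shell
#!/bin/bash
# Hypothetical helper: verify that the named files exist in a folder
# and are executable. Returns 1 if any of them is missing.
check_built() {
    local bin_dir="$1"
    shift
    local missing=0
    for name in "$@"; do
        if [ ! -x "$bin_dir/$name" ]; then
            echo "Missing or not executable: $bin_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage from the scripted_pipeline folder (names are an assumption)
if check_built bin text_replace add_metadata post_to_pastebin; then
    echo "All components were built"
fi
```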

Your first pipeline

To “glue” the components together and create a pipeline, you’ll need to write a Bash script. Create a new file named scripted_pipeline.sh in the scripted_pipeline folder and open it in a file editor. Add this code and save the file:

#!/bin/bash

input_dir="input" # Place files in this folder
output_dir="output"
text_replace="bin/text_replace"
add_metadata="bin/add_metadata"
post_to_pastebin="bin/post_to_pastebin"

mkdir -p "$output_dir"

# Get all files in the input folder and iterate over them
for file in "$input_dir"/*; do
    echo "Processing $file:"
    echo ""

    file_base=$(basename "$file" .txt) # Get the filename without the folder and the .txt extension

    echo "Replacing text...."
    $text_replace   "$file" \
                    --search "Lorem Ipsum" \
                    --replace "Viash" \
                    --output "$output_dir/1_replaced_$file_base.txt"
    echo "Done!"

    echo "Adding metadata..."
    $add_metadata   "$output_dir/1_replaced_$file_base.txt" \
                    --author "Me" \
                    --license "MIT" \
                    --description "A test file" \
                    --output "$output_dir/2_added_metadata_$file_base.txt"
    echo "Done!"

    echo "Uploading to pastebin..."
    $post_to_pastebin   "$output_dir/2_added_metadata_$file_base.txt" \
                        --output "$output_dir/3_pastebin_url_$file_base.txt"
    echo "Done!"

    echo ""
    echo "Pastebin url for $file:"

    cat "$output_dir/3_pastebin_url_$file_base.txt"
    echo ""
done

echo "Finished processing all files"

This script does the following:

  • Make an output directory.
  • Iterate over all the files in the input directory and perform the actions below.
  • Replace every mention of “Lorem Ipsum” with “Viash” and save the output to a file.
  • Take the output of the previous step and add some test metadata. Save the output of this action to a file as well.
  • Upload the contents of the last output file to a pastebin and save the URL to a file.
  • Print the URL to the terminal.

As you can see, each action gets the output of the previous step, processes it and creates a new output.
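
One detail of the loop is worth knowing: when the input folder is empty or missing, the pattern input/* does not expand and the loop runs once with the literal text as its "file". A guard like the one below avoids that. It is a sketch only, and has_inputs is a hypothetical helper that is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the folder contains at least one entry.
has_inputs() {
    for candidate in "$1"/*; do
        # An unmatched glob stays literal, so test that the path really exists
        [ -e "$candidate" ] && return 0
    done
    return 1
}

input_dir="input"
if has_inputs "$input_dir"; then
    echo "Found input files in $input_dir"
else
    echo "No files found in $input_dir, nothing to process" >&2
fi
```

Placing such a check before the loop lets the script stop early with a clear message instead of passing a non-existent path to the first component.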

Running the pipeline

Create a folder named input inside of the scripted_pipeline folder. Next, download and save this input file to that folder:

Download input file

This is a small text file containing an explanation on the placeholder text Lorem Ipsum.
Rename the downloaded file to first_file.txt and create a copy named second_file.txt.

Now execute this command in the scripted_pipeline folder to test out the pipeline:

bash scripted_pipeline.sh 

The output will look like this:

Processing input/first_file.txt:

Replacing text....
Done!
Adding metadata...
Done!
Uploading to pastebin...
Done!

Pastebin url for input/first_file.txt:
http://dpaste.com//XXXXXX

Processing input/second_file.txt:

Replacing text....
Done!
Adding metadata...
Done!
Uploading to pastebin...
Done!

Pastebin url for input/second_file.txt:
http://dpaste.com//XXXXXX

Finished processing all files

What’s next?

That concludes this guide! You can learn more about the discussed topics on these pages: