Creating a Scripted Data Pipeline
In this guide you’ll create a simple pipeline using a Viash namespace and Bash.
Creating a namespace
A namespace is a group of components that can be used in a pipeline.
Creating components
The first step is getting the folder structure right. Create a new folder named scripted_pipeline and create another folder in there named src. Now create these folders inside of the src folder:
- 1_Text_Replace
- 2_Add_Metadata
- 3_Post_To_Pastebin
Next, add an empty config.vsh.yaml and a script.sh file to each of these folders. Here’s what your src folder structure should look like:
src
├── 1_Text_Replace
│   ├── config.vsh.yaml
│   └── script.sh
├── 2_Add_Metadata
│   ├── config.vsh.yaml
│   └── script.sh
└── 3_Post_To_Pastebin
    ├── config.vsh.yaml
    └── script.sh
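The same layout can be scaffolded from a shell. This is a minimal sketch that assumes it is run from the directory in which the scripted_pipeline folder should be created; it relies only on mkdir, touch and find.

```shell
#!/bin/bash
# Create the three component folders, each with an empty config and script
for component in 1_Text_Replace 2_Add_Metadata 3_Post_To_Pastebin; do
  mkdir -p "scripted_pipeline/src/$component"
  touch "scripted_pipeline/src/$component/config.vsh.yaml" \
        "scripted_pipeline/src/$component/script.sh"
done

# List the result so it can be compared against the tree above
find scripted_pipeline/src | sort
```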
Each component will act as a modular part in the pipeline, doing a specific job:
- The first component takes a text file as its input and replaces all occurrences of “Lorem Ipsum” with “Foo Bar”. The output is saved to another text file.
- The next component also takes a text file as its input, copies it to an output file and adds some dummy metadata.
- The last component uploads the contents of a file to a pastebin website and writes the resulting URL to a text file.
1_Text_Replace
Open script.sh in the 1_Text_Replace folder and add the following:
#!/bin/bash

## VIASH START
par_input="input.txt"
par_search="Lorem Ipsum"
par_replace="Foo Bar"
par_output="output.txt"
## VIASH END

# Work on a copy so the input file is left untouched
cp "$par_input" "$par_output"

# Replace every occurrence of the search string in the copy
if [[ -n "$par_search" && -n "$par_replace" ]]; then
  sed -i -e "s/$par_search/$par_replace/g" "$par_output"
fi
Finally, add this configuration yaml to config.vsh.yaml:
functionality:
  name: text_replace
  description: Replace all occurrences of a certain piece of text with another.
  arguments:
    - type: file
      name: input
      must_exist: true
      default: input.txt
    - type: string
      name: --search
      required: true
    - type: string
      name: --replace
      required: true
    - type: file
      name: --output
      direction: output
      default: output.txt
  resources:
    - type: bash_script
      path: script.sh
platforms:
  - type: native
  - type: docker
    image: bash:latest
The text replace component has the following arguments:
- input: The input file that needs text replaced
- --search: The string to search for
- --replace: The string to replace the found string with
- --output: Path where the output file gets saved to
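To see what the script does without building anything, the copy-then-replace step can be tried in isolation. The sketch below is not part of the component: the temporary directory and sample text are made up for illustration, and sed -i is assumed to be GNU sed, as found on most Linux systems.

```shell
#!/bin/bash
# Hypothetical scratch files, used only for this demonstration
work_dir=$(mktemp -d)
input="$work_dir/input.txt"
output="$work_dir/output.txt"
printf 'Lorem Ipsum is placeholder text.\nLorem Ipsum appears twice.\n' > "$input"

search="Lorem Ipsum"
replace="Foo Bar"

# Same two steps as script.sh: copy the input, then edit the copy in place
cp "$input" "$output"
sed -i -e "s/$search/$replace/g" "$output"

# Show the result; the input file itself is unchanged
cat "$output"
```

Because the search string is pasted into a sed expression, characters such as / or . in it are interpreted by sed rather than matched literally.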
2_Add_Metadata
Next, in the 2_Add_Metadata folder, replace the contents of script.sh with the following:
#!/bin/bash

## VIASH START
par_input="input.txt"
par_author="Me"
par_license="MIT"
par_description="This is a test file."
par_output="output.txt"
## VIASH END

# Start from a copy of the input file
cp "$par_input" "$par_output"

# Build the metadata header (GNU sed expands each \n to a newline)
line1="# Author: $par_author"
line2="# License: $par_license"
line3="# Description: $par_description"
metadata="$line1\n$line2\n$line3\n\n"

# Insert the header before the first line of the copy
sed -i "1s/^/$metadata/" "$par_output"
Now add this to config.vsh.yaml:
functionality:
  name: add_metadata
  description: Add metadata to the top of a file.
  arguments:
    - type: file
      name: input
      must_exist: true
      default: input.txt
    - type: string
      name: --author
      required: true
    - type: string
      name: --license
      required: true
    - type: string
      name: --description
      required: true
    - type: file
      name: --output
      direction: output
      default: output.txt
  resources:
    - type: bash_script
      path: script.sh
platforms:
  - type: native
  - type: docker
    image: bash:latest
The metadata component has these arguments:
- input: The input file to use as the base file
- --author: An author string
- --license: A license string
- --description: A description string
- --output: The path where the finished file, with the metadata added to the top, will be saved
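The sed expression in script.sh breaks if a metadata value contains a slash, since / is also the expression's delimiter. One alternative, shown here as a sketch rather than as this guide's method, writes the header with printf and then appends the original file. The function name prepend_metadata and the sample values are hypothetical.

```shell
#!/bin/bash
# prepend_metadata INPUT OUTPUT AUTHOR LICENSE DESCRIPTION
# Writes a three-line comment header, a blank line, then the input file.
prepend_metadata() {
  local input=$1 output=$2 author=$3 license=$4 description=$5
  {
    printf '# Author: %s\n' "$author"
    printf '# License: %s\n' "$license"
    printf '# Description: %s\n\n' "$description"
    cat "$input"
  } > "$output"
}

# Demonstration on a scratch file; the description deliberately contains a slash
work_dir=$(mktemp -d)
printf 'Hello world\n' > "$work_dir/input.txt"
prepend_metadata "$work_dir/input.txt" "$work_dir/output.txt" \
  "Me" "MIT" "Text from docs/example.txt"
cat "$work_dir/output.txt"
```

The values are passed as printf arguments instead of being spliced into a pattern, so no character in them needs escaping.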
3_Post_To_Pastebin
Navigate to the 3_Post_To_Pastebin folder and add this to script.sh:
#!/bin/bash
## VIASH START
par_input="input.txt"
par_pastebin_url="dpaste.com"
par_output="output.txt"
## VIASH END
# Upload the file and store the URL that pastebinit prints
pastebinit -i "$par_input" -b "$par_pastebin_url" > "$par_output"
Next, add this to config.vsh.yaml:
functionality:
  name: post_to_pastebin
  description: Upload the contents of a file to a pastebin using pastebinit.
  arguments:
    - type: file
      name: input
      must_exist: true
      default: input.txt
    - type: string
      name: --pastebin_url
      default: "dpaste.com"
    - type: file
      name: --output
      direction: output
      default: output.txt
  resources:
    - type: bash_script
      path: script.sh
platforms:
  - type: native
  - type: docker
    image: ubuntu:latest
    setup:
      - type: apt
        packages: [ ca-certificates, pastebinit ]
The pastebin upload component has the following arguments:
- input: The file to upload
- --pastebin_url: The URL to upload to. By default this is dpaste.com as it allows anonymous uploads without verification.
- --output: Path to the file where the URL to the pastebin entry will be stored
Like the other components, this one supports the Docker platform, but because it needs pastebinit to work, it uses an Ubuntu image. Two packages are installed via apt as prerequisites: ca-certificates and pastebinit. For more information on the setup of the Docker platform, take a look at the Docker platform page.
Building the namespace
With the components defined, the next step is building them all in bulk using a namespace. Open a shell in the scripted_pipeline folder (one level above the src folder) and execute the following command:
viash ns build -l -p docker -f -t bin
Here’s what this ns build command does:
- -l: Build all of the components found in the src folder in parallel.
- -p docker: Only build for the Docker platform so we don’t need the dependencies on our local machine.
- -f: Flatten the result to a single folder with all the executables in it.
- -t bin: Output the results to a bin folder. The default output folder is target.
Here’s what the full folder structure should look like now:
scripted_pipeline/
├── bin
│   ├── add_metadata
│   ├── post_to_pastebin
│   └── text_replace
└── src
    ├── 1_Text_Replace
    │   ├── config.vsh.yaml
    │   ├── input.txt
    │   └── script.sh
    ├── 2_Add_Metadata
    │   ├── config.vsh.yaml
    │   └── script.sh
    └── 3_Post_To_Pastebin
        ├── config.vsh.yaml
        └── script.sh
Your first pipeline
To “glue” the components together and create a pipeline, you’ll need to write a Bash script. Create a new file named scripted_pipeline.sh in the scripted_pipeline folder and open it in a file editor. Add this code and save the file:
#!/bin/bash

input_dir="input" # Place files in this folder
output_dir="output"

# Executables are named after the name field in each component's config
text_replace="bin/text_replace"
add_metadata="bin/add_metadata"
post_to_pastebin="bin/post_to_pastebin"

mkdir -p "$output_dir"

# Get all files in the input folder and iterate over them
for file in "$input_dir"/*; do
  echo "Processing $file:"
  echo ""

  # Get the filename without the directory and the extension
  file_base=$(basename "$file")
  file_base="${file_base%.*}"

  echo "Replacing text...."
  "$text_replace" "$file" \
    --search "Lorem Ipsum" \
    --replace "Viash" \
    --output "$output_dir/1_replaced_$file_base.txt"
  echo "Done!"

  echo "Adding metadata..."
  "$add_metadata" "$output_dir/1_replaced_$file_base.txt" \
    --author "Me" \
    --license "MIT" \
    --description "A test file" \
    --output "$output_dir/2_added_metadata_$file_base.txt"
  echo "Done!"

  echo "Uploading to pastebin..."
  "$post_to_pastebin" "$output_dir/2_added_metadata_$file_base.txt" \
    --output "$output_dir/3_pastebin_url_$file_base.txt"
  echo "Done!"

  echo ""
  echo "Pastebin url for $file:"
  cat "$output_dir/3_pastebin_url_$file_base.txt"
  echo ""
done

echo "Finished processing all files"
This script does the following:
- Make an output directory.
- Iterate over all the files in the input directory and perform the actions below.
- Replace every mention of “Lorem Ipsum” with “Viash” and save the output to a file.
- Take the output of the previous step and add some test metadata. Save the output of this action to a file as well.
- Upload the contents of the last output file to a pastebin and save the URL to a file.
- Print the URL to the terminal.
As you can see, each action gets the output of the previous step, processes it and creates a new output.
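The chaining pattern itself does not depend on Viash. As a rough sketch, the stand-in functions below (step_replace and step_add_header are hypothetical names, not the built components) show how each step reads the file written by the previous one. The sketch also enables set -e so the run stops at the first failing step, which the pipeline script above does not do.

```shell
#!/bin/bash
set -euo pipefail  # Abort on the first failing step or unset variable

# Stand-ins for the real executables; each takes INPUT and OUTPUT paths
step_replace()    { sed 's/Lorem Ipsum/Viash/g' "$1" > "$2"; }
step_add_header() { { printf '# Author: Me\n\n'; cat "$1"; } > "$2"; }

# Scratch input and output folders with one made-up input file
work_dir=$(mktemp -d)
mkdir -p "$work_dir/input" "$work_dir/output"
printf 'Lorem Ipsum is placeholder text.\n' > "$work_dir/input/first_file.txt"

for file in "$work_dir"/input/*; do
  # Strip the directory first, then the extension: first_file.txt -> first_file
  file_base=$(basename "$file")
  file_base="${file_base%.*}"

  step1="$work_dir/output/1_replaced_$file_base.txt"
  step2="$work_dir/output/2_added_metadata_$file_base.txt"

  step_replace "$file" "$step1"      # input file    -> step 1 output
  step_add_header "$step1" "$step2"  # step 1 output -> step 2 output

  cat "$step2"
done
```

Keeping every intermediate file, instead of piping one step into the next, makes it easy to inspect where a run went wrong.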
Running the pipeline
Create a folder named input inside of the scripted_pipeline folder. Next, download and save this input file to that folder:
This is a small text file containing an explanation on the placeholder
text Lorem Ipsum.
Rename the downloaded file to first_file.txt and create a copy named second_file.txt.
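If the download is not at hand, stand-in input files can be created from the shell. The sample sentence below is made up for illustration and is not the content of the downloadable file; the file names match those in the pipeline's output listing.

```shell
#!/bin/bash
# Run from the scripted_pipeline folder
mkdir -p input

# Hypothetical sample text containing the phrase the pipeline replaces
printf 'Lorem Ipsum is placeholder text used in publishing.\n' > input/first_file.txt

# The second input is a plain copy of the first
cp input/first_file.txt input/second_file.txt

ls input
```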
Now execute this command in the scripted_pipeline folder to test out the pipeline:
bash scripted_pipeline.sh
The output will look like this:
Processing input/first_file.txt:
Replacing text....
Done!
Adding metadata...
Done!
Uploading to pastebin...
Done!
Pastebin url for input/first_file.txt:
http://dpaste.com//XXXXXX
Processing input/second_file.txt:
Replacing text....
Done!
Adding metadata...
Done!
Uploading to pastebin...
Done!
Pastebin url for input/second_file.txt:
http://dpaste.com//XXXXXX
Finished processing all files
What’s next?
That concludes this guide! You can learn more about the discussed topics on these pages: