
Running

This page details the process of submitting your built Packages as Jobs to a cluster managed by Shoc Platform. The primary tool for this is the shoc run command.


The shoc run Command

As previously mentioned, the shoc run command is your main interface for initiating a job. Before submitting, it ensures that a Package for your project has been built: if no Package exists, or if your local code has changed since the last build, shoc run automatically triggers the build process before submitting the job.
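In the simplest case, running the command from your project directory is enough; any required build happens before submission:

# Build the Package if needed, then submit the job from the current directory
shoc run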

Working Directory: Both the shoc build and shoc run commands operate on the current working directory by default. This means they look for build.shoc.yaml and run.shoc.yaml (if applicable) in the directory where you execute the command. You can override this behavior with the -d (or --dir) option to specify a different project directory.

# Run a job from a different directory
shoc run -d /path/to/my/other-project

The run.shoc.yaml Manifest

To submit a job, you will typically use a run.shoc.yaml file alongside your build.shoc.yaml (which defines the Package). This manifest defines the specific parameters and configuration for your job submission.

Here is a comprehensive example of what a run.shoc.yaml manifest can look like, along with explanations for each field:

# mandatory: the cluster for the job to be submitted on. This must match a cluster name configured in your workspace.
cluster: my-cluster
# optional: a friendly name for your job, useful for identification in the UI and CLI lists.
name: my-first-shoc-job
# optional: a description for your job, providing more context for future reference.
description: "Training a small model on a single GPU using my-cluster."
# optional: a set of labels to categorize your job for easier navigation, filtering, and organization.
labels: ['training', 'project-x', 'gpu']
# optional: extra arguments to be passed directly to the entrypoint of your container.
# These arguments are appended after the entrypoint command defined in your build.shoc.yaml.
args: ['--epochs', '10', '--batch-size', '32']
# optional: job replications configuration for creating job arrays.
# By default, a job is a single task (replicas: 1).
array:
  # optional: the number of equivalent tasks (replicas) to submit within this single Shoc Job.
  replicas: 4
  # optional: the name of the environment variable that will contain the index of the current task (0 to replicas-1).
  # Default: SHOC_JOB_ARRAY_INDEXER
  indexer: MY_TASK_INDEX
  # optional: the name of the environment variable that will contain the total number of tasks in the array.
  # Default: SHOC_JOB_ARRAY_COUNT
  counter: MY_TOTAL_TASKS
# optional: secrets and environment variable overrides to inject into the job's container runtime.
env:
  # optional: a list of secret names to fetch from Shoc Platform's secret manager and inject as environment variables.
  use: ['MY_API_KEY', 'DB_PASSWORD']
  # optional: a map of custom environment variables to override or define directly within the job's container.
  override:
    CUSTOM_MODEL_PATH: '/data/models'
    DEBUG_MODE: 'true'
# optional: the amount of resources to be dedicated to each replica (task) of the job.
# These are requests for resources that Kubernetes will try to satisfy.
resources:
  # optional: the amount of CPU requested, e.g., '1000m' for 1 CPU core, '0.5' for half a core.
  cpu: 1000m
  # optional: the amount of memory requested, e.g., '1Gi' for 1 gigabyte, '512Mi' for 512 megabytes.
  memory: 1Gi
  # optional: the amount of Nvidia GPU requested. This requires the NVIDIA GPU Operator to be installed on the cluster.
  nvidiaGpu: 1
# optional: job runtime-specific configurations.
# This section is used for advanced tuning depending on the Package's runtime type (e.g., 'mpi').
spec:
  # This field would contain specific configuration for an MPI job.
  mpi:
    # ... (see MPI-specific configuration below)

As you can see, the run.shoc.yaml manifest offers granular control over various aspects of your job submission, from basic cluster selection to detailed resource allocation and environment variable management. However, for the simplest function runtime jobs, specifying just the cluster might be sufficient.
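For example, a minimal run.shoc.yaml for such a job could contain nothing but the target cluster, leaving every other field at its default:

# minimal run.shoc.yaml: only the mandatory cluster field is set
cluster: my-cluster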

Runtime-Specific Configuration (spec section)

While the top-level run.shoc.yaml fields cover general job parameters, the spec section allows for fine-tuning based on the job’s runtime type (e.g., function or mpi). For function runtimes, typically no spec configuration is needed. However, for more complex scenarios, particularly MPI workloads, the spec section becomes crucial for optimal resource allocation and job behavior.

Example: MPI Runtime Specification

When running a Package built with an mpi template, you can define specific parameters for the MPI runtime within the spec.mpi section of your run.shoc.yaml:

# optional: the job runtime specification based on the type
spec:
  # optional: only if using MPI runtime, this block configures MPI-specific parameters.
  mpi:
    # optional: configure the worker 'nodes' where your MPI tasks (processes) will run.
    workers:
      # optional: you may specify the exact number of worker pods (nodes) you need.
      # Leave blank for dynamic allocation based on total resources or defaults.
      replicas: 1
      # optional: the amount of resources to dedicate to each worker pod.
      # If not given, resources will be deduced from the top-level 'resources' definition or system defaults.
      resources:
        cpu: 4        # Request 4 CPU cores for each worker.
        memory: 4Gi   # Request 4 GB of memory for each worker.
    # optional: configure a dedicated launcher 'node' for your MPI task.
    # The launcher typically coordinates the MPI processes.
    launcher:
      # optional: set to 'true' to use a dedicated pod solely for the MPI launcher.
      # If 'false' or omitted, one of the worker pods will also serve as the launcher.
      dedicated: true
      # optional: specify resources for your launcher node.
      resources:
        cpu: 1        # Request 1 CPU core for the launcher.
        memory: 1Gi   # Request 1 GB of memory for the launcher.

Resource Allocation Note: Resource allocation within MPI jobs can be managed through a combination of the top-level resources definition and the more granular spec.mpi.workers.resources and spec.mpi.launcher.resources. If resources are not explicitly defined at any level, Shoc Platform will apply sensible defaults. The platform provides flexibility to allocate resources either statically (by specifying exact values) or dynamically (by omitting values and letting the system deduce based on available capacity or configured limits).
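As an illustrative sketch of that flexibility (the specific values below are assumptions, not recommendations), you could size the launcher statically while leaving the worker settings blank so that they are deduced from the top-level resources definition:

# illustrative only: static launcher sizing combined with dynamically sized workers
resources:
  cpu: 8        # top-level request the platform can use when sizing workers
  memory: 8Gi
spec:
  mpi:
    workers:
      # replicas and resources are intentionally left blank, so they are
      # deduced from the top-level 'resources' definition or system defaults
    launcher:
      dedicated: true
      resources:
        cpu: 1      # statically allocated for the launcher
        memory: 1Gi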
