
Run a vLLM Server on HF Jobs in One Command

Learn how to launch a vLLM inference server on Hugging Face Jobs with a single command. This guide covers setup, configuration, and practical tips for serving open models efficiently.

Deploying large language models (LLMs) for inference is a critical step in moving from experimentation to production. vLLM is an inference engine that serves models efficiently through request batching and careful memory management. Hugging Face Jobs (HF Jobs) provides a managed cloud environment in which to run such workloads. This article shows how to combine the two and launch a vLLM server with a single command.

What You Will Learn

By the end of this article, you will understand:

  • How vLLM accelerates LLM inference.
  • The prerequisites for using HF Jobs.
  • Step-by-step installation and configuration of a vLLM server.
  • The single command that launches the server.
  • Real-world usage examples and best practices.

Background: vLLM and HF Jobs

**vLLM** is an open-source inference engine designed for high throughput and low latency. It uses techniques like PagedAttention to manage key-value cache memory efficiently, enabling you to serve models like Llama, Mistral, and others with minimal overhead.
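To make the memory pressure that PagedAttention addresses more concrete, here is a rough estimate of how much key-value cache a single token occupies. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumed from Mistral 7B's published configuration, and the function name is illustrative rather than part of vLLM.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    Each layer stores one key vector and one value vector per KV head,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence_gib = per_token * 8192 / 2**30  # one 8,192-token sequence

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence_gib:.1f} GiB for one 8,192-token sequence")
```

Under these assumptions a single long sequence already takes about 1 GiB, which is why allocating the cache in small pages rather than one contiguous maximum-length buffer per request matters when many requests are served at once.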

**Hugging Face Jobs** is a managed service that runs your code on scalable cloud infrastructure. It integrates seamlessly with the Hugging Face Hub, allowing you to pull models directly and execute commands without managing servers yourself. The platform supports GPU instances, making it ideal for LLM inference.

Combining vLLM with HF Jobs gives you a serverless-like experience: you specify a model, a command, and the resources needed, and the platform handles the rest.

Requirements

Before proceeding, ensure you have the following:

1. **A Hugging Face Account** – Sign up at [huggingface.co](https://huggingface.co). You’ll need an access token with write permissions to create jobs.
2. **HF CLI Installed** – Install the Hugging Face CLI locally to interact with the Hub and Jobs API. Use `pip install huggingface-hub`.
3. **A vLLM-compatible Model** – vLLM works with models like Mistral 7B and Llama 2/3, along with the other architectures listed in its documentation. Ensure the model is publicly accessible or that you have been granted access to it.
4. **Basic Familiarity with Docker and CLI** – HF Jobs uses Docker containers, and you will run commands from your terminal.

Step-by-Step Installation and Configuration

1. Set Up Your Environment

First, install the Hugging Face CLI and authenticate. Recent releases of `huggingface_hub` ship the `hf` command, which includes the `jobs` subcommands:

pip install -U huggingface_hub
hf auth login

Enter your access token when prompted. This token is stored locally and used for all subsequent API calls.

2. Understand HF Jobs Structure

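HF Jobs does not need a separate job definition file. Everything is passed to `hf jobs run` on the command line, or to `run_job` in the `huggingface_hub` Python SDK. For a single-command launch, we’ll use the CLI directly.

A job is described by:

  • The Docker image, given as the first positional argument (here, one with vLLM installed).
  • The command to run inside the container, given after the image.
  • `--flavor`: The hardware to run on (e.g., `a10g-small` for a single A10G GPU).
  • `-e` / `--env`: Environment variables, as `KEY=VALUE`.
  • `-s` / `--secrets`: Sensitive values such as `HF_TOKEN`.
  • `--timeout`: The maximum run time (e.g., `2h`). Jobs stop after a default timeout, so a long-lived server needs a larger value.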
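As a sketch of how these pieces fit together, the helper below assembles the argument list for a job launch. It assumes the `hf jobs run [options] IMAGE COMMAND...` form of the current `hf` CLI, with `--flavor`, `-e`, and `-s` as option names; treat those names as assumptions and confirm them with `hf jobs run --help`. The helper `build_hf_jobs_command` is hypothetical and not part of `huggingface_hub`.

```python
import shlex
from typing import Dict, List, Optional


def build_hf_jobs_command(image: str, command: List[str], flavor: str,
                          env: Optional[Dict[str, str]] = None,
                          secrets: Optional[List[str]] = None) -> List[str]:
    """Assemble an argv list for `hf jobs run` (option names are assumptions)."""
    argv = ["hf", "jobs", "run", "--flavor", flavor]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    for name in secrets or []:
        argv += ["-s", name]
    # Options come first, then the image, then the command run in the container.
    return argv + [image] + command


argv = build_hf_jobs_command(
    image="vllm/vllm-openai:latest",
    command=["python", "-m", "vllm.entrypoints.openai.api_server",
             "--model", "mistralai/Mistral-7B-Instruct-v0.2", "--port", "8000"],
    flavor="a10g-small",
)
print(shlex.join(argv))  # paste into a terminal, or pass argv to subprocess.run
```

Building the list in code rather than by string concatenation keeps quoting correct when a model ID or environment value contains characters the shell would otherwise interpret.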

3. Choose or Build a Docker Image

You can use the official vLLM Docker image from Docker Hub or build your own. For simplicity, we’ll use `vllm/vllm-openai:latest`, which includes the OpenAI-compatible API server.

For reproducible deployments, pin a specific release tag such as `v0.4.0` instead of `latest`. The available releases are listed on the [vLLM GitHub](https://github.com/vllm-project/vllm).

4. Write a Simple Python Entry Point (Optional)

If you need custom logic, you can write a Python script. But for a one-command launch, the vLLM server itself is the entry point. The command will be:

python -m vllm.entrypoints.openai.api_server --model <model-name> --port 8000

5. Launch the Job with One Command

Now, combine everything into a single HF Jobs command. The example serves `mistralai/Mistral-7B-Instruct-v0.2`; substitute any model ID that vLLM supports.

hf jobs run \
  --flavor a10g-small \
  --timeout 2h \
  vllm/vllm-openai:latest \
  python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000

**Explanation:**

  • `--flavor`: The hardware for the job. `a10g-small` provides one A10G GPU with 24 GB of memory (adjust based on model size).
  • `--timeout`: How long the job may run before it is stopped.
  • `vllm/vllm-openai:latest`: The Docker image containing vLLM and dependencies.
  • Everything after the image: The exact command to run. vLLM’s API server starts on port 8000.

After running this, HF Jobs will:

1. Pull the Docker image.
2. Allocate a GPU instance.
3. Run the command inside the container.
4. Expose an endpoint (you’ll see the URL in the job logs).

6. Monitor the Job

Check the job status and logs, using the job ID that is printed when the job starts and listed by `hf jobs ps`:

hf jobs ps
hf jobs logs <job-id>

Once the server is ready, you’ll see output like `Uvicorn running on http://0.0.0.0:8000`. The actual public URL will be provided in the job details.

Usage Examples

Example 1: Basic Text Generation

Once the server is running, you can send requests to the OpenAI-compatible endpoint. Use `curl` or any HTTP client:

curl http://<job-url>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
  }'

**Response:**

{
  "id": "cmpl-...",
  "object": "text_completion",
  "choices": [{"text": " The capital of France is Paris."}]
}
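The same request can be made from Python using only the standard library. This is a minimal sketch: the base URL is a placeholder for your job's address, the helper name is illustrative, and the actual network call is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    base_url="http://localhost:8000",  # placeholder; replace with your job's URL
    model="mistralai/Mistral-7B-Instruct-v0.2",
    prompt="What is the capital of France?",
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=60) as response:
#     body = json.load(response)
#     print(body["choices"][0]["text"])
```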

Example 2: Chat Completions

For chat-based models, use the `/v1/chat/completions` endpoint:

curl http://<job-url>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7
  }'
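A chat response nests the generated text inside `choices[0].message.content`, following the OpenAI chat-completions format. The sketch below pulls that text out; the sample dictionary is hand-written in that shape for illustration and is not real server output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion body."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the chat-completions shape (not real output).
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Quantum computers use qubits."},
            "finish_reason": "stop",
        }
    ],
}
print(extract_reply(sample))
```

Checking for an empty `choices` list up front turns a confusing `IndexError` into a clear message when the server returns an error body instead of a completion.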

Example 3: Scaling with Multiple GPUs

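For larger models like Llama 3 70B, a single GPU is not enough: the 16-bit weights alone take about 140 GB. Use tensor parallelism on a multi-GPU flavor:

hf jobs run \
  --flavor l40sx4 \
  --timeout 2h \
  --secrets HF_TOKEN \
  vllm/vllm-openai:latest \
  python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --port 8000

Here, `--tensor-parallel-size 4` splits the model across 4 GPUs, so it must match the number of GPUs in the flavor. `l40sx4` (four 48 GB L40S GPUs) is used as an example; check which flavors are currently offered before launching. The Llama 3 weights are gated, which is why the job receives your `HF_TOKEN` as a secret.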
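Two quick checks help when choosing a tensor-parallel size: how much weight memory lands on each GPU, and whether the attention heads divide evenly across GPUs, which vLLM requires. The sketch below assumes roughly 70 billion parameters and 64 attention heads for Llama 3 70B (taken from its published configuration); both helper names are illustrative.

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float,
                      tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param / tensor_parallel_size


def heads_divide_evenly(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    """True if every GPU receives the same number of attention heads."""
    return num_attention_heads % tensor_parallel_size == 0


# Assumed Llama 3 70B shape: ~70B parameters, 64 attention heads, 16-bit weights.
share = per_gpu_weight_gb(params_billion=70, bytes_per_param=2, tensor_parallel_size=4)
print(f"{share:.0f} GB of weights per GPU")
print("heads divide evenly:", heads_divide_evenly(64, 4))
```

The per-GPU figure covers weights only. The KV cache and activations need additional headroom on each GPU, so a card whose memory barely exceeds this number will still run out.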

Example 4: Using Environment Variables

You can pass environment variables for configuration with `-e`, and sensitive values such as access tokens with `-s`:

hf jobs run \
  --flavor a10g-small \
  -e VLLM_LOGGING_LEVEL=INFO \
  --secrets HF_TOKEN \
  vllm/vllm-openai:latest \
  python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000

**Note:** Prefer `-s` for tokens. Values passed as plain environment variables may be visible in the job details and logs.

Best Practices and Troubleshooting

Choosing the Right Model

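  • **Model size vs. GPU memory:** In 16-bit precision, weights take about 2 bytes per parameter, so a 7B model needs roughly 14 GB for weights alone. It fits on a 16 GB GPU (e.g., T4, V100) only with a short context; a 24 GB GPU leaves more room for the KV cache. 13B models need about 26 GB for weights, and 70B models require multiple GPUs.
  • **Quantization:** Use a quantized checkpoint with `--quantization awq` to reduce memory. `--dtype half` selects 16-bit floating point, which saves memory only relative to 32-bit but is needed on GPUs without bfloat16 support, such as T4 and V100. For example:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --port 8000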
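The sizing rule behind these figures is simple arithmetic: parameters times bits per parameter. The sketch below applies it to a few cases. It ignores quantization metadata, the KV cache, and activations, so read the results as lower bounds; the helper name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


cases = [
    ("7B, 16-bit", 7, 16),
    ("13B, 16-bit", 13, 16),
    ("7B, 4-bit (e.g. AWQ)", 7, 4),
]
for label, params_billion, bits in cases:
    print(f"{label}: ~{weight_memory_gb(params_billion, bits):.1f} GB")
```

A 4-bit checkpoint cuts weight memory to about a quarter of the 16-bit size, and the memory it frees is what vLLM can then give to the KV cache for longer contexts or more concurrent requests.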

Handling Timeouts

The first run may take several minutes to download the model. Set a longer timeout in your client if needed.
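Rather than guessing a timeout, a client can poll until the server answers. The sketch below separates the polling loop from the HTTP probe so the loop can be exercised without a network. It assumes vLLM's API server answers `GET /health` with status 200 once it is ready; the URL is a placeholder and both function names are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], max_wait: float = 900.0,
                     interval: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it succeeds or `max_wait` seconds have been slept."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Demo with a fake probe that succeeds on the third attempt (no real waiting).
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), sleep=lambda seconds: None))

# Against a live server (placeholder URL):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```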

Security

  • Do not put your HF token in the job command, where it can end up in shell history and logs. Pass it through the platform's secrets mechanism instead.
  • The server endpoint is public by default. For production, add authentication, for example with vLLM's `--api-key` option or a reverse proxy.
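If the server is started with vLLM's `--api-key` option, clients must send the key as a bearer token. The sketch below reads the key from an environment variable instead of hard-coding it. The variable name `VLLM_API_KEY` on the client side and the helper name are choices made for this example, and the key set in the demo line is a dummy value.

```python
import os
from typing import Dict


def auth_headers(env_var: str = "VLLM_API_KEY") -> Dict[str, str]:
    """Build request headers with a bearer token taken from the environment."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before calling the server")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }


# Demo only: a dummy key so the sketch runs; never commit a real key.
os.environ.setdefault("VLLM_API_KEY", "example-key")
headers = auth_headers()
print(sorted(headers))  # print header names only, never the key itself
```

Failing loudly when the variable is missing avoids sending unauthenticated requests by accident and then debugging a 401 response.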

Logs and Debugging

If the job fails, check its logs using the job ID that was printed at launch:

hf jobs logs <job-id>

Common errors:

  • **Out of memory:** Use a GPU with more memory, a smaller or quantized model, or a lower `--max-model-len` to shrink the KV cache.
  • **Model not found:** Ensure the model ID is correct and accessible.
  • **Port conflict:** Change the port with `--port <new-port>`.
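When logs are long, a small script can flag these three failure modes. The sketch below matches substrings that typically appear in PyTorch, Transformers, and socket error messages. The exact wording varies between versions, so the markers are assumptions to adapt to your own logs, and the sample line is hand-written rather than real output.

```python
from typing import List

# (substring to look for, hint to show); markers are typical, not guaranteed.
HINTS = [
    ("CUDA out of memory",
     "Out of memory: use a larger GPU, a smaller model, or quantization."),
    ("is not a valid model identifier",
     "Model not found: check the model ID and your access rights."),
    ("Address already in use",
     "Port conflict: change the port with --port."),
]


def triage(log_text: str) -> List[str]:
    """Return a hint for every known error marker found in the log text."""
    lowered = log_text.lower()
    return [hint for marker, hint in HINTS if marker.lower() in lowered]


# Hand-written sample line for illustration.
sample_log = "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB"
for hint in triage(sample_log):
    print(hint)
```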

Conclusion

Running a vLLM server on HF Jobs with a single command simplifies the deployment of LLM inference workloads. By leveraging vLLM’s efficient engine and HF Jobs’ managed infrastructure, you can serve models like Mistral and Llama in minutes without managing servers. The key steps are:

1. Authenticate with the Hugging Face CLI.
2. Use a pre-built vLLM Docker image.
3. Write a concise command that starts the API server.
4. Launch the job with GPU resources.

This approach is ideal for prototyping, testing, and even production-scale inference, especially when combined with scaling features like tensor parallelism. As new open models are released, the same command works with a different model ID, provided vLLM supports the architecture.

Try it today: pick a model, run the command, and start generating text in minutes.
