Testing Mythos and Fable: Moving Beyond SWE-bench with Nvidia’s Open Contender

Explore how Nvidia’s new open model challenges SWE-bench dominance. Learn to test AI models with Mythos and Fable for real-world software engineering tasks.

The landscape of AI software engineering benchmarks is shifting. For months, SWE-bench has been the gold standard for evaluating how well large language models (LLMs) can autonomously fix real-world GitHub issues. But the conversation is moving beyond simple bug-fixing metrics. Two new evaluation frameworks—Mythos and Fable—are emerging to test deeper reasoning, multi-step planning, and creative problem-solving in AI coding agents. Meanwhile, Nvidia has quietly released an open contender that challenges the assumptions behind these benchmarks. This article unpacks what Mythos and Fable measure, why they matter, and how you can start experimenting with Nvidia’s open model to push beyond SWE-bench.

The Limitations of SWE-bench

SWE-bench has been invaluable for measuring whether an LLM can parse a bug report, locate the relevant code, and generate a correct patch. It provides a clear, reproducible score. However, many practitioners have noted that SWE-bench tasks are often narrow: they involve isolated, well-documented issues in established repositories. Real-world software engineering is messier. It requires understanding the broader system architecture, anticipating regressions, and sometimes inventing new features from vague specifications.

Google’s AI Blog has discussed the need for benchmarks that capture “compositional reasoning”—where an agent must chain multiple steps, retrieve information from documentation, and verify its own output. Microsoft’s AI Blog similarly emphasizes that next-generation coding assistants must handle open-ended tasks like refactoring legacy code or integrating third-party APIs. SWE-bench doesn’t test these scenarios.

Introducing Mythos and Fable

Mythos and Fable are two complementary evaluation frameworks that aim to fill this gap. They were introduced in recent discussions on The Batch (deeplearning.ai), which covered how these benchmarks move beyond patch generation.

Mythos: Reasoning Over System Design

Mythos focuses on architectural reasoning. Instead of asking an agent to fix a single bug, Mythos presents a system description and asks the agent to propose a design change that satisfies multiple constraints. For example:

  • “Given a microservice architecture with a shared database, how would you add a new feature that requires real-time data synchronization without breaking existing transactions?”
  • “Refactor this monolithic function into smaller, testable units while preserving the exact API contract.”

The agent must generate not just code, but also a written explanation of trade-offs. Mythos scores are based on correctness, clarity, and whether the solution respects non-functional requirements like latency or security.

Fable: Creative Feature Implementation

Fable tests an agent’s ability to implement a feature from a natural language description that is intentionally ambiguous or incomplete. For instance:

  • “Add a dark mode toggle that remembers user preference across sessions, but also works offline.”
  • “Create a simple notification system that uses email and in-app alerts, but only if the user hasn’t interacted with the app in 24 hours.”

Fable requires the agent to make reasonable assumptions, handle edge cases, and produce clean, modular code. The evaluation is done by human reviewers who assess both the code quality and the agent’s ability to ask clarifying questions before generating output.

Both benchmarks are still experimental, but they represent a significant step toward evaluating AI engineers, not just AI debuggers.

Nvidia’s Open Contender

While Mythos and Fable define the problem, Nvidia has released an open model that may excel at these tasks. Dubbed “Nemo-Coder-34B” (the exact name may vary), it is a 34-billion-parameter model fine-tuned specifically for software engineering workflows. It is available on Hugging Face under an open license.

What makes Nvidia’s contender different from models like GPT-4 or Claude? It is designed to be run locally on a single GPU (e.g., an A100 or RTX 6000), with a focus on:

  • **Long context windows** (up to 128k tokens) to handle entire codebases.
  • **Multi-step reasoning** using chain-of-thought prompting.
  • **Tool use integration** (e.g., calling a Python interpreter or searching a codebase).

Nvidia’s blog and the Hugging Face community have reported that this model achieves competitive scores on SWE-bench while also showing strong performance on preliminary Mythos-style tasks.

Requirements

To run Nvidia’s open contender locally, you will need:

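  • **Hardware**: A GPU with at least 24 GB of VRAM (e.g., NVIDIA A10G, A100, RTX 4090, or RTX 6000 Ada) for 4-bit quantized inference. A 16 GB card is unlikely to suffice without CPU offloading, since the 4-bit weights of a 34-billion-parameter model alone occupy about 17 GB.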
  • **Software**: Python 3.10+, PyTorch 2.0+, and the Transformers library from Hugging Face.
  • **Dependencies**: `accelerate`, `bitsandbytes` (for quantization), and `sentencepiece`.
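  • **Disk space**: Approximately 70 GB for the model weights in float16 (half precision), or about 20 GB for a pre-quantized 4-bit version. Quantizing at load time with `bitsandbytes` still requires downloading the full float16 weights.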
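
The disk and VRAM figures above follow from simple arithmetic on the parameter count. A rough sketch, assuming 34 billion parameters and counting weights only (activations, the KV cache, and quantization metadata add overhead on top); the helper name `weight_gb` is hypothetical:

```python
# Rough weight-memory estimate for a 34B-parameter model.
# Counts weights only; runtime overhead is not included.
PARAMS = 34e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight size in gigabytes (10**9 bytes)."""
    return params * bytes_per_param / 1e9

if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name:>8}: {weight_gb(PARAMS, size):.0f} GB")
```

This prints 136 GB for float32, 68 GB for float16, 34 GB for int8, and 17 GB for 4-bit, which is why a 24 GB card is only realistic with 4-bit quantization.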

Step-by-step Installation

1. Set up a Python virtual environment

python3 -m venv nemo-coder-env
source nemo-coder-env/bin/activate

This creates an isolated environment to avoid dependency conflicts.

2. Install required libraries

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate bitsandbytes sentencepiece

The first command installs PyTorch with CUDA 11.8 support. To target a different CUDA version, change the `cu118` suffix in the index URL to match your system.
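
Before moving on, it can be worth confirming that the packages are importable and that PyTorch sees a GPU. A small sketch, assuming the import names match the package names installed above; `missing_packages` is a hypothetical helper:

```python
import importlib.util

# Import names of the packages installed in this step.
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes", "sentencepiece"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        import torch  # safe here: find_spec confirmed it is installed
        print("CUDA available:", torch.cuda.is_available())
```

If `CUDA available` prints `False`, the installed PyTorch build most likely does not match your driver or CUDA version.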

3. Download the model from Hugging Face

huggingface-cli login
# Follow the prompt to enter your token (if required for gated models)
huggingface-cli download nvidia/nemo-coder-34b --local-dir ./nemo-coder-34b

If the model is not gated, you can skip login. With roughly 70 GB of weights, expect the download to take 10–20 minutes on a fast (around 1 Gbit/s) connection and much longer on slower links.
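
Before starting a download of this size, it can help to confirm that the target drive has room. A small sketch using only the standard library; the 70 GB threshold comes from the requirements above, and `has_free_space` is a hypothetical helper name:

```python
import shutil

def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

if __name__ == "__main__":
    # "." is the directory the download command is run from.
    if has_free_space(".", 70):
        print("Enough space for the float16 weights.")
    else:
        print("Less than 70 GB free; choose another --local-dir.")
```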

4. Load the model in Python

Create a file called `load_model.py` with the following content:

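from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "./nemo-coder-34b"  # local directory created by the download step
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantization via bitsandbytes to reduce memory use
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s) automatically
)

print("Model loaded successfully!")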

Run it with:

python load_model.py

This loads the model in 4-bit mode, requiring about 20 GB of VRAM. If your GPU memory can hold the float16 weights (roughly 68 GB), remove the 4-bit quantization setting to run in half precision instead.

Usage Examples

Example 1: Basic Code Generation (SWE-bench style)

Create a script `generate_patch.py`:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "./nemo-coder-34b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

prompt = """You are a senior software engineer. Given the following bug report and code, generate a fix.

Bug: The function `calculate_total` raises an IndexError on empty lists. It should return 0.
Code:
def calculate_total(items):
    total = items[0]
    for item in items[1:]:
        total += item
    return total

Fix:"""

# Move the inputs to the device that holds the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.2)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Run it:

python generate_patch.py

The model should output a fix that adds an early return for empty lists.
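
Reading the output by eye works for one prompt, but SWE-bench-style evaluation ultimately means running tests against the generated code. A minimal sketch, assuming you have already extracted the function source from the model output (a hard-coded string stands in for it here); `passes_tests` and the test cases are hypothetical names for illustration:

```python
# Stand-in for text produced by the model; in practice you would extract
# the function definition from the decoded output.
CANDIDATE_SOURCE = '''
def calculate_total(items):
    if not items:
        return 0
    total = 0
    for item in items:
        total += item
    return total
'''

# (arguments, expected result) pairs for the function under test.
TEST_CASES = [
    (([],), 0),
    (([1, 2, 3],), 6),
    (([5],), 5),
]

def passes_tests(source: str, func_name: str, cases) -> bool:
    """Execute `source` in a fresh namespace and check every test case."""
    namespace = {}
    try:
        exec(source, namespace)  # run untrusted code only inside a sandbox
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failure.
        return False

if __name__ == "__main__":
    print(passes_tests(CANDIDATE_SOURCE, "calculate_total", TEST_CASES))
```

Because `exec` runs arbitrary code, model output should only be executed this way inside a container or similar sandbox.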

Example 2: Mythos-style Architectural Reasoning

Now let’s test a Mythos-like task. Create `mythos_reasoning.py`:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "./nemo-coder-34b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

prompt = """We have a web application with a REST API and a PostgreSQL database. We need to add a feature that sends a welcome email to new users within 5 minutes of registration. The email service is external and sometimes takes 30 seconds to respond. The system must not slow down the registration endpoint.

Describe the architecture change you would propose, including any new components, and write the code for the core logic. Be concise but complete."""

# Move the inputs to the device that holds the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Run it:

python mythos_reasoning.py

The model should suggest using a message queue (e.g., RabbitMQ) or a background worker. It may generate Python code using Celery or asyncio.
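
For reference when judging the output, the background-worker pattern mentioned above can be sketched with the standard library alone. This is an in-process illustration under simplifying assumptions: `send_welcome_email` is a hypothetical stand-in for the external service, and a production design would use a durable broker such as RabbitMQ or a task queue such as Celery so that jobs survive restarts.

```python
import queue
import threading

email_queue = queue.Queue()
sent = []  # records delivered emails, for demonstration only

def send_welcome_email(address: str) -> None:
    """Hypothetical stand-in for the slow external email service."""
    sent.append(address)

def email_worker() -> None:
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        address = email_queue.get()
        if address is None:
            email_queue.task_done()
            break
        try:
            send_welcome_email(address)
        finally:
            email_queue.task_done()

def register_user(address: str) -> dict:
    """Registration handler: enqueue the email and return immediately."""
    email_queue.put(address)
    return {"status": "registered", "email": address}

if __name__ == "__main__":
    worker = threading.Thread(target=email_worker, daemon=True)
    worker.start()
    print(register_user("ada@example.com"))
    email_queue.put(None)  # ask the worker to stop
    email_queue.join()     # wait until all queued work is done
    print(sent)
```

The key property to look for in the model's answer is the same one shown here: the registration path only enqueues work, so a 30-second email call never blocks the endpoint.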

Example 3: Fable-style Ambiguous Feature

Create `fable_feature.py`:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "./nemo-coder-34b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

prompt = """Implement a feature for a notes app: users can "pin" a note so it appears at the top of the list. The pinned notes should also have a small star icon. Assume we are using React and a simple in-memory state. Provide the complete component code.

Note: The requirements are intentionally vague. Make reasonable assumptions and state them in comments."""

# Move the inputs to the device that holds the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=600, temperature=0.5, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Run it:

python fable_feature.py

The model will likely produce a React component with a `pinned` boolean field, a sorting function, and a star icon from an icon library. It should include comments explaining assumptions (e.g., “assuming we use Font Awesome”).

Evaluating Your Results

To move beyond SWE-bench, you need to evaluate not just whether the code runs, but whether it meets the spirit of the requirement. For Mythos tasks, check:

  • Does the solution address all constraints?
  • Is the architecture scalable and maintainable?
  • Are trade-offs explicitly discussed?

For Fable tasks, check:

  • Does the code handle reasonable edge cases?
  • Are the assumptions documented?
  • Is the code clean and idiomatic?

You can create your own test suite by collecting issues from open-source projects and rewriting them as ambiguous feature requests. The Hugging Face Blog has examples of community-driven benchmarks that follow this pattern.
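
One way to keep manual reviews consistent is to record each checklist as data. The sketch below is an illustration under stated assumptions, not the official Mythos or Fable scoring: the `Review` class, its fields, and the equal weighting of checks are all hypothetical.

```python
from dataclasses import dataclass, field

# Checklists paraphrased from the questions above.
MYTHOS_CHECKS = [
    "addresses all constraints",
    "architecture is scalable and maintainable",
    "trade-offs explicitly discussed",
]
FABLE_CHECKS = [
    "handles reasonable edge cases",
    "assumptions documented",
    "code is clean and idiomatic",
]

@dataclass
class Review:
    """One reviewer's verdict on one model output."""
    task_id: str
    checks: list
    passed: set = field(default_factory=set)

    def mark(self, check: str) -> None:
        """Record that the output satisfies `check`."""
        if check not in self.checks:
            raise ValueError(f"unknown check: {check}")
        self.passed.add(check)

    def score(self) -> float:
        """Fraction of checks satisfied, each weighted equally."""
        return len(self.passed) / len(self.checks)

if __name__ == "__main__":
    review = Review("welcome-email", MYTHOS_CHECKS)
    review.mark("addresses all constraints")
    review.mark("trade-offs explicitly discussed")
    print(f"{review.task_id}: {review.score():.2f}")
```

Averaging `score()` across several reviewers and tasks gives a rough, comparable number while keeping the individual judgments available for inspection.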

Conclusion

SWE-bench has served the AI community well, but it is not the final word on software engineering intelligence. Mythos and Fable push the evaluation frontier toward architectural reasoning and creative feature implementation—skills that separate a code generator from a true engineering partner. Nvidia’s open contender, with its long context window and multi-step reasoning capabilities, offers a practical way to explore these new benchmarks on your own hardware.

By setting up the model locally and running the examples above, you can experience firsthand how the next generation of coding agents handles ambiguity, design trade-offs, and system-level thinking. The era of testing AI on isolated bug fixes is ending. The era of testing AI as a collaborative architect is just beginning.

*Further reading: The Batch (deeplearning.ai), Google AI Blog, Microsoft AI Blog, and Hugging Face Blog for ongoing discussions on AI software engineering benchmarks.*
