All official submissions to the BALROG leaderboard are maintained at BALROG/experiments.

Submit to BALROG Leaderboard

If you are interested in submitting your agent to the BALROG Leaderboard, please do the following:

  1. Fork and clone the BALROG/experiments repository.
  2. Create a new folder with the submission date and the agent name (e.g. 2024-09-21_balrog_gpt4o).
  3. Copy the logs from your agent's run into the new folder. Please include the following from your agent's evaluation:
    • babaisai: babaisai folder, containing summary and trajectory logs
    • babyai: babyai folder, containing summary and trajectory logs
    • crafter: crafter folder, containing summary and trajectory logs
    • minihack: minihack folder, containing summary and trajectory logs
    • nle: nethack folder, containing summary and trajectory logs
    • textworld: textworld folder, containing summary and trajectory logs
    • summary.json: Summary of the evaluation outcomes for all environments

    NOTE: You shouldn't have to create any of these files. They should be generated automatically by the BALROG evaluation.

  4. metadata.yaml: Metadata for how the result is shown on the website. Please include the following fields:
    • name: The name of your leaderboard entry
    • oss: true if your agent (model + strategy) is open-source
    • site: URL/link to more information about your agent
    • verified: false (See below for results verification)
  5. README.md: Include anything you'd like to share about your agent here!
  6. Run python submit.py <path-to-submission>
  7. Create a pull request to the BALROG/experiments repository with the new folder.
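
Before running submit.py, it can help to sanity-check the folder locally. The sketch below is a hypothetical pre-flight helper, not part of BALROG and not a description of what submit.py does. It assumes each environment's logs sit in a directory named after the label in the list above (so the NetHack logs are assumed to live in a folder called nle; adjust ENV_FOLDERS if your evaluation output uses a different name). The function names render_metadata and missing_entries are invented for illustration.

```python
import json
from pathlib import Path

# Assumption: directory names follow the labels in the list above.
# The list describes the "nle" entry as a "nethack folder", so change
# this constant if your evaluation output names it differently.
ENV_FOLDERS = ["babaisai", "babyai", "crafter", "minihack", "nle", "textworld"]

# Files the steps above ask for at the top level of the submission folder.
REQUIRED_FILES = ["summary.json", "metadata.yaml", "README.md"]


def render_metadata(name: str, oss: bool, site: str) -> str:
    """Return metadata.yaml text containing the four fields from step 4.

    Strings are emitted as double-quoted scalars via json.dumps, which is
    also valid YAML for ordinary text. `verified` is always false here,
    since verification is requested separately.
    """
    lines = [
        f"name: {json.dumps(name)}",
        f"oss: {'true' if oss else 'false'}",
        f"site: {json.dumps(site)}",
        "verified: false",
    ]
    return "\n".join(lines) + "\n"


def missing_entries(submission_dir: str) -> list:
    """List the expected folders and files that are absent from the folder."""
    root = Path(submission_dir)
    missing = [d for d in ENV_FOLDERS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

For example, calling missing_entries on your submission folder (such as 2024-09-21_balrog_gpt4o) returns an empty list when every expected entry is present, and writing the output of render_metadata to metadata.yaml inside that folder gives a starting point you can edit by hand. This only checks that entries exist; it does not inspect the contents of the logs.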

You can refer to this tutorial for a quick overview of how to evaluate your agent on BALROG.

Verify Your Results

The Verified check ✓ indicates that we (the BALROG team) received access to your agent and were able to reproduce a selection of the results.

If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:

  1. Create an issue.
  2. In the issue, provide us instructions on how to run your agent on BALROG.
  3. We will run your agent on a random subset of BALROG and verify the results.