Submit to BALROG Leaderboard
If you are interested in submitting your agent to the BALROG Leaderboard, please do the following:
- Fork and clone the BALROG/experiments repository.
- Create a new folder with the submission date and the agent name in the LLM or VLM directory (e.g. submissions/LLM/2024-09-21_balrog_gpt4o).
- Copy the logs from your agent's run. Please include the following files from your agent's evaluation:
  - babaisai: babaisai folder, containing summary and trajectory logs
  - babyai: babyai folder, containing summary and trajectory logs
  - crafter: crafter folder, containing summary and trajectory logs
  - minihack: minihack folder, containing summary and trajectory logs
  - nle: nethack folder, containing summary and trajectory logs
  - textworld: textworld folder, containing summary and trajectory logs
  - summary.json: Summary of the evaluation outcomes for all environments
NOTE: You shouldn't have to create any of these files. They should automatically be generated by BALROG evaluation.
- metadata.yaml: Metadata for how the result is shown on the website. Please include the following fields:
  - name: The name of your leaderboard entry
  - oss: true if your agent (model + strategy) is open-source
  - site: URL/link to more information about your agent
  - verified: false (see below for results verification)
  - date: submission date in string format (e.g. "2024-12-09")
- README.md: Include anything you'd like to share about your agent here!
- Run python submit.py <path-to-submission>
- Create a pull request to the BALROG/experiments repository with the new folder.
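The layout described in the steps above can be sanity-checked locally before opening the pull request. The following is a minimal sketch, not part of BALROG and not a substitute for submit.py (whose checks are not described here). The helper names submission_dir_name, missing_entries, and metadata_text are hypothetical, and the script assumes the nle logs live in a folder named nethack, as the list above suggests.

```python
from datetime import date
from pathlib import Path

# Entries taken from the list above. Assumption: the nle entry is described
# as the "nethack folder", so that folder name is used here.
EXPECTED_DIRS = ["babaisai", "babyai", "crafter", "minihack", "nethack", "textworld"]
EXPECTED_FILES = ["summary.json", "metadata.yaml", "README.md"]


def submission_dir_name(day: date, agent: str) -> str:
    """Build a folder name in the style of the example: <date>_<agent name>."""
    return f"{day.isoformat()}_{agent}"


def missing_entries(submission: Path) -> list:
    """Return the expected folders and files that are absent from `submission`."""
    missing = [d + "/" for d in EXPECTED_DIRS if not (submission / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (submission / f).is_file()]
    return missing


def metadata_text(name: str, oss: bool, site: str, day: date) -> str:
    """Render the metadata.yaml fields listed above as plain YAML text."""
    lines = [
        f'name: "{name}"',
        f"oss: {str(oss).lower()}",
        f'site: "{site}"',
        "verified: false",  # stays false until the results are verified
        f'date: "{day.isoformat()}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    folder = Path("submissions/LLM") / submission_dir_name(date(2024, 9, 21), "balrog_gpt4o")
    print("Submission folder:", folder.as_posix())
    print("Missing entries:", missing_entries(folder))
    # Placeholder values; replace them with your own entry's details.
    print(metadata_text("My Agent", True, "https://example.com", date(2024, 12, 9)))
```

An empty list from missing_entries only means the expected names are present; it says nothing about whether the logs themselves are complete.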
You can refer to this tutorial for a quick overview of how to evaluate your agent on BALROG.
Verify Your Results
The Verified check ✓ indicates that we (the BALROG team) received access to your agent and were able to reproduce a selection of the results.
If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:
- Create an issue
- In the issue, provide us with instructions on how to run your agent on BALROG.
- We will run your agent on a random subset of BALROG and verify the results.