New Benchmark Tutorial
This example walks through adding a new benchmark and scoring existing models on it. Everything can be developed locally with full access to publicly available models, but we strongly encourage you to submit your benchmark to Brain-Score to make it accessible to the community, and to make it into a goalpost that future models can be measured against.
If you haven’t already, check out other benchmarks and the docs.
A benchmark reproduces the experimental paradigm on a model candidate, and tests model predictions against the experimentally observed data, using a similarity metric.
In other words, a benchmark consists of three things (each of which is a plugin):
experimental paradigm
biological data (neural/behavioral)
similarity metric
For the biological data and the similarity metric, benchmarks can reuse previously submitted data and metrics; recombining existing plugins is perfectly valid.
Brain-Score also hosts benchmarks that do not pertain to neural or behavioral data, e.g. engineering (ML) benchmarks and other analyses. These benchmarks do not include biological data, and their metric might simply be ground-truth accuracy.
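As a concrete illustration, a minimal ground-truth metric for such an engineering benchmark might simply compute label accuracy. This is only a sketch, not an existing plugin:
import numpy as np
from brainscore_core.metrics import Score

def accuracy(predictions, targets) -> Score:
    # fraction of model predictions that exactly match the ground-truth labels
    return Score(np.mean(np.array(predictions) == np.array(targets)))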
1. Package data (optional)
You can contribute new data by submitting a data plugin. If you are building a benchmark using existing data, you can skip this step.
We use the BrainIO format to organize data. Datasets in brainio are called assemblies and are based on xarray, a multi-dimensional version of pandas, which allows for metadata on numpy arrays of arbitrary dimensionality.
Most assemblies contain a presentation dimension for the stimuli that were presented, as well as potentially other dimensions, e.g. for different subjects or different voxels.
The actual measurements (e.g. reading times, or voxel activity) are typically the values of an assembly.
Behavioral data
The following is an excerpt from the Futrell2018 data packaging.
from brainio.assemblies import BehavioralAssembly

reading_times = parse_experiment_data(...)  # load the experimental data, e.g. from .csv files
# ... obtain as much metadata as we can ...
assembly = BehavioralAssembly(reading_times, coords={
    'word': ('presentation', voc_word),
    'stimulus_id': ('presentation', stimulus_ID),
    ...
    'subject_id': ('subject', subjects),
    'WorkTimeInSeconds': ('subject', WorkTimeInSeconds_meta),
    ...
}, dims=('presentation', 'subject'))
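Since assemblies are xarray DataArrays, the attached coordinates can be used directly for indexing and aggregation. Continuing from the example above (a sketch, assuming the coordinates defined there):
mean_reading_times = assembly.mean('subject')  # average reading times across subjects
words = assembly['word'].values  # the presented words, in presentation order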
Neural data
The following is an excerpt from the Pereira2018 data packaging.
from brainio.assemblies import NeuroidAssembly

neural_recordings = parse_experiment_data(...)  # load the experimental data, e.g. from .mat files
# ... obtain as much metadata as we can ...
assembly = NeuroidAssembly(neural_recordings, coords={
    'stimulus': ('presentation', sentences),
    'stimulus_id': ('presentation', stimulus_id),
    ...
    'neuroid_id': ('neuroid', voxel_number),
    'atlas': ('neuroid', atlases),
    ...
}, dims=['presentation', 'neuroid'])
Register the data plugin
So that your data can be accessed via an identifier, you need to define an endpoint in the plugin registry.
For instance, if your data is on S3, the plugin might look as follows:
from brainio.assemblies import BehavioralAssembly
from brainscore_language import data_registry
from brainscore_language.utils.s3 import load_from_s3

def load_assembly() -> BehavioralAssembly:
    assembly = load_from_s3(
        identifier="Futrell2018",
        version_id="MpR.gIXN8UrUnqwQyj.kCrh4VWrBvsGf",
        sha1="381ccc8038fbdb31235b5f3e1d350f359b5e287f")
    return assembly

data_registry['Futrell2018'] = load_assembly
Unit tests
To ensure the data is in the right format, and not corrupted by any future changes, we require all plugins to include an accompanying test.py file with unit tests.
For instance, here is a small unit test example validating the dimensions of a reading times dataset.
from brainscore_language import load_dataset

def test_shape():
    assembly = load_dataset('Futrell2018')
    assert len(assembly['presentation']) == 10256
    assert len(assembly['subject']) == 180
These unit tests guarantee the continued validity of your plugin, so we encourage rigorous testing methods.
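Beyond shapes, tests can also validate metadata and invariants of the values. A sketch along those lines (the exact coordinates and invariants depend on your dataset):
import numpy as np
from brainscore_language import load_dataset

def test_metadata():
    assembly = load_dataset('Futrell2018')
    # the coordinates declared during packaging should be present
    assert 'stimulus_id' in assembly.coords
    assert 'subject_id' in assembly.coords
    # every presented stimulus should have a unique identifier
    stimulus_ids = assembly['stimulus_id'].values
    assert len(np.unique(stimulus_ids)) == len(stimulus_ids)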
2. Create metric (optional)
You can contribute a new metric by submitting a metric plugin. If you are building a benchmark using an existing metric, you can skip this step.
Metrics compute the similarity between two measurements, which can be model-vs-human, human-vs-human, or model-vs-model. Measurements could, for instance, be reading times or fMRI recordings.
A simple metric could be the Pearson correlation of two measurements:
import numpy as np
from scipy.stats import pearsonr
from brainio.assemblies import DataAssembly
from brainscore_core.metrics import Metric, Score
from brainscore_language import metric_registry

class PearsonCorrelation(Metric):
    def __call__(self, assembly1: DataAssembly, assembly2: DataAssembly) -> Score:
        rvalue, pvalue = pearsonr(assembly1, assembly2)
        score = Score(np.abs(rvalue))  # similarity score between 0 and 1 indicating alignment of the two assemblies
        return score

metric_registry['pearsonr'] = PearsonCorrelation
This is a very simple example and ignores e.g. checks ensuring the ordering is the same, cross-validation, or keeping track of metadata.
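For instance, a slightly more careful variant might first re-order the second assembly to match the stimulus order of the first before correlating. The following is only a sketch; it assumes both assemblies are one-dimensional over presentation and carry a stimulus_id coordinate:
import numpy as np
from scipy.stats import pearsonr
from brainio.assemblies import DataAssembly
from brainscore_core.metrics import Metric, Score

class AlignedPearsonCorrelation(Metric):
    def __call__(self, assembly1: DataAssembly, assembly2: DataAssembly) -> Score:
        # re-order assembly2 to follow the stimulus order of assembly1 before correlating
        position = {stimulus_id: index for index, stimulus_id in enumerate(assembly2['stimulus_id'].values)}
        indices = [position[stimulus_id] for stimulus_id in assembly1['stimulus_id'].values]
        assembly2 = assembly2.isel(presentation=indices)
        rvalue, pvalue = pearsonr(assembly1, assembly2)
        return Score(np.abs(rvalue))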
Unit tests
As with all plugins, please provide a test.py file to ensure the continued validity of your metric. For instance, the following is an excerpt from the Pearson correlation tests:
from pytest import approx
from brainscore_language import load_metric

def test_weak_correlation():
    a1 = [1, 2, 3, 4, 5]
    a2 = [3, 1, 6, 1, 2]
    metric = load_metric('pearsonr')
    score = metric(a1, a2)
    assert score == approx(.152, abs=.005)
3. Build the benchmark
With data and metric in place, you can put the two together to build a benchmark that scores model similarity to behavioral or neural measurements.
Structure
A benchmark runs the experiment on a (model) subject candidate in the __call__ method, and compares model predictions against experimental data. All interactions with the model go through methods defined in the ArtificialSubject interface; this allows all present and future models to be tested on your benchmark.
For example:
from brainscore_core.benchmarks import BenchmarkBase
from brainscore_core.metrics import Score
from brainscore_language import load_dataset, load_metric, ArtificialSubject

class MyBenchmark(BenchmarkBase):
    def __init__(self):
        self.data = load_dataset('mydata')
        self.metric = load_metric('pearsonr')
        ...

    def __call__(self, candidate: ArtificialSubject) -> Score:
        candidate.start_behavioral_task(ArtificialSubject.Task.reading_times)  # or any other task
        # or e.g. candidate.start_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
        #                                          recording_type=ArtificialSubject.RecordingType.fMRI)
        predictions = candidate.digest_text(stimuli)['behavior']
        raw_score = self.metric(predictions, self.data)
        score = ceiling_normalize(raw_score, self.ceiling)
        return score
Behavioral benchmark
To test for behavioral alignment, benchmarks compare model outputs to human behavioral measurements. The model is instructed to perform a certain task (e.g. output reading times), and then prompted to digest text input, for which it will output behavioral predictions.
For instance, here is a sample excerpt from the Futrell2018 benchmark comparing reading times:
class Futrell2018Pearsonr(BenchmarkBase):
    ...

    def __call__(self, candidate: ArtificialSubject) -> Score:
        candidate.start_behavioral_task(ArtificialSubject.Task.reading_times)
        stimuli = self.data['stimulus']
        predictions = candidate.digest_text(stimuli.values)['behavior']
        raw_score = self.metric(predictions, self.data)
        score = ceiling_normalize(raw_score, self.ceiling)
        return score

benchmark_registry['Futrell2018-pearsonr'] = Futrell2018Pearsonr
Neural benchmark
To test for neural alignment, benchmarks compare model internals to human internal neural activity, measured e.g. via fMRI or ECoG. Running the experiment on the model subject, the benchmark first instructs where and how to perform neural recording, and then prompts the subject with text input, for which the model will output neural predictions.
For instance, here is a sample excerpt from the Pereira2018 linear-predictivity benchmark linearly comparing fMRI activity:
class Pereira2018Linear(BenchmarkBase):
    ...

    def __call__(self, candidate: ArtificialSubject) -> Score:
        candidate.start_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
                                         recording_type=ArtificialSubject.RecordingType.fMRI)
        stimuli = self.data['stimulus']
        predictions = candidate.digest_text(stimuli.values)['neural']
        raw_score = self.metric(predictions, self.data)
        score = ceiling_normalize(raw_score, self.ceiling)
        return score

benchmark_registry['Pereira2018-linear'] = Pereira2018Linear
Ceiling
You might have noticed that model alignment scores are always relative to a ceiling. The ceiling is an estimate of how well the “perfect model” would perform. Often, this is an estimate of how well an average human is aligned to the specific data.
For instance, the Pereira2018 ceiling compares the linear alignment (i.e. using the same metric) of n-1 subjects to a heldout subject. The Futrell2018 ceiling compares how well one half of subjects is aligned to the other half of subjects, again using the same metric that is used for model comparisons.
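The following is a minimal sketch of such a split-half ceiling. The helper is hypothetical and assumes a presentation x subject assembly; the actual implementations additionally correct for the split-half underestimate and handle missing data:
import numpy as np
from numpy.random import RandomState

def split_half_ceiling(assembly, metric, num_splits=10):
    # assembly: presentation x subject; metric compares two presentation-long vectors
    random_state = RandomState(0)
    num_subjects = len(assembly['subject'])
    scores = []
    for _ in range(num_splits):
        subject_indices = random_state.permutation(num_subjects)
        half1 = assembly.isel(subject=subject_indices[:num_subjects // 2]).mean('subject')
        half2 = assembly.isel(subject=subject_indices[num_subjects // 2:]).mean('subject')
        scores.append(float(metric(half1, half2)))
    return np.median(scores)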
Running models on your benchmark
You can now locally run models on your benchmark (see 4. Submit to Brain-Score for running models on the Brain-Score platform). Run the score function, passing in the desired model identifier(s) and the identifier for your benchmark.
For instance, you might run:
from brainscore_language import score
model_score = score(model_identifier='distilgpt2', benchmark_identifier='benchmarkid-metricid')
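The returned score is itself a Score assembly: its value is the ceiling-normalized similarity, and additional information such as the raw, un-normalized score is typically attached as metadata:
print(model_score)      # ceiling-normalized similarity, typically between 0 and 1
print(model_score.raw)  # un-normalized score before dividing by the ceiling, if attached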
Unit tests
As with all plugins, please provide a test.py file to ensure the continued validity of your benchmark. For instance, the following is an excerpt from the Futrell2018 tests:
import numpy as np
from numpy.random import RandomState
from pytest import approx
from brainio.assemblies import BehavioralAssembly
from brainscore_language import ArtificialSubject, load_benchmark

class DummyModel(ArtificialSubject):
    def __init__(self, reading_times):
        self.reading_times = reading_times

    def digest_text(self, stimuli):
        return {'behavior': BehavioralAssembly(self.reading_times, coords={
            'context': ('presentation', stimuli),
            'stimulus_id': ('presentation', np.arange(len(stimuli)))},
            dims=['presentation'])}

    def start_behavioral_task(self, task: ArtificialSubject.Task):
        if task != ArtificialSubject.Task.reading_times:
            raise NotImplementedError()

def test_dummy_bad():
    benchmark = load_benchmark('Futrell2018-pearsonr')
    reading_times = RandomState(0).random(10256)
    dummy_model = DummyModel(reading_times=reading_times)
    score = benchmark(dummy_model)
    assert score == approx(0.0098731 / .858, abs=0.001)

def test_ceiling():
    benchmark = load_benchmark('Futrell2018-pearsonr')
    ceiling = benchmark.ceiling
    assert ceiling == approx(.858, abs=.0005)
    assert ceiling.raw.median('split') == ceiling
    assert ceiling.uncorrected_consistencies.median('split') < ceiling
Benchmark Card Creation
Please include a README.md file along with your benchmark to aid users with understanding and implementation. As part of your README.md, please include a YAML "Benchmark Card" section detailing your benchmark, using the following format as a guideline. (Note: when multiple benchmarks are submitted in a single plugin, a single YAML may be appropriate for two very similar benchmarks, while separate YAMLs may fit dissimilar benchmarks better; this is left to the creator's discretion.)
benchmark_details:
  name: <name of the benchmark>
  developer: <developing individual or group>
  date: <date of benchmark creation>
  version: <version number>
  type: <behavioral (tests against behavioral data), neural (tests against neural data), or engineering (others, typically test on ground-truth)>
  description: <a short summary description>
  license: <license details>
  questions: <where to send questions>
  references: <citation/reference information if relevant>
experiment:
  task: <list of ArtificialSubject task value(s) if any>
  recording: <ArtificialSubject Recording type(s) if any>
  bidirectionality: <(if relevant) unidirectional/bidirectional: whether bidirectionality was used to obtain e.g., internal recordings of the model>
data:
  accessibility: <public or private>
  type: <behavioral/neural and modality, e.g. neural; fMRI>
  granularity: <neural data granularity, e.g. whether there is any aggregation such as fROIs>
  method: <how was the data obtained e.g. # of participants, demographics, # of unique items, reps per item, etc.>
  data_card: <reference any existing data cards>
  references: <abbreviated BibTeX>
metric:
  metric: <e.g. PearsonR, accuracy>
  mapping: <how model predictions are mapped to neural data if at all, e.g., RidgeCV, LinReg, RSA>
  metric_card: <reference any existing metric cards>
  error_estimation: <methods used for estimating errors>
ethical_considerations: <any relevant ethical considerations>
recommendations: <any relevant caveats and recommendations>
example_usage: <one example should be in test_integration.py, others can be included>
4. Submit to Brain-Score
To share your plugins (data, metrics, and/or benchmarks) with the community and to make them accessible for continued model evaluation, please submit them to the platform.
There are two main ways to do that:
By uploading a zip file on the website
By submitting a GitHub pull request with the proposed changes
Both options result in the same outcome: your plugin will automatically be tested, and added to the codebase after it passes tests.
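For orientation, a plugin submission is a self-contained directory. The sketch below shows what the __init__.py of a typical benchmark plugin might look like; the file and identifier names other than __init__.py, test.py, and README.md are illustrative, so see existing plugins in the repository for the authoritative structure:
# benchmarks/mybenchmark/__init__.py  (hypothetical plugin directory)
from brainscore_language import benchmark_registry
from .benchmark import MyBenchmark  # benchmark.py holds the class shown in step 3

benchmark_registry['MyBenchmark-pearsonr'] = MyBenchmark
# the directory also contains test.py (unit tests), README.md (benchmark card),
# and optionally requirements.txt for extra dependencies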
Particulars on data
To make data assemblies accessible for Brain-Score model evaluations, they need to be uploaded. You can self-host your data (e.g. on S3/OSF), or contact us to host your data on S3. You can also choose to keep your data private such that models can be scored, but the data itself cannot be accessed.
For uploading data to S3, see the upload_data_assembly function in utils/s3.