Evaluate large language models across a variety of tasks.