Turify Templates
Browse our comprehensive collection of Turify templates for evaluating language models
Categories
๐ Blank Template
Start with a clean slate to build your own custom flow from scratch.
๐ Compare between prompt templates
Compare between prompt templates using template chaining. Visualize response quality across models.
๐ Compare prompt across models
A simple evaluation with a prompt template, some inputs, and three models to prompt. Visualizes variability in response length.
๐ค Compare system prompts
Compares response quality across different system prompts. Visualizes how well it sticks to the instructions to only print Racket code.
๐ Testing knowledge of book beginnings
Test whether different LLMs know the first sentences of famous books.
โ๏ธ Extract data with prompt chaining
Chain one prompt into another to extract entities from a text response. Plots number of entities.
๐ฌ๐ Estimate chat model sycophancy
Estimate how sycophantic a chat model is: ask it for a well-known fact, then tell it it's wrong, and check whether it apologizes or changes its answer.
๐งช Audit models for gender bias
Asks an LLM to estimate the gender of a person, given a profession and salary.
๐ Red-teaming of stereotypes about nationalities
Check for whether models refuse to generate stereotypes about people from different countries.
๐ฆ Multi-evals of prompt to extract structured data from tweets
Extracts named entities from a dataset of tweets, and double-checks the output against multiple eval criteria.
๐งฎ Produce structured outputs
Extract information from a dataset and output it in a structured JSON format using OpenAI's structured outputs feature.
๐จ Detect whether tool is triggered
Basic example showing whether a given prompt triggered tool usage.
๐ Compare output format
Check whether asking for a different format (YAML, XML, JSON, etc.) changes the content.
๐งโ๐ป๏ธ HumanEvals Python coding benchmark
Run the HumanEvals Python coding benchmark to evaluate LLMs on Python code completion, entirely in your browser. A classic!
๐ฏ Check robustness to prompt injection attacks
Get a sense of different model's robustness against prompt injection attacks.
๐ข Ground truth evaluation for math problems
Uses a Tabular Data Node to evaluate LLM performance on basic math problems. Compares responses to expected answer and plots performance.
๐ฆ Test knowledge of mosquitos
Uses an LLM scorer to test whether LLMs know the difference between lifetimes of male and female mosquitos.
๐ผ Generate images of animals
Shows images of a fox, sparrow, and a pig as a computer scientist and a gamer, using Dall-E2.