Home/Turify Templates

Turify Templates

Browse our comprehensive collection of Turify templates for evaluating language models

Categories

๐Ÿ“„ Blank Template

Performance & Reliability

Start with a clean slate to build your own custom flow from scratch.

๐Ÿ“‘ Compare between prompt templates

Performance & Reliability

Compare between prompt templates using template chaining. Visualize response quality across models.

๐Ÿ“Š Compare prompt across models

Performance & Reliability

A simple evaluation with a prompt template, some inputs, and three models to prompt. Visualizes variability in response length.

๐Ÿค– Compare system prompts

Performance & Reliability

Compares response quality across different system prompts. Visualizes how well it sticks to the instructions to only print Racket code.

๐Ÿ“— Testing knowledge of book beginnings

Performance & Reliability

Test whether different LLMs know the first sentences of famous books.

โ›“๏ธ Extract data with prompt chaining

Business Integration

Chain one prompt into another to extract entities from a text response. Plots number of entities.

๐Ÿ’ฌ๐Ÿ™‡ Estimate chat model sycophancy

Ethics & Content Safety

Estimate how sycophantic a chat model is: ask it for a well-known fact, then tell it it's wrong, and check whether it apologizes or changes its answer.

๐Ÿงช Audit models for gender bias

Ethics & Content Safety

Asks an LLM to estimate the gender of a person, given a profession and salary.

๐Ÿ›‘ Red-teaming of stereotypes about nationalities

Ethics & Content Safety

Check for whether models refuse to generate stereotypes about people from different countries.

๐Ÿฆ Multi-evals of prompt to extract structured data from tweets

Business Integration

Extracts named entities from a dataset of tweets, and double-checks the output against multiple eval criteria.

๐Ÿงฎ Produce structured outputs

Business Integration

Extract information from a dataset and output it in a structured JSON format using OpenAI's structured outputs feature.

๐Ÿ”จ Detect whether tool is triggered

Business Integration

Basic example showing whether a given prompt triggered tool usage.

๐Ÿ“‘ Compare output format

Customer Experience

Check whether asking for a different format (YAML, XML, JSON, etc.) changes the content.

๐Ÿง‘โ€๐Ÿ’ป๏ธ HumanEvals Python coding benchmark

Performance & Reliability

Run the HumanEvals Python coding benchmark to evaluate LLMs on Python code completion, entirely in your browser. A classic!

๐Ÿ—ฏ Check robustness to prompt injection attacks

Security & Compliance

Get a sense of different model's robustness against prompt injection attacks.

๐Ÿ”ข Ground truth evaluation for math problems

Performance & Reliability

Uses a Tabular Data Node to evaluate LLM performance on basic math problems. Compares responses to expected answer and plots performance.

๐ŸฆŸ Test knowledge of mosquitos

Performance & Reliability

Uses an LLM scorer to test whether LLMs know the difference between lifetimes of male and female mosquitos.

๐Ÿ–ผ Generate images of animals

Customer Experience

Shows images of a fox, sparrow, and a pig as a computer scientist and a gamer, using Dall-E2.