Template Library

📄 Blank Template

Performance & Reliability

Start with a clean slate to build your own custom flow from scratch.

More Information

📑 Compare between prompt templates

Performance & Reliability

Compare between prompt templates using template chaining. Visualize response quality across models.

More Information

📊 Compare prompt across models

Performance & Reliability

A simple evaluation with a prompt template, some inputs, and three models to prompt. Visualizes variability in response length.

More Information

🤖 Compare system prompts

Performance & Reliability

Compares response quality across different system prompts. Visualizes how well it sticks to the instructions to only print Racket code.

More Information

📗 Testing knowledge of book beginnings

Performance & Reliability

Test whether different LLMs know the first sentences of famous books.

More Information

⛓️ Extract data with prompt chaining

Business Integration

Chain one prompt into another to extract entities from a text response. Plots number of entities.

More Information

💬🙇 Estimate chat model sycophancy

Ethics & Content Safety

Estimate how sycophantic a chat model is: ask it for a well-known fact, then tell it it's wrong, and check whether it apologizes or changes its answer.

More Information

🧪 Audit models for gender bias

Ethics & Content Safety

Asks an LLM to estimate the gender of a person, given a profession and salary.

More Information

🛑 Red-teaming of stereotypes about nationalities

Ethics & Content Safety

Check for whether models refuse to generate stereotypes about people from different countries.

More Information

🐦 Multi-evals of prompt to extract structured data from tweets

Business Integration

Extracts named entities from a dataset of tweets, and double-checks the output against multiple eval criteria.

More Information

🧮 Produce structured outputs

Business Integration

Extract information from a dataset and output it in a structured JSON format using OpenAI's structured outputs feature.

More Information

🔨 Detect whether tool is triggered

Business Integration

Basic example showing whether a given prompt triggered tool usage.

More Information

📑 Compare output format

Customer Experience

Check whether asking for a different format (YAML, XML, JSON, etc.) changes the content.

More Information

🧑‍💻️ HumanEvals Python coding benchmark

Performance & Reliability

Run the HumanEvals Python coding benchmark to evaluate LLMs on Python code completion, entirely in your browser. A classic!

More Information

🗯 Check robustness to prompt injection attacks

Security & Compliance

Get a sense of different model's robustness against prompt injection attacks.

More Information

🔢 Ground truth evaluation for math problems

Performance & Reliability

Uses a Tabular Data Node to evaluate LLM performance on basic math problems. Compares responses to expected answer and plots performance.

More Information

🦟 Test knowledge of mosquitos

Performance & Reliability

Uses an LLM scorer to test whether LLMs know the difference between lifetimes of male and female mosquitos.

More Information

🖼 Generate images of animals

Customer Experience

Shows images of a fox, sparrow, and a pig as a computer scientist and a gamer, using Dall-E2.

More Information

Turify Templates

Categories

📄 Blank Template

📑 Compare between prompt templates

📊 Compare prompt across models

🤖 Compare system prompts

📗 Testing knowledge of book beginnings

⛓️ Extract data with prompt chaining

💬🙇 Estimate chat model sycophancy

🧪 Audit models for gender bias

🛑 Red-teaming of stereotypes about nationalities

🐦 Multi-evals of prompt to extract structured data from tweets

🧮 Produce structured outputs

🔨 Detect whether tool is triggered

📑 Compare output format

🧑‍💻️ HumanEvals Python coding benchmark

🗯 Check robustness to prompt injection attacks

🔢 Ground truth evaluation for math problems

🦟 Test knowledge of mosquitos

🖼 Generate images of animals