LLM-as-a-Judge
Availability: Hobby (Full), Pro (Full), Team (Full), Self-Hosted (Pro & Enterprise)
LLM-as-a-judge is a technique for evaluating the quality of LLM applications by using an LLM as the judge. The judge LLM is given a trace or a dataset entry and asked to score and reason about the output. The resulting score and reasoning are stored as scores in Langfuse.
What are common evaluation tasks?
LLM-as-a-judge evaluation tasks can be very use-case-specific. Common tasks for which Langfuse provides prebuilt prompts are:
- Hallucination
- Helpfulness
- Relevance
- Toxicity
- Correctness
- Context relevance
- Context correctness
- Conciseness
LLM-as-a-judge evaluators in Langfuse can be run on production traces as well as on experiment runs on datasets.
Alternatively, you can run any custom evaluation functions or packages on Langfuse data via the API/SDKs.
Custom end-to-end example: External evaluation pipeline.
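For illustration, a minimal external pipeline could fetch traces via the Python SDK and write the results back as scores. This is only a sketch: the method names (fetch_traces, score) follow the v2 Python SDK and may differ in other SDK versions, and my_custom_eval is a hypothetical placeholder for your own evaluation logic.
from langfuse import Langfuse

# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST are set as environment variables.
langfuse = Langfuse()

def my_custom_eval(trace_input, trace_output):
    # Hypothetical placeholder: swap in your own evaluation function or library.
    score = 1.0 if trace_output else 0.0
    reasoning = "Output is non-empty" if trace_output else "Output is empty"
    return score, reasoning

# Fetch a page of recent traces (method name from the v2 Python SDK).
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    score, reasoning = my_custom_eval(trace.input, trace.output)

    # Write the result back to Langfuse as a numeric score attached to the trace.
    langfuse.score(
        trace_id=trace.id,
        name="my_custom_eval",
        value=score,
        comment=reasoning,
    )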
Video Walkthrough
Get Started
Configure LLM provider
Langfuse supports a variety of LLM providers including OpenAI, Anthropic, Azure OpenAI, and AWS Bedrock.
To use LLM-as-a-judge, you have to configure your LLM provider in the Langfuse project settings.
Note: tool/function calling needs to be supported by the model for LLM-as-a-judge to work.
Create an LLM-as-a-judge template
LLM-as-a-judge uses a prompt template and model configuration to evaluate traces. In Langfuse, this configuration is stored in an Evaluator Template so that it can be reused across multiple evaluators.
To help get you started, Langfuse includes a set of predefined prompts for common evaluation tasks, but you can also write your own or customize the Langfuse-provided prompts.
Prompt templates contain {{variables}} that are substituted with actual data when an evaluator is run. You can create an arbitrary number of custom variables that can later be referenced when creating the evaluator. Common variables are input, output, context, ground_truth, etc.
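For example, a simple custom template for a helpfulness check could look like this. The variable names are arbitrary and only need to match the mapping you configure later in the evaluator:
Evaluate the helpfulness of the response on a continuous scale from 0 to 1.
A response is helpful (Score: 1) if it directly and completely addresses the query.

Query: {{input}}
Response: {{output}}

Think step by step.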
Langfuse uses function/tool calling to extract the evaluation output. At the bottom of the form, you can configure the score and reasoning variables, which are used to instruct the LLM on how to score and reason about the evaluation.
Currently, LLM-as-a-judge templates only support numeric scores. Support for categorical and boolean scores is on our roadmap (GitHub Issue).
Set up an evaluator
Now that you have created an evaluator template, you can configure the data to which Langfuse should apply it.
Here we need to configure the following aspects:
- Which Evaluator Template to use
- Trigger: On what incoming data should the evaluator be executed?
- Name of the scores which will be created as a result of the evaluation.
- Specify how Langfuse should fill the {{variables}} in the template.
  - Langfuse traces can be deeply nested (see conceptual overview). You can query from the trace directly, or from any nested observation via its name.
  - Select whether to use the Input, Output, or metadata value.
- Optional: Add sampling to reduce costs when running evaluations on a large volume of production data.
- Optional: Configure a custom delay. This ensures that all data has arrived at Langfuse servers before the evaluation is executed. The delay starts when the trace is first added to Langfuse, even if the trace is still in progress. This is especially important for long-running agent executions.
✨ Done! You have created an evaluator which will now automatically be executed on all data that matches the selected trigger.
Monitoring of Evaluators
Each evaluator has its own log page where you can view the progress and logs to potentially debug any issues.
Troubleshooting
LLM proxies
You can use an LLM proxy to power LLM-as-a-judge in Langfuse. Please create an LLM API Key in the project settings and set the base URL to resolve to your proxy’s host. The proxy must accept the API format of one of our adapters and support tool calling.
For OpenAI-compatible proxies, here is an example tool-calling request in OpenAI format that the proxy must handle to support LLM-as-a-judge in Langfuse:
curl -X POST 'https://<host set in project settings>/chat/completions' \
-H 'accept: application/json' \
-H 'content-type: application/json' \
-H 'authorization: Bearer <api key entered in project settings>' \
-H 'x-test-header-1: <custom header set in project settings>' \
-H 'x-test-header-2: <custom header set in project settings>' \
-d '{
"model": "<model set in project settings>",
"temperature": 0,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
"max_tokens": 256,
"n": 1,
"stream": false,
"tools": [
{
"type": "function",
"function": {
"name": "extract",
"parameters": {
"type": "object",
"properties": {
"score": {
"type": "string"
},
"reasoning": {
"type": "string"
}
},
"required": [
"score",
"reasoning"
],
"additionalProperties": false,
"$schema": "http://json-schema.org/draft-07/schema#"
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "extract"
}
},
"messages": [
{
"role": "user",
"content": "Evaluate the correctness of the generation on a continuous scale from 0 to 1. A generation can be considered correct (Score: 1) if it includes all the key facts from the ground truth and if every fact presented in the generation is factually supported by the ground truth or common sense.\n\nExample:\nQuery: Can eating carrots improve your vision?\nGeneration: Yes, eating carrots significantly improves your vision, especially at night. This is why people who eat lots of carrots never need glasses. Anyone who tells you otherwise is probably trying to sell you expensive eyewear or does not want you to benefit from this simple, natural remedy. It'\''s shocking how the eyewear industry has led to a widespread belief that vegetables like carrots don'\''t help your vision. People are so gullible to fall for these money-making schemes.\nGround truth: Well, yes and no. Carrots won'\''t improve your visual acuity if you have less than perfect vision. A diet of carrots won'\''t give a blind person 20/20 vision. But, the vitamins found in the vegetable can help promote overall eye health. Carrots contain beta-carotene, a substance that the body converts to vitamin A, an important nutrient for eye health. An extreme lack of vitamin A can cause blindness. Vitamin A can prevent the formation of cataracts and macular degeneration, the world'\''s leading cause of blindness. However, if your vision problems aren'\''t related to vitamin A, your vision won'\''t change no matter how many carrots you eat.\nScore: 0.1\nReasoning: While the generation mentions that carrots can improve vision, it fails to outline the reason for this phenomenon and the circumstances under which this is the case. The rest of the response contains misinformation and exaggerations regarding the benefits of eating carrots for vision improvement. It deviates significantly from the more accurate and nuanced explanation provided in the ground truth.\n\n\n\nInput:\nQuery: {{query}}\nGeneration: {{generation}}\nGround truth: {{ground_truth}}\n\n\nThink step by step."
}
]
}'
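In addition to accepting this request, the proxy must return a standard OpenAI-style chat completion containing a tool call to the extract function. The following response shape is illustrative; the id, created, usage, and argument values are placeholders:
{
  "id": "chatcmpl-placeholder",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "<model set in project settings>",
  "choices": [
    {
      "index": 0,
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_placeholder",
            "type": "function",
            "function": {
              "name": "extract",
              "arguments": "{\"score\": \"0.1\", \"reasoning\": \"The generation misses key facts from the ground truth.\"}"
            }
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}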