Quickstart¶
The Trustworthy Language Model (TLM) scores the trustworthiness of every LLM response in real time, automatically flagging responses that may be incorrect. TLM can detect incorrect outputs from any LLM and can score any type of model output (natural language response, classification decision, structured output, tool call, etc.).
This tutorial demonstrates how to quickly make any LLM application more reliable with TLM; other tutorials demonstrate how to better utilize TLM in specific applications.
Setup¶
This tutorial requires an API key for an LLM provider.
Some possibilities include: OPENAI_API_KEY, GEMINI_API_KEY, DEEPSEEK_API_KEY, AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, etc.
The TLM Python client can be installed using pip:
%pip install --upgrade trustworthy-llm
# Set your API key
import os
os.environ["OPENAI_API_KEY"] = "<API key>" # or other LLM provider API key
Using TLM¶
You can use TLM pretty much like any other LLM API:
from tlm import TLM
tlm = TLM() # See Advanced Tutorial for optional TLM configurations to get better/faster results
openai_kwargs = {"model": "gpt-4.1-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}
tlm_result = tlm.create(**openai_kwargs)
tlm_result
{'response': ModelResponse(id='chatcmpl-Cvp7cQSXi0AYaYHhLMtY6Pr1pVDkf', created=1767897244, model='gpt-4.1-mini-2025-04-14', object='chat.completion', system_fingerprint='fp_376a7ccef1', choices=[Choices(finish_reason='stop', index=0, message=Message(content='The capital of France is Paris.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), logprobs=ChoiceLogprobs(content=[...]))], usage=Usage(completion_tokens=9, prompt_tokens=34, total_tokens=43, ...), service_tier='default'),
 'trustworthiness_score': np.float64(0.9997975077878962),
 'usage': {'num_input_tokens': 34, 'num_output_tokens': 9},
 'metadata': {},
 'evals': {},
 'explanation': 'Did not find a reason to doubt trustworthiness.'}
TLM's result will be a dict with the following fields:
{
    "response": ModelResponse(...),  # Full model response object (like OpenAI's ChatCompletion)
    "trustworthiness_score": 0.87,  # Numerical value between 0 and 1
    "usage": {},  # Token usage info
    "metadata": {},  # Additional metadata dict
    "evals": {},  # Additional evaluation results dict (if evals specified)
    "explanation": "Did not find a reason to doubt trustworthiness."  # String explanation
}
The response is a full model response object (e.g., OpenAI's ChatCompletion or similar) containing the generated text, model info, token usage, and other standard LLM response fields. You can access the text content via result["response"].choices[0].message.content (or similar, depending on the provider).
The trustworthiness_score quantifies how confident you can be that the response is correct (higher values indicate greater trustworthiness). These scores are computed via state-of-the-art uncertainty estimation for LLMs.
The usage field provides token usage information, metadata contains additional metadata, evals contains optional evaluation results, and explanation provides a human-readable explanation of the trustworthiness assessment.
Boost the reliability of any LLM application by adding contingency plans to handle LLM responses whose trustworthiness score is low (e.g. escalate to a human, append a disclaimer, revert to a fallback answer, request more information from the user, ...).
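As a minimal sketch of such a contingency plan: the `guarded_answer` helper, the 0.7 threshold, and the fallback wording below are illustrative choices, not part of the TLM API.

```python
def guarded_answer(tlm_result, threshold=0.7):
    """Return the LLM response text if trusted, otherwise a fallback message.

    The 0.7 threshold and fallback wording are illustrative assumptions;
    tune both for your application.
    """
    response_text = tlm_result["response"].choices[0].message.content
    if tlm_result["trustworthiness_score"] >= threshold:
        return response_text
    return "Sorry, I am unsure. Try rephrasing your request, or contact us."
```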
print("LLM response: ", tlm_result["response"].choices[0].message.content)
print("Trustworthiness score: ", tlm_result["trustworthiness_score"])
LLM response:  The capital of France is Paris.
Trustworthiness score:  0.9997975077878962
Scoring the trustworthiness of a given response¶
TLM can also score the trustworthiness of any response to a given prompt. The response could be from any LLM you're using, or even be human-written.
import openai
openai_kwargs = {"model": "gpt-4.1-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}
openai_response = openai.chat.completions.create(**openai_kwargs)
openai_response
ChatCompletion(id='chatcmpl-Cvp7hhWjbfLS3mSjCjOcDXyWhuyMF', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The capital of France is Paris.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1767897249, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_376a7ccef1', usage=CompletionUsage(completion_tokens=7, prompt_tokens=14, total_tokens=21, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
You can then pass the response from your LLM directly into TLM (alongside the original arguments used to generate the response) for trustworthiness scoring:
tlm = TLM()
tlm_result = tlm.score(**openai_kwargs, response=openai_response)
tlm_result
{'response': {'chat_completion': {'id': 'chatcmpl-Cvp7hhWjbfLS3mSjCjOcDXyWhuyMF',
'choices': [{'finish_reason': 'stop',
'index': 0,
'logprobs': None,
'message': {'content': 'The capital of France is Paris.',
'refusal': None,
'role': 'assistant',
'annotations': [],
'audio': None,
'function_call': None,
'tool_calls': None}}],
'created': 1767897249,
'model': 'gpt-4.1-mini-2025-04-14',
'object': 'chat.completion',
'service_tier': 'default',
'system_fingerprint': 'fp_376a7ccef1',
'usage': {'completion_tokens': 7,
'prompt_tokens': 14,
'total_tokens': 21,
'completion_tokens_details': {'accepted_prediction_tokens': 0,
'audio_tokens': 0,
'reasoning_tokens': 0,
'rejected_prediction_tokens': 0},
'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}}},
'trustworthiness_score': np.float64(1.0),
'usage': {},
'metadata': {},
'evals': {},
'explanation': 'Did not find a reason to doubt trustworthiness.'}
The output dictionary is similar to that returned by the create() method; you can extract the trustworthiness score and response from it in the same way.
print("LLM response: ", tlm_result["response"]["chat_completion"]["choices"][0]["message"]["content"])
print("Trustworthiness score: ", tlm_result["trustworthiness_score"])
LLM response:  The capital of France is Paris.
Trustworthiness score:  1.0
For example, TLM returns a high score when your LLM's response is confidently accurate:
openai_kwargs = {
"model": "gpt-4.1-mini",
"messages": [{"role": "user", "content": "What's the first month of the year?"}],
}
openai_response = openai.chat.completions.create(**openai_kwargs)
openai_response.choices[0].message.content
'The first month of the year is January.'
tlm_result = tlm.score(**openai_kwargs, response=openai_response)
print("LLM response: ", tlm_result["response"]["chat_completion"]["choices"][0]["message"]["content"])
print("Trustworthiness score: ", tlm_result["trustworthiness_score"])
LLM response:  The first month of the year is January.
Trustworthiness score:  0.9916666343978949
And TLM returns a low score when your LLM's response is untrustworthy, either because it is incorrect/unhelpful or because the model is highly uncertain:
# manually edit the response to be incorrect
openai_response.choices[0].message.content = "The first month of the year is February."
tlm_result = tlm.score(**openai_kwargs, response=openai_response)
print("LLM response: ", tlm_result["response"]["chat_completion"]["choices"][0]["message"]["content"])
print("Trustworthiness score: ", tlm_result["trustworthiness_score"])
LLM response:  The first month of the year is February.
Trustworthiness score:  3.226877178426809e-08
TLM.score() helps you add trustworthiness scoring to any LLM application without changing your existing code.
TLM.create() helps you simultaneously generate and score LLM responses.
TLM runs on top of a base LLM (OpenAI's gpt-4.1-mini by default). For faster or better TLM results, specify a correspondingly faster or better base model.
How to use these trust scores for reliable AI?¶
Offline, you can manually review the lowest-trust LLM responses across a dataset and discover insights to improve your LLM prompts.
In real time, you can automatically determine which LLM responses are untrustworthy by comparing trustworthiness scores against a fixed threshold (say 0.7). The overall magnitude of trust scores may differ between applications, so select application-specific thresholds.
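One way to select such a threshold empirically: given trust scores and correctness labels from a small validation set, pick the lowest threshold at which the accepted responses meet a target precision. This is a hedged sketch; the `pick_threshold` helper, the 0.9 target, and the validation data are all made up for illustration.

```python
def pick_threshold(scores, is_correct, target_precision=0.9):
    """Pick the lowest threshold at which accepted responses
    (score >= threshold) meet the target precision.

    scores/is_correct are parallel lists from a validation set;
    the 0.9 target precision is an illustrative assumption.
    """
    for t in sorted(set(scores)):
        accepted = [ok for s, ok in zip(scores, is_correct) if s >= t]
        if accepted and sum(accepted) / len(accepted) >= target_precision:
            return t
    return 1.0  # no threshold achieves the target; accept nothing

# Made-up validation data: low scores mostly wrong, high scores mostly right
scores = [0.1, 0.3, 0.5, 0.8, 0.9, 0.95]
labels = [False, False, True, True, True, True]
print(pick_threshold(scores, labels))  # prints 0.5
```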
For maximally reliable AI applications, you can escalate untrustworthy LLM responses for human review.
Here are other strategies to automatically handle untrustworthy LLM responses without a human-in-the-loop:
- Append a warning message/disclaimer to the response.
- Replace your LLM response with a fallback message such as: "Sorry I am unsure. Try rephrasing your request, or contact us".
- In RAG, the fallback message might include raw retrieved context or search-results, for example: "Sorry I am unsure. Here's some potentially relevant information: ...".
- Replace your original LLM response with a re-generated response.
- Escalate to a more expensive AI system (e.g. DeepResearch API).
Below, we showcase example implementations of these strategies.
Append disclaimer to untrustworthy responses¶
One straightforward strategy is to still present untrustworthy LLM responses to your user, but first edit them to make them less misleading. You could append a cautionary warning after the response:
if trustworthiness_score < threshold:  # say 0.7
    response = response + "\n\n CAUTION: This answer was flagged as potentially untrustworthy."
Or you could append a hedging statement before the response, making it sound less confident:
if trustworthiness_score < threshold:  # say 0.7
    response = "I'm not sure, but I'd guess:\n\n" + response
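The two snippets above can be wrapped into one reusable helper. This is a sketch; the `hedge_response` name, the `style` parameter, and the default threshold are illustrative assumptions, not part of the TLM API.

```python
def hedge_response(response, trustworthiness_score, threshold=0.7, style="warn"):
    """Edit an untrustworthy response to make it less misleading.

    style="warn" appends a cautionary warning; style="hedge" prefixes a
    hedging statement. Threshold and wording are illustrative assumptions.
    """
    if trustworthiness_score >= threshold:
        return response  # trusted: present unchanged
    if style == "warn":
        return response + "\n\nCAUTION: This answer was flagged as potentially untrustworthy."
    return "I'm not sure, but I'd guess:\n\n" + response
```

You would call this on every (response, score) pair before presenting the response to the user.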