Scoring Structured Output Responses with TLM¶
This tutorial demonstrates how to score the trustworthiness of structured outputs (e.g. JSON) using TLM.
Setup¶
This tutorial requires an API key for an LLM provider. Some possibilities include: OPENAI_API_KEY, GEMINI_API_KEY, DEEPSEEK_API_KEY, AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, etc.
The TLM Python client can be installed using pip:
%pip install --upgrade trustworthy-llm
# Set your API key
import os
os.environ["OPENAI_API_KEY"] = "<INSERT_API_KEY>" # or other LLM provider API key
from pydantic import create_model
from typing import Optional
from tlm import TLM
PII Extraction Use Case¶
This tutorial showcases a PII (Personally Identifiable Information) extraction example.
Each text sample contains various types of personal information embedded within natural language text. The task is to extract different categories of PII from the text. Each example contains multiple types of PII that need to be identified and classified into specific categories including names (FIRSTNAME, LASTNAME), dates (DATE), and account numbers (ACCOUNTNUMBER).
Let’s take a look at a few samples below:
input_texts = [
"Dear Orland, we are pleased to inform you that your scholarship grant of Danish Krone39k will be transferred to your account KZ551736040717OKI15Z shortly.",
"In relation to the filed litigation, we hereby request full disclosure of all data logs associated with 242.218.157.166 and cd9f:d9e5:1ceb:fd39:b2d7:f3fd:c9cd:c27b tied to the account of Fleta London Emard. This involves her employment account Investment Account with Gutkowski Inc.",
"We would like to do a follow-up meeting with Sierra Green regarding her recent surgery. The proposed date is August 13, 2013 at our clinic in West Nash.",
"Melvin, the password of your study support account has been changed to Xnjv7nCydECf for security purposes. Please update it promptly.",
"Is your business tax-ready? Our team in Novato is here to help you navigate through Martinique's complex tax rules. Contact us at 56544500.",
"To: Maximillian Noah Moore, we forgot to update your record with phone IMEI: 30-265288-033265-8. Could you please provide it in your earliest convenience to keep your records updated.",
]
print(input_texts[0])
Dear Orland, we are pleased to inform you that your scholarship grant of Danish Krone39k will be transferred to your account KZ551736040717OKI15Z shortly.
Obtain LLM Predictions¶
Define Structured Output Schema¶
We know that the 4 PII fields that we want to extract are: ['FIRSTNAME', 'LASTNAME', 'DATE', 'ACCOUNTNUMBER']
Using that, we can create a Pydantic model to represent our PII extraction schema. Each field is optional and can be None if that entity type is not found in the text:
pii_entities = ["FIRSTNAME", "LASTNAME", "DATE", "ACCOUNTNUMBER"]
fields = {name: (Optional[str], None) for name in pii_entities}
PII = create_model("PII", **fields)
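For intuition, the dynamically created model is just a class with four optional string fields defaulting to None. A plain-stdlib sketch of the same shape using a dataclass (illustration only; TLM's response_format expects the Pydantic class created above, not this dataclass):

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Stdlib analogue of the dynamically created Pydantic model, for illustration.
# Every PII field is optional and defaults to None when absent from the text.
@dataclass
class PIISketch:
    FIRSTNAME: Optional[str] = None
    LASTNAME: Optional[str] = None
    DATE: Optional[str] = None
    ACCOUNTNUMBER: Optional[str] = None

record = PIISketch(FIRSTNAME="Orland", ACCOUNTNUMBER="KZ551736040717OKI15Z")
print(asdict(record))
# → {'FIRSTNAME': 'Orland', 'LASTNAME': None, 'DATE': None, 'ACCOUNTNUMBER': 'KZ551736040717OKI15Z'}
```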
Prompt TLM for responses and trust scores¶
tlm = TLM()
sample_text = input_texts[0]
openai_kwargs = {
"model": "gpt-4.1-mini",
"messages": [
{
"role": "user",
"content": f"Extract PII information from the following text, return null if the entity is not found: {sample_text}",
}
],
"response_format": PII,
}
tlm_result = tlm.create(**openai_kwargs)
The returned object matches what the underlying LLM would ordinarily return, except it includes an additional trustworthiness_score field from TLM, along with metadata such as per-field trust scores.
print("LLM response: ", tlm_result["response"].choices[0].message.content)
print("Trustworthiness score: ", tlm_result["trustworthiness_score"])
print(f"Per-field Trustworthiness Scores: {tlm_result['metadata']['per_field_score']}")
LLM response: {'FIRSTNAME': 'Orland', 'LASTNAME': None, 'DATE': None, 'ACCOUNTNUMBER': 'KZ551736040717OKI15Z'}
Trustworthiness score: 0.9800000000000001
Per-field Trustworthiness Scores: {'FIRSTNAME': {'score': 0.999, 'explanation': "The text addresses 'Orland', which is a valid extracted first name."}, 'LASTNAME': {'score': 0.999, 'explanation': "No last name is mentioned in the text, so 'null' is appropriate."}, 'DATE': {'score': 0.999, 'explanation': "No date is provided in the text, so returning 'null' is correct."}, 'ACCOUNTNUMBER': {'score': 0.999, 'explanation': "The account number 'KZ551736040717OKI15Z' is explicitly mentioned in the text and correctly extracted."}}
Run a dataset of many examples¶
Here, we define a quick helper function that allows us to process multiple text samples in parallel, which will speed up prompting the LLM over a dataset. The helper function also collects the LLM outputs and trustworthiness score in a formatted DataFrame for easy downstream analysis.
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import time
from tqdm import tqdm
def extract_pii(text):
tlm = TLM()
openai_kwargs = {
"model": "gpt-4.1-mini",
"messages": [
{
"role": "user",
"content": f"Extract PII information from the following text, return null if the entity is not found: {text}",
}
],
"response_format": PII,
}
tlm_result = tlm.create(**openai_kwargs)
return {
"raw_completion": tlm_result,
# the columns below extract the PII information and scores from the raw OpenAI response
"extracted_pii": tlm_result["response"].choices[0].message.content,
"trustworthiness_score": tlm_result["trustworthiness_score"],
"per_field_score": tlm_result["metadata"]["per_field_score"],
}
def extract_pii_batch(texts, batch_size=15, max_threads=8, sleep_time=2):
results = []
for i in tqdm(range(0, len(texts), batch_size)):
batch = texts[i : i + batch_size]
with ThreadPoolExecutor(max_threads) as executor:
futures = [executor.submit(extract_pii, text) for text in batch]
batch_results = [f.result() for f in futures]
results.extend(batch_results)
# sleep to prevent hitting rate limits
if i + batch_size < len(texts):
time.sleep(sleep_time)
return pd.DataFrame(results)
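If individual requests occasionally fail (e.g. transient rate-limit errors despite the sleep between batches), you could wrap extract_pii in a retry decorator before submitting it to the executor. A minimal sketch with exponential backoff (with_retries is a hypothetical helper, not part of TLM):

```python
import time

# Hypothetical retry wrapper: retries a function with exponential backoff,
# re-raising the last exception if all attempts fail
def with_retries(fn, max_attempts=3, base_delay=1.0):
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...
    return wrapper

# Usage sketch: executor.submit(with_retries(extract_pii), text)
```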
results = extract_pii_batch(input_texts)
results.head(2)
100%|██████████| 1/1 [00:05<00:00, 5.84s/it]
| | raw_completion | extracted_pii | trustworthiness_score | per_field_score |
|---|---|---|---|---|
| 0 | {'response': ModelResponse(id='chatcmpl-DBwRlW... | {'FIRSTNAME': 'Orland', 'LASTNAME': None, 'DAT... | 0.970000 | {'FIRSTNAME': {'score': 0.999, 'explanation': ... |
| 1 | {'response': ModelResponse(id='chatcmpl-DBwRla... | {'FIRSTNAME': 'Fleta London', 'LASTNAME': 'Ema... | 0.181196 | {'FIRSTNAME': {'score': 0.13961111111111107, '... |
Examine Results¶
We’ve now generated structured outputs (i.e. extracted data) for each text sample in the dataset and scored the trustworthiness of each output.
pd.set_option("display.max_colwidth", None)
results["input_text"] = input_texts
High Trustworthiness Scores¶
The responses with the highest trustworthiness scores represent texts where TLM is most confident in the accuracy of your LLM’s structured outputs.
Looking at the examples below with high trustworthiness scores, we can see that the model successfully extracted the correct PII elements in these text samples:
results.sort_values("trustworthiness_score", ascending=False).head(2)[
["input_text", "extracted_pii", "trustworthiness_score"]
]
| | input_text | extracted_pii | trustworthiness_score |
|---|---|---|---|
| 0 | Dear Orland, we are pleased to inform you that your scholarship grant of Danish Krone39k will be transferred to your account KZ551736040717OKI15Z shortly. | {'FIRSTNAME': 'Orland', 'LASTNAME': None, 'DATE': None, 'ACCOUNTNUMBER': 'KZ551736040717OKI15Z'} | 0.97 |
| 3 | Melvin, the password of your study support account has been changed to Xnjv7nCydECf for security purposes. Please update it promptly. | {'FIRSTNAME': 'Melvin', 'LASTNAME': None, 'DATE': None, 'ACCOUNTNUMBER': None} | 0.97 |
Low Trustworthiness Scores¶
The lowest trustworthiness scores reveal the LLM outputs that TLM is least confident are accurate. Results with low trustworthiness scores benefit most from manual review, especially if we need nearly all outputs across the dataset to be correct while limiting human review costs.
The LLM outputs with the lowest trustworthiness scores in this dataset are shown below; these extractions are often incorrect or ambiguous, warranting further review.
results.sort_values("trustworthiness_score").head(2)[["input_text", "extracted_pii", "trustworthiness_score"]]
| | input_text | extracted_pii | trustworthiness_score |
|---|---|---|---|
| 1 | In relation to the filed litigation, we hereby request full disclosure of all data logs associated with 242.218.157.166 and cd9f:d9e5:1ceb:fd39:b2d7:f3fd:c9cd:c27b tied to the account of Fleta London Emard. This involves her employment account Investment Account with Gutkowski Inc. | {'FIRSTNAME': 'Fleta London', 'LASTNAME': 'Emard', 'DATE': None, 'ACCOUNTNUMBER': 'Investment Account'} | 0.181196 |
| 5 | To: Maximillian Noah Moore, we forgot to update your record with phone IMEI: 30-265288-033265-8. Could you please provide it in your earliest convenience to keep your records updated. | {'FIRSTNAME': 'Maximillian', 'LASTNAME': 'Noah Moore', 'DATE': None, 'ACCOUNTNUMBER': None} | 0.372823 |
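In a production pipeline, you might auto-accept extractions whose trustworthiness score exceeds a threshold and route the rest to human review. A minimal sketch over a list of result records (triage is a hypothetical helper; the threshold value is an assumption you should tune for your accuracy/cost trade-off):

```python
# Hypothetical triage: auto-accept high-trust extractions, queue the rest for review
def triage(records, threshold=0.9):
    accepted = [r for r in records if r["trustworthiness_score"] >= threshold]
    review = [r for r in records if r["trustworthiness_score"] < threshold]
    return accepted, review

sample = [
    {"id": 0, "trustworthiness_score": 0.97},
    {"id": 1, "trustworthiness_score": 0.18},
]
accepted, review = triage(sample)
print([r["id"] for r in accepted], [r["id"] for r in review])  # → [0] [1]
```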
Obtaining Trust Scores for Individual Fields¶
Beyond TLM’s overall trustworthiness score, you can obtain granular confidence scores for each individual field in the structured output from your LLM. These field-level scores help you pinpoint which specific values may be incorrect or warrant focused review.
Let’s look at the text sample receiving the lowest trustworthiness score in this dataset:
lowest_scoring_text = results.loc[results["trustworthiness_score"].idxmin()]
print(f"Text: {lowest_scoring_text['input_text']}")
print(f"Extracted PII Information: {lowest_scoring_text['extracted_pii']}")
print(f"Trustworthiness Score: {lowest_scoring_text['trustworthiness_score']}")
print(f"Per-field Trustworthiness Scores: {lowest_scoring_text['per_field_score']}")
Text: In relation to the filed litigation, we hereby request full disclosure of all data logs associated with 242.218.157.166 and cd9f:d9e5:1ceb:fd39:b2d7:f3fd:c9cd:c27b tied to the account of Fleta London Emard. This involves her employment account Investment Account with Gutkowski Inc.
Extracted PII Information: {'FIRSTNAME': 'Fleta London', 'LASTNAME': 'Emard', 'DATE': None, 'ACCOUNTNUMBER': 'Investment Account'}
Trustworthiness Score: 0.18119641830134262
Per-field Trustworthiness Scores: {'FIRSTNAME': {'score': 0.13961111111111107, 'explanation': "The text specifies the full name as 'Fleta London Emard', suggesting 'Fleta' is the first name and 'London' is likely a middle name. The response incorrectly includes 'Fleta London' as the first name, which is not accurate."}, 'LASTNAME': {'score': 0.999, 'explanation': "The last name 'Emard' is clearly stated and correctly extracted from the text."}, 'DATE': {'score': 0.999, 'explanation': 'There is no date mentioned anywhere in the text, so returning null is appropriate and correct.'}, 'ACCOUNTNUMBER': {'score': 0.001, 'explanation': "The text mentions 'Investment Account' as the account type associated with the person; however, 'Investment Account' is not an account number, it's just a descriptor. Since no account number is given, returning 'Investment Account' as account number is inaccurate."}}
The per_field_score dictionary contains a granular confidence score and explanation for each extracted field. Since this dictionary can be overwhelming, we provide a get_untrustworthy_fields() method that:
- Prints detailed information about low-confidence fields
- Returns a list of fields that may need manual review due to low trust scores
untrustworthy_fields = tlm.get_untrustworthy_fields(tlm_result=lowest_scoring_text["raw_completion"])
Untrustworthy fields: ['ACCOUNTNUMBER', 'FIRSTNAME']

Field: ACCOUNTNUMBER
Response: Investment Account
Score: 0.001
Explanation: The text mentions 'Investment Account' as the account type associated with the person; however, 'Investment Account' is not an account number, it's just a descriptor. Since no account number is given, returning 'Investment Account' as account number is inaccurate.

Field: FIRSTNAME
Response: Fleta London
Score: 0.13961111111111107
Explanation: The text specifies the full name as 'Fleta London Emard', suggesting 'Fleta' is the first name and 'London' is likely a middle name. The response incorrectly includes 'Fleta London' as the first name, which is not accurate.
This method returns a list of fields whose confidence score is low, allowing you to focus manual review on the specific fields whose extracted value is untrustworthy.
untrustworthy_fields
['ACCOUNTNUMBER', 'FIRSTNAME']
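If you prefer a custom threshold rather than the method's default, you can also filter the per_field_score dictionary directly. A minimal sketch (flag_low_trust_fields is a hypothetical helper, not part of TLM; the scores below are abbreviated from the example above):

```python
# Hypothetical helper: return fields whose per-field trust score falls below a threshold
def flag_low_trust_fields(per_field_score, threshold=0.5):
    return sorted(
        field for field, info in per_field_score.items()
        if info["score"] < threshold
    )

example_scores = {
    "FIRSTNAME": {"score": 0.139, "explanation": "..."},
    "LASTNAME": {"score": 0.999, "explanation": "..."},
    "ACCOUNTNUMBER": {"score": 0.001, "explanation": "..."},
}
print(flag_low_trust_fields(example_scores))  # → ['ACCOUNTNUMBER', 'FIRSTNAME']
```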