
@dannguyen
Last active October 16, 2024 08:01
A basic test of OpenAI's Structured Output feature against financial disclosure reports and a newspaper's police blotter. Code examples use the Python SDK and pydantic for the schema definition.

Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output

tl;dr: this demo shows how to call OpenAI's gpt-4o-mini model, provide it with the URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large-scale data gathering projects involving public documents.

OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.

For example, given a Congressional financial disclosure report, with assets defined in a table like this:

[image: the "Schedule A: Assets and 'Unearned' Income" table from a financial disclosure report]

You define the data model you're expecting to extract, either in JSON schema or (as this demo does) via the pydantic library:

from typing import Union

from pydantic import BaseModel


class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool


class DisclosureReport(BaseModel):
    assets: list[Asset]

OpenAI's API infers from the field names (the above example is basic; there are ways to provide detailed descriptions for each data field) how your data model relates to the actual document you're trying to parse, and produces the extracted data in JSON format:

{
  "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
  "owner": "JT",
  "location": "St. Helena/Napa, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Grape Sales",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
},
{
  "asset_name": "25 Point Lobos - Commercial Property [RP]",
  "owner": "SP",
  "location": "San Francisco/San Francisco, CA, US",
  "asset_value_low": 5000001,
  "asset_value_high": 25000000,
  "income_type": "Rent",
  "income_low": 100001,
  "income_high": 1000000,
  "tx_gt_1000": false
}

This demo gist provides code and results for two scenarios:

  • Financial disclosure reports: this is a data-tables-in-PDF problem where you'd typically have to use a PDF parsing library like pdfplumber and write your own data parsing methods.
  • Newspaper police blotter: this is a situation of irregular information — brief descriptions of reported crime incidents, written by a human reporter — where you'd employ humans to read, interpret, and do data entry.

Note: these are very basic examples, using the bare minimum of instructions to the API (e.g. "Extract the text from this image") and relatively little code to define the expected data schema. That said, the results are surprisingly solid.

How to run this code/use this demo

Each example has the Python script used to produce the corresponding JSON output. To re-run these scripts on your own, first create an OpenAI developer account at platform.openai.com and set your API key as the OPENAI_API_KEY environment variable (the scripts assume this default setup).

Then install the OpenAI Python SDK and pydantic:

pip install openai pydantic

For ease of use, these scripts are set up to use gpt-4o-mini's vision capabilities to ingest PNG files via web URLs. To test a URL of your choosing, simply modify the INPUT_URL variable at the top of the script.
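
For orientation, here's the core call pattern that all of these scripts share, condensed from the full scripts at the bottom of this gist (the two-field Asset model is abbreviated here just for illustration):

from openai import OpenAI
from pydantic import BaseModel

INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"

# An abbreviated version of the schema defined earlier in this writeup
class Asset(BaseModel):
    asset_name: str
    owner: str

class DisclosureReport(BaseModel):
    assets: list[Asset]

client = OpenAI()  # reads $OPENAI_API_KEY from the environment

# Pass the pydantic model as response_format; the API returns JSON
# that conforms to the schema
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    response_format=DisclosureReport,
    messages=[
        {"role": "system", "content": "Output the result in JSON format."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the text from this image"},
                {"type": "image_url", "image_url": {"url": INPUT_URL}},
            ],
        },
    ],
)
print(response.choices[0].message.content)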

Financial disclosure report

The following screenshot is taken from the PDF of the full report, which can be found at disclosures-clerk.house.gov. Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.

[image: screenshot of the report page, including the "Schedule A: Assets and 'Unearned' Income" table]

As shown in the following snippet, the results look accurate and as expected. Note that it also correctly parses the "Location" and "Description" fields (when they exist), even though those fields aren't provided in tabular format (i.e. they're globbed into the "Asset" description as free-form text).

It also understands that tx_gt_1000 corresponds to the Tx. > $1,000? header, and that that field contains checkboxes. Even though the sample page has no examples of checked checkboxes, the model correctly infers that tx_gt_1000 is false.

[image: detail of the assets table, showing the AllianceBernstein and Alphabet rows]
    {
      "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
      "owner": "OL",
      "location": "New York, NY, US",
      "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
      "asset_value_low": 1000001,
      "asset_value_high": 5000000,
      "income_type": "Partnership Income",
      "income_low": 50001,
      "income_high": 100000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
      "owner": "SP",
      "location": null,
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "None",
      "income_low": null,
      "income_high": null,
      "tx_gt_1000": false
    },

It's also nice that I didn't have to do even the minimum of "data prep": I gave it a screenshot of the report page — the top third of which has info I don't need — and it "knew" that it should only care about the data under the "Schedule A: Assets and Unearned Income" header.

If I were scraping financial disclosures for real, I would make use of json-schema's "description" attribute, which can be defined via Pydantic like this:

from typing import Union

from pydantic import BaseModel, Field

class Asset(BaseModel):
    asset_name: str = Field(
        description="The name of the asset, under the 'Asset' header"
    )
    owner: str = Field(
        description="Under the 'Owner' header, a 2-letter abbreviation, e.g. SP, DC, JT"
    )
    location: Union[str, None] = Field(
        description="Some records have 'Location:' text as part of the 'Asset' header"
    )
    description: Union[str, None] = Field(
        description="Some records have 'Description:' text as part of the 'Asset' header"
    )
    asset_value_low: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="Under the 'Value of Asset' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="Under the 'Income Type(s) field")
    income_low: Union[int, None] = Field(
        description="Under the 'Income' field, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    income_high: Union[int, None] = Field(
        description="Under the 'Income' field, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    tx_gt_1000: bool = Field(
        description="Under the 'Tx. > $1,000?' header: True if the checkbox is checked, False if it is empty"
    )


class DisclosureReport(BaseModel):
    assets: list[Asset]

But as you can see from the result JSON, OpenAI's model seems "smart" enough to understand a basic data-copying task without specific instructions.

Financial disclosure report with no instruction

I was curious how well the model would do without any instruction, i.e. when you don't bother to define a pydantic model and instead pass in a response format of {"type": "json_object"}:

response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"},
    model="gpt-4o-mini",
    messages=input_messages
)

The answer: just fine. You can see the code and full results at the bottom of this gist.

Without a defined schema, the model treated the entire document (not just the Assets Schedule) as data:

{
  "document": {
    "title": "Financial Disclosure Report",
    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
    "filer_information": {
      "name": "Hon. Nancy Pelosi",
      "status": "Member",
      "state_district": "CA11"
    },
    "filing_information": {
      "filing_type": "Annual Report",
      "filing_year": "2023",
      "filing_date": "05/15/2024"
    },
    "schedule_a": {
      "title": "Schedule A: Assets and 'Unearned' Income",
      "assets": [
        {
          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
          "owner": "JT",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Grape Sales",
          "income": "$100,001 - $1,000,000",
          "location": "St. Helena/Napa, CA, US"
        },

It left the values as text, e.g. "value": "$5,000,001 - $25,000,000" versus "asset_value_low": 5000001. And it left out the optional data fields, e.g. location and description, for entries that didn't have them:

        {
          "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
          "owner": "SP",
          "value": "$1,000,001 - $5,000,000",
          "income_type": "Partnership Income",
          "income": "$50,001 - $100,000",
          "location": "New York, NY, US",
          "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
        },
        {
          "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "None"
        },
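
Incidentally, that string-to-integer conversion is the kind of post-processing you'd owe yourself if you went the schema-less route for real. A minimal sketch of what it looks like (parse_money_range is a hypothetical helper, not part of the demo scripts):

import re

def parse_money_range(value):
    """Return (low, high) integers from a string like '$5,000,001 - $25,000,000'."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"[\d,]+", value)]
    if len(numbers) == 2:
        return numbers[0], numbers[1]
    return None, None

print(parse_money_range("$5,000,001 - $25,000,000"))
# (5000001, 25000000)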

Scanned financial disclosure

As I said earlier, the report screenshot above comes from a PDF with actual machine-readable text — most Congressional disclosure filings in the past 5 years have used the e-filing system, which inherently results in more regular data even when the output is PDF.

So I tried using Structured Outputs on a screenshot of a 2008-era report, and the results were pretty solid.

[image: scanned 2008-era disclosure report page, rotated 90 degrees to horizontal]

The main caveat is that I had to rotate the page orientation by 90 degrees. The model did try to parse the vertically-oriented page, and got about half of the values right — which is probably one of the worst-case scenarios (you'd prefer the model to completely flub things, so that you could at least catch the failures with automated error checks).
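
The rotation itself is a trivial preprocessing step. A minimal sketch using the Pillow library (the file names here are hypothetical; for this demo I rotated the page by hand):

from PIL import Image

# Rotate a vertically-oriented scan 90 degrees before sending it to the API;
# expand=True resizes the canvas to fit the rotated page
page = Image.open("scanned-page.png")
page.rotate(90, expand=True).save("scanned-page-rotated.png")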



Newspaper police blotter

[image: police blotter clipping from the Stanford Daily]

The screenshot was taken from the Stanford Daily archives: https://archives.stanforddaily.com/2004/04/09?page=3&section=MODSMD_ARTICLE12#article

For reasons that are explained in detail below, this example isn't meant to be a rigorous test of the model's capabilities. But it's a fun experiment to see how well the model performs with something that was never meant to be "data" and is inherently riddled with data quality issues.

Consider what the data point of a basic crime incident report might contain:

  • When: a date and time
  • Where: a place
  • Who:
    • a victim
    • a suspect
  • What: the crime the suspect allegedly committed

It's easy to come up with many variations and edge cases:

  • No specific time: e.g. "computer science graduate students reported that they had books stolen from the Gates Computer Science Building in the previous five months"
  • No listed place: it's unclear if the reporter purposefully omitted it, or if it was left off the original police report.
  • No suspect ("an alcohol-related medical call") or no victim (e.g. "an accidental fire call"). Or multiple suspects and multiple victims.

Unlike the financial disclosure example, the input data is freeform narrative text. The onus is entirely on us to define what a blotter report is, which ends up requiring defining what a crime incident is. Not surprisingly, the corresponding Pydantic code is a lot more verbose, and I bet that if you asked 1,000 journalists to write a definition, they'd all be different.

Here's what mine looks like:

from pydantic import BaseModel, Field

# Define the data structures in Pydantic:
# an Incident involves several Persons (victims, perpetrators)
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool


# Pydantic docs on field descriptions:
# https://docs.pydantic.dev/latest/concepts/fields/
class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
    )
    arrest_made: bool
    perpetrators: list[Person]
    victims: list[Person]
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )

class Blotter(BaseModel):
    incidents: list[Incident]

Police blotter results

I ask the model to provide an incident_text field, i.e. the verbatim text from which it extracted the incident data point. This is helpful for evaluating the experiment. But for an actual data project, you might want to omit it, as it adds to the number of output tokens and API cost.

    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )
[image: blotter snippet for the April 2, 11:40 p.m. bike vandalism incident]

The resulting incident_text field extracted from the above snippet is basically correct:

A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.

However, it leaves off the 11:40 p.m. that appears at the beginning of the printed incident, which is something I would normally want to include, because I want to know everything the model looked at when extracting the data point.

The 11:40 p.m. time is correctly included in the rest of the data output:

{
  "date": "April 2",
  "time": "11:40 p.m.",
  "location": "Rains apartments",
  "summary": "Bike vandalized",
  "category": "property",
  "property_damage": "Wheel of bike",
  "arrest_made": false,
  "perpetrators": [
    {
      "description": "Two unknown suspects",
      "gender": "unknown",
      "is_student": false
    }
  ],
  "victims": [
    {
      "description": "A graduate student in the School of Education",
      "gender": "unknown",
      "is_student": true
    }
  ],
  "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
}

The good

As with the financial disclosure report, my script provides a screenshot and leaves it up to OpenAI's model to figure out what's going on. I was pleasantly surprised at how well gpt-4o-mini did in gleaning structure from a newspaper print listicle, with instructions as basic as: "Extract the text from this image".

For example, at first glance it seems that every incident in the blotter has a date (in the subhed) and a time (at the beginning of the graf). But under "Thursday, April 1", you can see that pattern already broken:

[image: the "Thursday, April 1" blotter entries]

Is that second graf ("A female administrator in Materials Science...") a continuation of the 9:30 p.m. incident where a "man reported that someone removed his rear license plate"?

Most human readers, after reading both paragraphs — and then the rest of the blotter — will realize that these are 2 separate incidents. But there's nothing at all in the structure of the text to indicate that. Before I ran this experiment, I thought I would have to provide detailed parsing instructions to the model, e.g.

What you are reading is a police blotter, a list of reported incidents that police were called to. Every paragraph should be treated as a separate incident. Most incidents, but not all, begin with a timestamp, e.g. "11:20 p.m".

But the model saw on its own that there are 2 incidents, and that the second one happened on April 1 at an unspecified time.

 {
      "date": "April 1",
      "time": "9:30 p.m.",
      "location": "Toyon parking lot",
      "summary": "License plate stolen",
      "category": "property",
      "property_damage": "rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
    },
    {
      "date": "April 1",
      "time": "unknown",
      "location": "unknown",
      "summary": "Unauthorized purchase reported",
      "category": "other",
      "property_damage": "computer equipment",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Female administrator",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry\u2019s Electronics sometime in the past five months."
    },

By my count, there are 19 incidents in this issue of the Stanford Daily's police blotter, and the API correctly returns 19 different incidents.

The bad

Again, the data model is inherently messy, and I put in minimal effort to describe what an "incident" is, or the variety of situations and edge cases it can encompass. That, plus the inherent limitations of the data, is the root cause of most of the model's problems.

For example, I intended the perpetrators and victims to be lists of proper nouns or simple nouns, so that we could ask questions like: "How many incidents involved multiple people?" Given the following incident text:

A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.

— this is how the model parsed the suspects:

 "perpetrators": [
    {
      "description": "Two unknown suspects",
      "gender": "unknown",
      "is_student": false
    }
  ]

For a data project, I might have preferred a result that would easily return a result of 2, e.g.:

 "perpetrators": [
    {
      "description": "Unknown suspect",
      "gender": "unknown",
      "is_student": false
    },
    {
      "description": "Unknown suspect",
      "gender": "unknown",
      "is_student": false
    }
  ]

But how should the model know what I'm trying to do sans specific instructions? I think most humans, given the same minimalist instructions, would have also recorded "Two unknown suspects".
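
To make the counting problem concrete, this is the kind of tally I'd eventually want to run over the output (a sketch that assumes the script's JSON output was saved to a hypothetical blotter.json file):

import json

with open("blotter.json") as f:
    incidents = json.load(f)["incidents"]

print(len(incidents))  # 19 for this page

# Note that this undercounts people when the model records a pair as a
# single entry like "Two unknown suspects"
multi_person = [
    i for i in incidents if len(i["perpetrators"]) + len(i["victims"]) > 1
]
print(len(multi_person))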

However, the model struggled with filling out the perpetrators and victims lists, frequently mistaking the suspect/perpetrator for the victim when there was no specific victim mentioned:

A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license.

      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ]

It goes without saying that the model missed the mark when the narrative was more complicated. For example, in the case of the unauthorized purchases at Fry's:

A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months.

The "female administrator" is not the victim, but the person who reported the crime. The victim would be Stanford University, or more specifically, its MScE department.

I'm not surprised the model had problems with identifying victims and suspects, though I'm unsure how much extra instruction would be needed to get reliable results from a general model.

One thing that the model frequently and inexplicably erred on was classifying people's gender.

This is how I defined a Person using pydantic:

class Person(BaseModel):
    description: str
    gender: str
    is_student: bool

Even when the subject's noun has an obvious gender, the model would flub it:

A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot.

      "victims": [
        {
          "description": "A man",
          "gender": "unknown",
          "is_student": false
        }
      ]

It was worse when the subject's noun did not indicate gender, but the rest of the sentence did:

A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments.

"victims": [
        {
          "description": "A graduate student in the School of Education",
          "gender": "unknown",
          "is_student": true
        }
      ],

I'm not sure what the issue is. It might be remedied if I provided explicit and thorough instructions and examples, but gender seemed like a much easier thing to infer than the other things that OpenAI's model was able to figure out on its own.
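
One cheap remedy I could try (a sketch, untested in this demo): constrain the field with typing.Literal and spell out the inference rule in the field description:

from typing import Literal

from pydantic import BaseModel, Field

class Person(BaseModel):
    description: str
    gender: Literal["male", "female", "unknown"] = Field(
        description="Infer from gendered nouns ('man', 'woman') and pronouns ('his', 'her') anywhere in the incident text; otherwise 'unknown'"
    )
    is_student: bool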

The weird

With so many things left to the interpretation of the LLM, it was no surprise that I got different results every time I ran the extract-police-blotter.py script, especially when it came to the categorization of crimes.

In the data specification, I did attempt to describe for the model what I wanted for category:

category: str = Field(
    description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
)

Given the option of saying "other", the model seemed eager to use it for any slightly vague situation. It classified the unauthorized purchases at Fry's as "other", even though embezzlement would fit better under property crime by the FBI's UCR definition. Maybe this could be fixed by providing the model with detailed examples and definitions from statutes and criminal codes?

But ultimately, as I said from the start, the model's performance is bounded by the limitations and errors in the source data. For example, an incident where someone gets hit on the head with a bottle seems to me obviously "violent", i.e. assault:

A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested.

However, the model thinks it is "other":

{
      "date": "April 4",
      "time": "3:05 a.m.",
      "location": "Sigma Alpha Epsilon",
      "summary": "Altercation reported",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Two undergraduate suspects",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
    }

But is the model necessarily wrong? Two "suspects" were apparently identified, but no one was actually arrested. I took this to mean that the suspects fled and hadn't been located at the time of the report. But maybe it's something more benign: an "altercation" happened, but when the cops arrived, everyone was cool including the guy who got hit by the bottle, thus no allegation of assault for police to act on or file as part of their UCR statistics. Ultimately we have to guess the author's intent.

The OpenAI model's performance here wouldn't work for a real data project — but again, this was just a toy experiment, and it doesn't represent what you'd get if you spent more than 10 minutes thinking about the data model, never mind picking a data source slightly more structured than a newspaper listicle. I think OpenAI's model would work very well for something with more substantive text and formal structure, such as obituaries.

#!/usr/bin/env python3
"""
extract-basic-financial-disclosure.py

Parses and extracts structured data — and lets the model infer the structure by itself —
from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504

Full financial disclosure report:
https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf

This script assumes your API key is set up in the default way,
i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""
import base64
import json
from openai import OpenAI
from pathlib import Path
from typing import Union

INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# we are letting the model infer the data structure by itself
# but we still need to tell it to respond in JSON, hence
# response_format={"type": "json_object"}
response = client.beta.chat.completions.parse(
    response_format={"type": "json_object"},
    model="gpt-4o-mini",
    messages=input_messages,
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))

#!/usr/bin/env python3
"""
extract-financial-disclosure.py

Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504

Full financial disclosure report:
https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2023/10059734.pdf

This script assumes your API key is set up in the default way,
i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""
import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field
from typing import Union

INPUT_URL = "https://gist.github.com/user-attachments/assets/9c35e7a4-e6b7-4d5b-a4a2-a62b6ec28504"

# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# a DisclosureReport has a list of assets
class Asset(BaseModel):
    asset_name: str
    owner: str
    location: Union[str, None]
    asset_value_low: Union[int, None]
    asset_value_high: Union[int, None]
    income_type: str
    income_low: Union[int, None]
    income_high: Union[int, None]
    tx_gt_1000: bool

class DisclosureReport(BaseModel):
    assets: list[Asset]

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=DisclosureReport,
    model="gpt-4o-mini",
    messages=input_messages,
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))

#!/usr/bin/env python3
"""
extract-police-blotter.py

Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703

This script assumes your API key is set up in the default way,
i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""
import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field

INPUT_URL = "https://gist.github.com/user-attachments/assets/ceb6db99-e884-4566-bea8-c48b415a5703"

# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# an Incident involves several Persons (victims, perpetrators)
class Person(BaseModel):
    description: str
    gender: str
    is_student: bool

# Pydantic docs on field descriptions:
# https://docs.pydantic.dev/latest/concepts/fields/
class Incident(BaseModel):
    date: str
    time: str
    location: str
    summary: str = Field(description="""Brief summary, less than 30 chars""")
    category: str = Field(
        description="""Type of report, broadly speaking: "violent" , "property", "traffic", "call for service", or "other" """
    )
    property_damage: str = Field(
        description="""If a property crime, then a description of what was stolen/damaged/lost"""
    )
    arrest_made: bool
    perpetrators: list[Person]
    victims: list[Person]
    incident_text: str = Field(
        description="""Include the complete verbatim text from the input that pertains to the incident"""
    )

class Blotter(BaseModel):
    incidents: list[Incident]

## done defining the data structures
##################################################

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=Blotter,
    model="gpt-4o-mini",
    messages=input_messages,
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))

#!/usr/bin/env python3
"""
extract-financial-disclosure.py

Parses and extracts structured data from the screenshot at the given URL:
https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6

The page comes from page 5 of 24 (Schedule III) of the full financial
disclosure report found here:
https://gist.github.com/user-attachments/assets/e430e76a-2519-43fa-a370-85a584b816b6
(the page was manually rotated 90 degrees from its original orientation in the scanned document)

This script assumes your API key is set up in the default way,
i.e. environment variable: $OPENAI_API_KEY
https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
"""
import base64
import json
from openai import OpenAI
from pathlib import Path
from pydantic import BaseModel, Field
from typing import Union, Literal

INPUT_URL = "https://gist.github.com/user-attachments/assets/52c5c8f5-886f-45fe-a338-d1cd3e36ecc8"

# OpenAI examples of Structured Output scripts and data definitions
# https://platform.openai.com/docs/guides/structured-outputs/examples?context=ex2

# Define the data structures in Pydantic:
# a DisclosureReport has a list of assets
class Asset(BaseModel):
    owner: Union[Literal['SP', 'DC', 'JT'], None] = Field(
        description="The leftmost first column of the table"
    )
    asset_name: str = Field(
        description="The name of the asset, the second column of the table"
    )
    asset_value_low: Union[int, None] = Field(
        description="In the third column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'"
    )
    asset_value_high: Union[int, None] = Field(
        description="In the third column, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'"
    )
    income_type: str = Field(description="The fourth column")
    income_low: Union[int, None] = Field(
        description="In the 5th column, the left value of the string and converted to an integer, e.g. '15001' from '$15,001 - $50,000'. If the value is enclosed in parentheses, then the income values are meant to be negative"
    )
    income_high: Union[int, None] = Field(
        description="In the 5th column, the right value of the string and converted to an integer, e.g. '50000' from '$15,001 - $50,000'. If the value is enclosed in parentheses, then the income values are meant to be negative"
    )
    transaction_type: Union[Literal['P', 'S', 'E'], None]

class DisclosureReport(BaseModel):
    assets: list[Asset]

## initialize OpenAI client
client = OpenAI()

# Example of message format for passing in an image via URL
# https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o#url-image-processing
input_messages = [
    {"role": "system", "content": "Output the result in JSON format."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text from this image"},
            {
                "type": "image_url",
                "image_url": {"url": INPUT_URL},
            },
        ],
    },
]

# gpt-4o-mini is cheap and fast and has vision capabilities
response = client.beta.chat.completions.parse(
    response_format=DisclosureReport,
    model="gpt-4o-mini",
    messages=input_messages,
)

message = response.choices[0].message

# Print it out in readable format
obj = json.loads(message.content)
print(json.dumps(obj, indent=2))

{
  "document": {
    "title": "Financial Disclosure Report",
    "header": "Clerk of the House of Representatives \u2022 Legislative Resource Center \u2022 B81 Cannon Building \u2022 Washington, DC 20515",
    "filer_information": {
      "name": "Hon. Nancy Pelosi",
      "status": "Member",
      "state_district": "CA11"
    },
    "filing_information": {
      "filing_type": "Annual Report",
      "filing_year": "2023",
      "filing_date": "05/15/2024"
    },
    "schedule_a": {
      "title": "Schedule A: Assets and 'Unearned' Income",
      "assets": [
        {
          "asset": "11 Zinfandel Lane - Home & Vineyard [RP]",
          "owner": "JT",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Grape Sales",
          "income": "$100,001 - $1,000,000",
          "location": "St. Helena/Napa, CA, US"
        },
        {
          "asset": "25 Point Lobos - Commercial Property [RP]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Rent",
          "income": "$100,001 - $1,000,000",
          "location": "San Francisco/San Francisco, CA, US"
        },
        {
          "asset": "45 Belden Place - Four Story Commercial Building [RP]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "Rent",
          "income": "$100,001 - $1,000,000",
          "location": "San Francisco/San Francisco, CA, US"
        },
        {
          "asset": "AllianceBernstein Holding L.P. Units (AB) [OL]",
          "owner": "SP",
          "value": "$1,000,001 - $5,000,000",
          "income_type": "Partnership Income",
          "income": "$50,001 - $100,000",
          "location": "New York, NY, US",
          "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors."
        },
        {
          "asset": "Alphabet Inc. - Class A (GOOGL) [ST]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "None"
        },
        {
          "asset": "Amazon.com, Inc. (AMZN) [ST]",
          "owner": "SP",
          "value": "$5,000,001 - $25,000,000",
          "income_type": "None"
        }
      ]
    }
  }
}

{
  "assets": [
    {
      "asset_name": "11 Zinfandel Lane - Home & Vineyard [RP]",
      "owner": "JT",
      "location": "St. Helena/Napa, CA, US",
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Grape Sales",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "25 Point Lobos - Commercial Property [RP]",
      "owner": "SP",
      "location": "San Francisco/San Francisco, CA, US",
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Rent",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "45 Belden Place - Four Story Commercial Building [RP]",
      "owner": "SP",
      "location": "San Francisco/San Francisco, CA, US",
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "Rent",
      "income_low": 100001,
      "income_high": 1000000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "AllianceBernstein Holding L.P. Units (AB) [OL]",
      "owner": "OL",
      "location": "New York, NY, US",
      "description": "Limited partnership in a global asset management firm providing investment management and research services worldwide to institutional, high-net-worth and retail investors.",
      "asset_value_low": 1000001,
      "asset_value_high": 5000000,
      "income_type": "Partnership Income",
      "income_low": 50001,
      "income_high": 100000,
      "tx_gt_1000": false
    },
    {
      "asset_name": "Alphabet Inc. - Class A (GOOGL) [ST]",
      "owner": "SP",
      "location": null,
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "None",
      "income_low": null,
      "income_high": null,
      "tx_gt_1000": false
    },
    {
      "asset_name": "Amazon.com, Inc. (AMZN) [ST]",
      "owner": "SP",
      "location": null,
      "description": null,
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "None",
      "income_low": null,
      "income_high": null,
      "tx_gt_1000": false
    }
  ]
}

{
  "incidents": [
    {
      "date": "April 1",
      "time": "9:30 p.m.",
      "location": "Toyon parking lot",
      "summary": "License plate stolen",
      "category": "property",
      "property_damage": "Rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone removed the rear license plate from his vehicle when it was parked in the Toyon parking lot."
    },
    {
      "date": "April 1",
      "time": "unknown",
      "location": "Fry's Electronics",
      "summary": "Unauthorized purchase reported",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A female administrator in Materials Science and Engineering",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A female administrator in Materials Science and Engineering reported that an administrative associate had made an unauthorized purchase of computer equipment at Fry's Electronics sometime in the past five months."
    },
    {
      "date": "April 2",
      "time": "3:30 p.m.",
      "location": "unknown",
      "summary": "Rear license plate stolen reported",
      "category": "property",
      "property_damage": "Rear license plate",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Another man",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Another man reported that the rear license plate was missing from his vehicle."
    },
    {
      "date": "April 2",
      "time": "11:40 p.m.",
      "location": "Rains apartments",
      "summary": "Bike vandalized",
      "category": "property",
      "property_damage": "Wheel of bike",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Two unknown suspects",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "victims": [
        {
          "description": "A graduate student in the School of Education",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "incident_text": "A graduate student in the School of Education reported that two unknown suspects vandalized the wheel of his bike when it was locked near the Rains apartments."
    },
    {
      "date": "April 2",
      "time": "11:40 p.m.",
      "location": "Adelfa",
      "summary": "Medical call",
      "category": "call for service",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "unknown",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Police responded to an alcohol-related medical call in Adelfa."
    },
    {
      "date": "April 3",
      "time": "10:20 p.m.",
      "location": "unknown",
      "summary": "Bike citation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was cited and released for running a stop sign on his bike and for not having a bike light or bike license."
    },
    {
      "date": "April 3",
      "time": "11:20 p.m.",
      "location": "Lomita Drive",
      "summary": "Minor in possession citation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A woman",
          "gender": "female",
          "is_student": false
        }
      ],
      "incident_text": "A woman was cited and released on Lomita Drive for being a minor in possession of alcohol."
    },
    {
      "date": "April 3",
      "time": "11:40 p.m.",
      "location": "unknown",
      "summary": "Driving citation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man was cited and released for driving with a suspended license after he was stopped near Galvez Street and Campus Drive."
    },
    {
      "date": "April 4",
      "time": "1:00 a.m.",
      "location": "unknown",
      "summary": "Car damage reported",
      "category": "property",
      "property_damage": "Trunk and hood",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that someone walked over the top of his car, causing damage to the trunk, top and hood."
    },
    {
      "date": "April 4",
      "time": "1:31 a.m.",
      "location": "Mayfield Avenue",
      "summary": "Bikes U-Locked incident",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "Two men",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Two men were cited and released for walking with bikes U-Locked to themselves on Mayfield Avenue after neither could establish ownership of the bikes."
    },
    {
      "date": "April 4",
      "time": "3:05 a.m.",
      "location": "Sigma Alpha Epsilon",
      "summary": "Altercation reported",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [
        {
          "description": "Two undergraduate suspects",
          "gender": "unknown",
          "is_student": true
        }
      ],
      "victims": [
        {
          "description": "A male undergraduate",
          "gender": "male",
          "is_student": true
        }
      ],
      "incident_text": "A male undergraduate was hit with a bottle on the back of his head during an altercation at Sigma Alpha Epsilon. Two undergraduates were classified as suspects, but no one was arrested."
    },
    {
      "date": "April 5",
      "time": "2:45 a.m.",
      "location": "unknown",
      "summary": "Arrest for intoxication",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "Police arrested a man for being drunk in public on Palm Drive near the entrance arch."
    },
    {
      "date": "April 6",
      "time": "7:15 a.m.",
      "location": "Andronico's Supermarket",
      "summary": "Assisted in detaining suspect",
      "category": "call for service",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "unknown",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Police assisted Andronico's Supermarket in detaining a suspect after a call was made on a blue emergency phone. Palo Alto police later took the suspect into custody."
    },
    {
      "date": "April 6",
      "time": "3:20 p.m.",
      "location": "Studio 3 on Angell Court",
      "summary": "Accidental fire call",
      "category": "call for service",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "unknown",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "Police responded to an accidental fire call to Studio 3 on Angell Court."
    },
    {
      "date": "April 6",
      "time": "9:00 p.m.",
      "location": "unknown",
      "summary": "Found property reported",
      "category": "property",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man reported that he found someone else's personal property in his locked car."
    },
    {
      "date": "April 6",
      "time": "1:45 a.m.",
      "location": "San Jose main jail",
      "summary": "Arrest for trespassing",
      "category": "other",
      "property_damage": "None",
      "arrest_made": true,
      "perpetrators": [],
      "victims": [
        {
          "description": "A local vagrant",
          "gender": "unknown",
          "is_student": false
        }
      ],
      "incident_text": "A local vagrant was booked into the San Jose main jail for trespassing - his seventh time trespassing in a month."
    },
    {
      "date": "April 6",
      "time": "2:50 a.m.",
      "location": "Palm Drive",
      "summary": "Driving citation",
      "category": "other",
      "property_damage": "None",
      "arrest_made": false,
      "perpetrators": [],
      "victims": [
        {
          "description": "A man",
          "gender": "male",
          "is_student": false
        }
      ],
      "incident_text": "A man was cited and released for driving without a license on Palm Drive."
    }
  ]
}

{
  "assets": [
    {
      "owner": "SP",
      "asset_name": "820 Sir Francis Drake Blvd., San Anselmo, CA - Commercial Property",
      "asset_value_low": 1000001,
      "asset_value_high": 5000000,
      "income_type": "RENT",
      "income_low": 100001,
      "income_high": 1000000,
      "transaction_type": "P"
    },
    {
      "owner": "SP",
      "asset_name": "Access Technology Partners, LP",
      "asset_value_low": 0,
      "asset_value_high": 0,
      "income_type": "PARTNERSHIP INCOME/(LOSS)",
      "income_low": -1000000,
      "income_high": -1,
      "transaction_type": "S"
    },
    {
      "owner": "SP",
      "asset_name": "Active, LLC",
      "asset_value_low": 15001,
      "asset_value_high": 50000,
      "income_type": "PARTNERSHIP INCOME/(LOSS)",
      "income_low": -200,
      "income_high": -1,
      "transaction_type": "P"
    },
    {
      "owner": "SP",
      "asset_name": "Agile Software Corp. - Public Common Stock",
      "asset_value_low": 0,
      "asset_value_high": 0,
      "income_type": "CAPITAL GAIN",
      "income_low": 15001,
      "income_high": 50000,
      "transaction_type": "S"
    },
    {
      "owner": "SP",
      "asset_name": "Akamai Technologies Inc. - Public Common Stock",
      "asset_value_low": 50001,
      "asset_value_high": 100000,
      "income_type": "NONE",
      "income_low": null,
      "income_high": null,
      "transaction_type": "P"
    },
    {
      "owner": "SP",
      "asset_name": "Alcatel Lucent Ads - Public Common Stock",
      "asset_value_low": 1001,
      "asset_value_high": 15000,
      "income_type": "DIVIDENDS",
      "income_low": 1,
      "income_high": 200,
      "transaction_type": null
    },
    {
      "owner": "SP",
      "asset_name": "Alcoa Inc. - Public Common Stock",
      "asset_value_low": 15001,
      "asset_value_high": 50000,
      "income_type": "DIVIDENDS",
      "income_low": 201,
      "income_high": 1000,
      "transaction_type": "P"
    },
    {
      "owner": "SP",
      "asset_name": "American International Group Inc. - Public Common Stock",
      "asset_value_low": 250001,
      "asset_value_high": 500000,
      "income_type": "DIVIDENDS",
      "income_low": 2501,
      "income_high": 5000,
      "transaction_type": null
    },
    {
      "owner": "SP",
      "asset_name": "Americas Doctors.com - Preferred Stock",
      "asset_value_low": 1001,
      "asset_value_high": 15000,
      "income_type": "NONE",
      "income_low": null,
      "income_high": null,
      "transaction_type": null
    },
    {
      "owner": "SP",
      "asset_name": "Apple Computer - Public Common Stock",
      "asset_value_low": 5000001,
      "asset_value_high": 25000000,
      "income_type": "CAPITAL GAIN",
      "income_low": 100001,
      "income_high": 1000000,
      "transaction_type": "S"
    },
    {
      "owner": "SP",
      "asset_name": "Aristotle, LLC",
      "asset_value_low": 15001,
      "asset_value_high": 50000,
      "income_type": "NONE",
      "income_low": null,
      "income_high": null,
      "transaction_type": null
    },
    {
      "owner": "SP",
      "asset_name": "Ashlar, Inc. - Common Stock",
      "asset_value_low": 0,
      "asset_value_high": 0,
      "income_type": "CAPITAL GAIN/(LOSS)",
      "income_low": -1001,
      "income_high": -201,
      "transaction_type": "S"
    },
    {
      "owner": "SP",
      "asset_name": "AT&T - Public Common Stock",
      "asset_value_low": 250001,
      "asset_value_high": 500000,
      "income_type": "DIVIDENDS",
      "income_low": 5001,
      "income_high": 15000,
      "transaction_type": null
    }
  ]
}

@nicholishen

I'd like to offer a couple of suggestions that could enhance the effectiveness and reliability of your approach:

  1. Omitting the System Message for Structured Outputs:

    When utilizing Structured Outputs with OpenAI's API, you can omit the system message that specifies 'JSON' outputs. This requirement was primarily relevant to the older response_format="json_object" mode. With the introduction of structured outputs, the API now inherently understands and adheres to the defined schema without needing explicit instructions to format the response as JSON.

  2. Using Enum or typing.Literal for Constrained Parameters:

    To ensure that parameters with limited, predefined options (like the category field in your police blotter example) strictly adhere to those options, it's essential to define them using Python's Enum or typing.Literal. This approach guarantees that only the specified values are generated, as the LLM's constrained generation mechanism masks all tokens except those defined in the Enum or Literal. This not only enforces the constraints within your data model but also enhances the JSON schema by serializing these fields as enums. Consequently, the language model (LLM) is guaranteed to generate outputs that only include the specified enum values, thereby maintaining data consistency and eliminating the risk of unexpected or invalid entries.

    Implementing with Enum:

    from enum import Enum
    from pydantic import BaseModel, Field
    
    class CategoryEnum(str, Enum):
        violent = "violent"
        property = "property"
        traffic = "traffic"
        call_for_service = "call for service"
        other = "other"
    
    class Incident(BaseModel):
        # ... other fields ...
        category: CategoryEnum = Field(
            description="""Type of report, broadly speaking: "violent", "property", "traffic", "call for service", or "other"."""
        )
        # ... remaining model ...

    Implementing with typing.Literal:

    from typing import Literal
    from pydantic import BaseModel, Field
    
    class Incident(BaseModel):
        # ... other fields ...
        category: Literal["violent", "property", "traffic", "call for service", "other"] = Field(
            description="""Type of report, broadly speaking: "violent", "property", "traffic", "call for service", or "other"."""
        )
        # ... remaining model ...

    Benefits:

    • Validation: Pydantic will enforce that only the specified values are accepted, raising validation errors for any deviations.
    • Schema Clarity: The generated JSON schema will clearly define the allowed values, aiding both developers and the LLM in understanding the expected data.
    • LLM Guarantee: By constraining the field to specific values using Enum or typing.Literal, the LLM is guaranteed to produce outputs that only include the specified enum values, ensuring strict adherence to the defined schema.

@dannguyen (Author)

@nicholishen that's a good point about using Enum. I left it open for the purposes of the experiment just to see how/if the model would come up with its own categories.

@morisil commented Oct 15, 2024

Thank you for this post. I just made a library, anthropic-sdk-kotlin, which features the use of tools and automatic schema generation from serializable data classes. I wanted to give your example a try with the Anthropic API, and it works flawlessly:

https://github.com/xemantic/anthropic-sdk-kotlin/blob/main/src/jvmTest/kotlin/StructuredOutputTest.kt

package com.xemantic.anthropic

import com.xemantic.anthropic.message.*
import com.xemantic.anthropic.tool.AnthropicTool
import com.xemantic.anthropic.tool.UsableTool
import io.kotest.assertions.assertSoftly
import io.kotest.matchers.collections.shouldHaveSize
import io.kotest.matchers.shouldBe
import kotlinx.coroutines.test.runTest
import kotlinx.serialization.Serializable
import kotlin.test.Test

/**
 * This test tool is based on the
 * [article by Dan Nguyen](https://gist.github.com/dannguyen/faaa56cebf30ad51108a9fe4f8db36d8),
 * who showed how to extract financial disclosure reports as structured data by using OpenAI API.
 * I wanted to try out the same approach with Anthropic API, and it seems like a great test case of this library.
 */
@AnthropicTool(
  name = "DisclosureReport",
  description = "Extract the text from this image"
)
class DisclosureReport(
  val assets: List<Asset>
) : UsableTool {
  override suspend fun use(toolUseId: String) = ToolResult(
    toolUseId, "Data provided to client"
  )
}

@Serializable
data class Asset(
  val assetName: String,
  val owner: String,
  val location: String?,
  val assetValueLow: Int?,
  val assetValueHigh: Int?,
  val incomeType: String,
  val incomeLow: Int?,
  val incomeHigh: Int?,
  val txGt1000: Boolean
)

/**
 * This test is located in the jvmTest folder, so it can use File API to read image files.
 * In the future it can probably be moved to jvmAndPosix to support all the Kotlin platforms
 * having access to the filesystem.
 */
class StructuredOutputTest {

  @Test
  fun shouldDecodeStructuredOutputFromReportImage() = runTest {
    val client = Anthropic {
      tool<DisclosureReport>()
    }

    val response = client.messages.create {
      +Message {
        +"Decode structured output from supplied image"
        +Image(
          path = "test-data/financial-disclosure-report.png",
          mediaType = Image.MediaType.IMAGE_PNG
        )
      }
      useTool<DisclosureReport>()
    }

    val tool = response.content.filterIsInstance<ToolUse>().first()
    val report = tool.input<DisclosureReport>()

    report.assets shouldHaveSize 6
    assertSoftly(report.assets[0]) {
      assetName shouldBe "11 Zinfandel Lane - Home & Vineyard [RP]"
      owner shouldBe "JT"
      location shouldBe "St. Helena/Napa, CA, US"
      assetValueLow shouldBe 5000001
      assetValueHigh shouldBe 25000000
      incomeType shouldBe "Grape Sales"
      incomeLow shouldBe 100001
      incomeHigh shouldBe 1000000
      txGt1000 shouldBe false
    }
    assertSoftly(report.assets[1]) {
      assetName shouldBe "25 Point Lobos - Commercial Property [RP]"
      owner shouldBe "SP"
      location shouldBe "San Francisco/San Francisco, CA, US"
      assetValueLow shouldBe 5000001
      assetValueHigh shouldBe 25000000
      incomeType shouldBe "Rent"
      incomeLow shouldBe 100001
      incomeHigh shouldBe 1000000
      txGt1000 shouldBe false
    }
  }

}
