Cool work! Correct me if I'm wrong, but I believe that to use OpenAI's new, more reliable Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.

[0] https://platform.openai.com/docs/guides/structured-outputs/s...
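For reference, here's a minimal sketch of the two response_format shapes, following the Structured Outputs guide. The schema name `extraction` and the `title`/`date` fields are placeholders I made up for illustration:

```python
# Old JSON mode: the model returns *some* valid JSON; structure is inferred.
json_mode = {"type": "json_object"}

# Structured Outputs: a schema is required, and "strict" enforces conformance.
structured = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction",  # placeholder name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["title", "date"],
            "additionalProperties": False,
        },
    },
}
```

Passing only `{"type": "json_schema"}` without the nested `json_schema` object is what triggers the error above.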
I’ve been using jsonschema since forever with function calling. Does structured output just formalize things?
Function calling provides a "hint" in the form of a JSON schema for an LLM to follow; the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they become less reliable at generating schema-conformant JSON.
Structured outputs apply a context-free grammar to the prediction generation so that, for each token generation, only tokens that generate a perfectly conformant JSON schema are considered.
The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.
Happy to link you to some papers I've skimmed on it if you're interested!
Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.
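A toy sketch of that masking step, if it helps. Real implementations work over the tokenizer's full vocabulary with a compiled grammar; here the "grammar" is just a hardcoded set of allowed token ids:

```python
import math

def constrained_sample_probs(logits, valid_token_ids):
    """Toy constrained decoding: mask tokens the grammar disallows
    to probability zero, then renormalize over the survivors."""
    masked = [x if i in valid_token_ids else float("-inf")
              for i, x in enumerate(logits)]
    m = max(masked)  # finite as long as at least one token is valid
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the grammar says only tokens 0 and 2 may appear next:
probs = constrained_sample_probs([1.0, 3.0, 2.0, 0.5], {0, 2})
# probs[1] and probs[3] are exactly 0; the remaining mass sums to 1.
```

Note token 1 had the highest raw logit but gets zero probability, which is exactly the "off the happy path" effect mentioned upthread.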
Made a small project to help extract structure from documents (PDF, JPG, etc. -> JSON or CSV): https://datasqueeze.ai/
There are 10 free pages to extract if anyone wants to give it a try. I've found that just sending a PDF to models doesn't extract it properly, especially with longer documents. I've tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
Is this simply the OCR bits to feed to OpenAI's structured output?
Similarly, I've found old-school OCR is needed for more reliability.
I've been using this to OCR some photos I took of books and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model, and ask it to normalize into a schema, but I found that just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was much more accurate.
Combining Google's OCR with an LLM gives OCR superpowers. Tell the LLM the text came from OCR and ask it to correct it.
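Something like this, roughly. The exact instruction wording is just what's worked for me, not anything official:

```python
def build_ocr_correction_messages(ocr_text: str) -> list:
    """Wrap raw OCR output in a chat prompt asking the model to fix
    recognition errors without inventing or dropping content."""
    return [
        {"role": "system",
         "content": ("The following text came from OCR and may contain "
                     "recognition errors (swapped characters, broken words, "
                     "merged lines). Correct obvious OCR mistakes only; do "
                     "not add, remove, or reorder content.")},
        {"role": "user", "content": ocr_text},
    ]

messages = build_ocr_correction_messages("Tne qulck brown f0x")
```

The "correct mistakes only" constraint matters; without it the model happily rewrites the text.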
That sounds like it could be very dangerous when the LLM gets it wrong...
Tried it, works great, ty!
I’m making a free open source library for this, check it at http://github.com/fetchfox/fetchfox
MIT license. It's just one line of code to get started: `fox.run("get data from example.com")`
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.
Worked better than any OCR we tried.
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision
To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
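Here's roughly the payload shape for sending the rendered page inline, following the vision guide's base64 data-URL format. The actual PDF-to-PNG rendering step (pdftoppm, pdf2image, etc.) is left out; the prompt text is just an example:

```python
import base64

def image_message(png_bytes: bytes, prompt: str) -> dict:
    """Build a Chat Completions user message carrying an inline
    base64-encoded PNG alongside a text instruction."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# png_bytes would come from rendering one PDF page; dummy bytes shown here.
msg = image_message(b"\x89PNG...", "Extract the table as JSON.")
```

One message per page keeps long documents from blowing the context window.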
I don’t think it does OCR. It’s able to use the structure of the PDF to guide the parsing.
Stuff like this shows how much better the commercial models are than local models. I've been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
The Berkeley Function-Calling Leaderboard tracks function calling/structured data performance from multiple models: https://gorilla.cs.berkeley.edu/leaderboard.html
Llama isn't on there but a few finetunes of it (Hermes) are OSS.
In my tests, Llama 3.1 8b was way worse than Llama 2 13b or Solar 13b.
I mean, those aren't comparable models. I wonder how the 405b version compares.
You raise a valid point, but 4o is way smaller than 405B. And 4o mini that's described in the article is highly likely <30B (if we're talking dense models).
Is the size of OpenAI's models public, or is this guesswork?
If your company has a lot of ex-OpenAI employees, then you know ;)
And the public numbers are mostly right; the latest values are likely smaller now, as they have been working on downsizing everything.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with the humble Llama 3 8B here, and results for Federal Register-type stuff (without going into details) were good for something I was expecting to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.