Skip to content

Instantly share code, notes, and snippets.

View dannguyen's full-sized avatar
💭
havin a normal one

Dan Nguyen dannguyen

💭
havin a normal one
View GitHub Profile
@dannguyen
dannguyen / README.openai-structured-output-demo.md
Last active November 21, 2024 01:53
A basic test of OpenAI's Structured Output feature against financial disclosure reports and a newspaper's police blotter. Code examples use the Python SDK and pydantic for the schema definition.

Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output

tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large scale data gathering projects involving public documents.

OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.

For example, given a Congressional financial disclosure report, with assets defined in a table like this:

@dannguyen
dannguyen / skimschema.py
Created September 18, 2024 18:03
A command-line python script that reads CSV files, samples their data, and prints the samples in transposed longform, i.e. one column per data row, one row per data attribute
#!/usr/bin/env python3
"""
skimschema.py
==============
Create an excel file of transposed data rows, for easy browsing of
a data file's contents (csvs only for now)
Longer description
@dannguyen
dannguyen / bq-sfpd-query.sql
Created July 26, 2022 16:00
Example of querying BigQuery's public dataset of SFPD crime incidents
SELECT
unique_key
, pddistrict AS pd_district
, DATE(timestamp) AS incident_date
, category
, descript AS description
, dayofweek AS day_of_week
, resolution
, UPPER(address) AS address
, longitude
@dannguyen
dannguyen / fetch_ghstars.md
Last active November 14, 2024 00:16
fetch_ghstars.py: quick CLI script to fetch from Github API all of a user's starred repos and save it as raw JSON and wrangled CSV

fetch_ghstars.py: quick CLI script to fetch and collate from Github API all of a user's starred repos

  • Requires Python 3.6+
  • Creates a subdir 'ghstars-USERNAME' at the current working directory
  • the raw JSON of each page request is saved as: 01.json, 02.json 0n.json
  • A flattened, filtered CSV is also created: wrangled.csv

Example usage:

@dannguyen
dannguyen / aws-transcribe-2020-10-biden-palin.md
Last active February 10, 2021 01:29
i only created this gist to respond to someone responding to my older aws-transcribe-via-cli gist

Amazon Transcribe (real-time) streaming sample, with speakers identified (2020-10-09)

Note: This gist refers this older gist that shows the AWS transcribe API: https://gist.github.com/dannguyen/9b8c51f5bb853209f19f1a0f18f0f74c

I went into the AWS console for Transcription, which has an interface for real-time transcription here: https://console.aws.amazon.com/transcribe/home?region=us-east-1#realTimeTranscription

Then I used my phone to play out this snippet of the 2008 VP presidential debate, featuring speech from Biden and Palin: https://twitter.com/dancow/status/1313951588428517385

fieldname value
act 1
scene 5
speaker Horatio
lines Propose the oath, my lord.
~~~~~~~~~
act 1
scene 5
speaker Hamlet
@dannguyen
dannguyen / README-xsv-split-windows.md
Last active August 27, 2020 07:00
How to install and use xsv to split a large CSV file (Windows)

How to use xsv (in Windows) to split up a CSV file too big for Excel

I wrote these instructions on how to install and use xsv – a powerful CSV-handling command-line tool, because someone asked how to deal with a data file that was too big to open in Excel or even Notepad. I didn't know how familiar the person was with installing/running downloadable .exe files or with Powershell, so I've tried to include some general instructions that hopefully are useful to even novices.

This mini-guide is not at all meant to be exhaustive as it basically shows just one of xsv's many useful functions. But if you're new to the idea of using command-line tools to do things, hopefully this can be a friendly intro to it.


Here's an example of a CSV that, at 3 million rows, is too big for Excel to open: https://burntsushi.net/stuff/worldcitiespop.csv

@dannguyen
dannguyen / bash-prompt.md
Last active August 19, 2020 00:05
my bash prompt with a ghost and stuff

this goes in my bash profile:

XRESET='\[\033[00m\]'
PROMPT_PATH="\[\033[0;33m\]\W${XRESET} \[\033[1;37m\]\$${XRESET}"
PROMPT_GHOST="༼ つ\[\033[1;33m\]°${XRESET}\[\033[1;31m\]︻\[\033[1;33m\]゜${XRESET}༽つ🐕"

export PS1="${PROMPT_GHOST} ${PROMPT_PATH} "
@dannguyen
dannguyen / normalize-ascii-google-sheet-README.md
Last active August 25, 2020 22:17
A modified Google App Script hack to normalize Vietnamese characters into ASCII
@dannguyen
dannguyen / DANS SECRET STUFF.md
Created July 16, 2020 22:16
DANS SECRET STUFF

test test test