@afeld
Last active November 5, 2021 15:20
reduce size of CSV with pandas
import pandas as pd
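# maintain the original data format by reading every column as strings;
# keep_default_na=False stops pandas from turning values like "NA" into missing values
original = pd.read_csv("original.csv", dtype="object", keep_default_na=False)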
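# set random_state so the sample is reproducible; alternatively, pass `frac` instead of `n` to sample a fraction of the rows.
# cap n at the row count, since sample() raises a ValueError when n exceeds it.
# sort_index() restores the original row order.
sampled = original.sample(n=min(5000, len(original)), random_state=1).sort_index()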
# exclude the index so the columns match the original
sampled.to_csv("sampled.csv", index=False)
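
The `frac` alternative mentioned in the comment above could look roughly like the sketch below. It assumes a small in-memory CSV (the `csv_text` string and its `id`/`code` columns are made up for illustration) in place of `original.csv`, so it runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for original.csv: 100 rows with zero-padded codes.
csv_text = "id,code\n" + "\n".join(f"{i},{i:03d}" for i in range(100)) + "\n"

# Read everything as strings so values such as "007" keep their leading zeros.
original = pd.read_csv(io.StringIO(csv_text), dtype="object")

# Keep 10% of the rows instead of a fixed count; random_state keeps it reproducible,
# and sort_index() puts the sampled rows back in their original order.
sampled = original.sample(frac=0.1, random_state=1).sort_index()

# Write to a string buffer here; pass a path such as "sampled.csv" to write a file.
out = io.StringIO()
sampled.to_csv(out, index=False)
result = out.getvalue()
```

Using `frac` keeps the sample proportional to the input, which may be handy when the same script is run against files of very different sizes.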