Created May 13, 2018 20:09
Script for extracting the ImageNet dataset.
#!/bin/bash
#
# script to extract ImageNet dataset
# ILSVRC2012_img_train.tar (about 138 GB)
# ILSVRC2012_img_val.tar (about 6.3 GB)
# make sure ILSVRC2012_img_train.tar & ILSVRC2012_img_val.tar are in your current directory
#
# https://github.com/facebook/fb.resnet.torch/blob/master/INSTALL.md
#
#  train/
#  ├── n01440764
#  │   ├── n01440764_10026.JPEG
#  │   ├── n01440764_10027.JPEG
#  │   ├── ......
#  ├── ......
#  val/
#  ├── n01440764
#  │   ├── ILSVRC2012_val_00000293.JPEG
#  │   ├── ILSVRC2012_val_00002138.JPEG
#  │   ├── ......
#  ├── ......
#
#
# Extract the training data:
#
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
#
# Extract the validation data and move images to subfolders:
#
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
#
# Check the total number of files after extraction
#
# $ find train/ -name "*.JPEG" | wc -l
# 1281167
# $ find val/ -name "*.JPEG" | wc -l
# 50000
#
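The file-count check at the end of the script can also be scripted. Below is a minimal sketch in Python, assuming the one-level train/ and val/ layout shown in the header comment. `count_images` is a hypothetical helper name, and the expected totals are the ones quoted in the script's comments. Unlike `find`, it only counts files directly inside each class directory.

```python
import os


def count_images(root, ext=".JPEG"):
    """Return (class directory count, image count) for a one-level class layout."""
    n_classes = 0
    n_images = 0
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue  # ignore stray files at the top level
        n_classes += 1
        # Count only files directly inside this class directory.
        n_images += sum(
            1 for f in os.scandir(entry.path)
            if f.is_file() and f.name.endswith(ext)
        )
    return n_classes, n_images


if __name__ == "__main__":
    # Expected totals are taken from the comments at the end of the script.
    expected = {"train": 1281167, "val": 50000}
    for split, total in expected.items():
        if not os.path.isdir(split):
            print(f"{split}/ not found, skipping")
            continue
        classes, images = count_images(split)
        status = "OK" if images == total else "MISMATCH"
        print(f"{split}: {classes} class dirs, {images} images ({status})")
```

Run it from the directory that contains train/ and val/. Since ILSVRC2012 has 1000 classes, both splits should report 1000 class directories when the extraction went through cleanly.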
Thanks!
For those downloading from Hugging Face:
Train dataset
Untar the tarballs into the "train" and "val" directories, then run the following to move each image into a subdirectory named after its class:
import os
import shutil

train_images = os.listdir('train/')
for image in train_images:
    src = 'train/' + image
    # Skip class directories left over from an earlier run.
    if not os.path.isfile(src):
        continue
    # The class name is the first underscore-separated field of the file name.
    cls_name = image.split('_')[0]
    destination = 'train/' + cls_name + '/'
    os.makedirs(destination, exist_ok=True)
    shutil.move(src, destination)
Val dataset
A minor adjustment is needed here: the class name is the fourth underscore-separated field of the file name and still carries the file extension, so the class directories would end up named with the extension attached (for example "cls_name.JPEG"), which might cause issues.
I haven't tried it, but stripping the extension before creating the directory, as done below with os.path.splitext, should solve the issue.
import os
import shutil

val_images = os.listdir('val/')
for image in val_images:
    src = 'val/' + image
    # Skip class directories left over from an earlier run.
    if not os.path.isfile(src):
        continue
    # The fourth field holds the class name plus the extension; drop the extension.
    cls_name = os.path.splitext(image.split('_')[3])[0]
    destination = 'val/' + cls_name + '/'
    os.makedirs(destination, exist_ok=True)
    shutil.move(src, destination)
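The train and val loops differ only in the directory and in which underscore-separated field holds the class name, so they could be folded into one helper. This is a minimal sketch, assuming the file-name patterns implied by the snippets above (class name in field 0 for train, field 3 for val). `sort_into_class_dirs` and its `field` and `dry_run` parameters are hypothetical names, not part of any library.

```python
import os
import shutil


def sort_into_class_dirs(root, field, dry_run=False):
    """Move each file in `root` into a subdirectory named after its class.

    `field` is the index of the underscore-separated part of the file name
    (extension removed) that holds the class name. Assumed values: 0 for
    train, 3 for val. Returns the (source, destination) pairs that were
    moved, or that would be moved when `dry_run` is True.
    """
    moves = []
    for name in sorted(os.listdir(root)):
        src = os.path.join(root, name)
        if not os.path.isfile(src):
            continue  # skip class directories from an earlier run
        stem = os.path.splitext(name)[0]  # drop the extension first
        parts = stem.split("_")
        if field >= len(parts):
            continue  # name does not match the assumed pattern; leave it alone
        dst_dir = os.path.join(root, parts[field])
        dst = os.path.join(dst_dir, name)
        moves.append((src, dst))
        if not dry_run:
            os.makedirs(dst_dir, exist_ok=True)
            shutil.move(src, dst)
    return moves
```

A call such as `sort_into_class_dirs("val", 3, dry_run=True)` lists the planned moves without touching any files, which is a cheap way to confirm the field index before running `sort_into_class_dirs("train", 0)` and `sort_into_class_dirs("val", 3)` for real.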
Thanks a lot!