@mhmdiaa
Last active November 21, 2024 05:15
import requests
import sys
import json


def waybackurls(host, with_subs):
    # Query the Wayback Machine CDX API for every archived URL under host.
    if with_subs:
        url = 'http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey' % host
    else:
        url = 'http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey' % host
    r = requests.get(url)
    results = r.json()
    # The first row is the header (["original"]), so skip it.
    return results[1:]


if __name__ == '__main__':
    argc = len(sys.argv)
    if argc < 2:
        print('Usage:\n\tpython3 waybackurls.py <url> <include_subdomains:optional>')
        sys.exit()

    host = sys.argv[1]
    # Any second argument turns on the subdomain search.
    with_subs = False
    if argc > 2:
        with_subs = True
    urls = waybackurls(host, with_subs)
    json_urls = json.dumps(urls)
    if urls:
        filename = '%s-waybackurls.json' % host
        with open(filename, 'w') as f:
            f.write(json_urls)
        print('[*] Saved results to %s' % filename)
    else:
        print('[-] Found nothing')
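
The script writes the CDX rows to disk unchanged, so each entry in the saved JSON file is a one-element list such as ["http://example.com/"] rather than a bare string. A minimal sketch of flattening those rows, assuming the response shape returned for output=json&fl=original (a header row followed by one row per URL). The name flatten_cdx_rows is a hypothetical helper and is not part of the gist:

```python
def flatten_cdx_rows(rows):
    """Turn CDX JSON rows like [["original"], ["http://a/"]] into ["http://a/"].

    Hypothetical helper: assumes the first row is the header and that every
    following row holds a single field (fl=original).
    """
    # Skip the header row and ignore any empty rows.
    return [row[0] for row in rows[1:] if row]


if __name__ == '__main__':
    sample = [
        ["original"],
        ["http://example.com/"],
        ["http://example.com/robots.txt"],
    ]
    for url in flatten_cdx_rows(sample):
        print(url)
```

Note that waybackurls() in the gist already drops the header row, so its return value would need the header put back (or the slice removed from the helper) before being passed in.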

rrampage commented Sep 30, 2020

A bash function that uses jq (it does not search subdomains, but it works for any URL prefix). It prints the full Web Archive URL of each capture, which generally has the format https://web.archive.org/web/$TIMESTAMP/$ORIGINAL:

wb () 
{ 
    if [[ -z $1 ]]; then
        echo "Usage: wb URL";
    else
        curl "http://web.archive.org/cdx/search/cdx?url=$1/*&output=json&fl=original,timestamp" 2> /dev/null | jq '.[1:][] |"https://web.archive.org/web/" +.[1] + "/" + .[0]' 2> /dev/null;
    fi
}

This can be added to the ~/.bashrc or relevant shell profile.

Usage: wb gist.github.com/mhmdiaa
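
For comparison with the Python script at the top, the jq expression can be sketched in Python as follows. This assumes the rows come from a CDX query with output=json&fl=original,timestamp, so the first row is the header and each later row is [original, timestamp]. The function name archive_urls is made up for illustration:

```python
def archive_urls(rows):
    """Build snapshot URLs from CDX rows requested with fl=original,timestamp."""
    # rows[0] is the header row (["original", "timestamp"]); skip it.
    return [
        "https://web.archive.org/web/%s/%s" % (timestamp, original)
        for original, timestamp in rows[1:]
    ]


if __name__ == '__main__':
    sample = [
        ["original", "timestamp"],
        ["http://example.com/", "20200101000000"],
    ]
    for url in archive_urls(sample):
        print(url)
```

Unlike the jq version, which prints JSON strings wrapped in double quotes unless jq is run with -r, this prints the bare URLs.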

akamhy commented Oct 2, 2020

Hi,
Just wanted to tell you that I used your idea in https://github.com/akamhy/waybackpy. [commit]

Usage:

pip3 install waybackpy
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls

Output:

http://akamhy.github.io
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/
https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11
https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c
6 URLs found and saved in ./akamhy.github.io-6-urls.txt

Flags:

  1. '--alive' fetches only URLs that are still live. It is slower for websites with very many archived URLs, e.g. google.
  2. '--subdomain' will include URLs from subdomains.

See live use @ https://repl.it/@akamhy/Waybackpy-Known-Urls#main.sh

@BUNTY070

Thank you, man.

@odalpride

What should I do if I have installed wb in Python and want to try it in Go? They have the same initialization. How do I use it in this case?

@00xNetrunner

Hey man, just want to say I used your idea as well. You have been credited :) I made the script because the waybackurls tool was not working on my install.

@mohammedouahman

it works well
