
@am0d
Created November 8, 2011 03:39
BeautifulSoup - split a single HTML file into multiple pages
from BeautifulSoup import BeautifulSoup, Tag
import re
# Read the raw file into raw
f = open("BoP - APV - taggeds.html")
raw = f.read()
print '[*] Read raw text in'
# Construct a BeautifulSoup out of the raw HTML
soup = BeautifulSoup(raw)
print '[*] Constructed BeautifulSoup'
# Holds all our individual pages
pages = []
# Pull out all the text (with tags) between HR tags, which
# delineate the pages in the document
tag = soup.find("hr")
def textTillNextPage(t):
    # Accumulate the string form of each sibling until we hit the next <hr>
    s = ""
    while t != None and not (isinstance(t, Tag) and t.name == 'hr'):
        s += str(t)
        t = t.nextSibling
    return t, s
# Loop through the document until the end, and pull out each page
t = tag.nextSibling
while t != None:
    t, s = textTillNextPage(t)
    pages.append(s)
    if t != None:
        t = t.nextSibling
print '[*] Separated all the pages out'
# Create the relevant regular expressions
r = re.compile(r'PSALM ([0-9]+)', re.I)
r2 = re.compile(ur"[#\xc5\xc3]", re.I)
def getTextOffPage(n):
    s = BeautifulSoup(pages[n])
    # Scan forward until we find the 'PSALM xxx' header
    t = s.first()
    while t != None and not r.search(str(t)):
        t = t.nextSibling
    if t is None:
        return None, ''
    psalmNum = r.search(str(t)).groups()[0]
    # Copy the remaining elements into a new soup, skipping runs of
    # elements that match r2 (musical score lines) and the <br>s after them
    p = BeautifulSoup()
    while t != None:
        nxt = t.nextSibling
        if not r2.search(str(t)):
            p.append(t)
        else:
            t = nxt
            while t != None and (r2.search(str(t)) or (isinstance(t, Tag) and t.name == 'br')):
                t = t.nextSibling
            nxt = t
        t = nxt
    return p, psalmNum
pageNum = 10
psalm = 1
while pageNum < 368:
    thisPsalm, psalm = getTextOffPage(pageNum)
    if thisPsalm is None:
        # No psalm header on this page - skip it
        pageNum = pageNum + 1
        continue
    pageNum = pageNum + 1
    # Absorb any following pages that belong to the same psalm
    nextPsalm = psalm
    while nextPsalm == psalm:
        nextPage, nextPsalm = getTextOffPage(pageNum)
        if nextPsalm == psalm:
            nextPage.first().extract()  # Removes the title from the top of the page
            thisPsalm.append(nextPage)
            pageNum = pageNum + 1
    f = open('pages/psalm ' + psalm + '.html', 'w')
    f.write(str(thisPsalm))
    f.close()
    print '[*] Psalm ' + psalm
smit245 commented Apr 28, 2023

How does it work?
Does it split the HTML on <hr> tags, or on the basis of page size?

am0d (Author) commented Apr 29, 2023

How does it work? Does it split the HTML on <hr> tags, or on the basis of page size?

This was written as a one-off for a very specific case.

I had a PDF file that I had converted to HTML, but I wanted to split it into multiple individual HTML files.
The original file contained 150 psalms, as well as other content which I was not concerned with for this process.

Each page of the original PDF was separated in this document by an <hr /> tag.

Something like this:

PDF Page 1
<hr />
PDF Page 2
<hr />
PDF Page 3
<hr />
.
.
.
(etc.)
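
For anyone revisiting this today: with BeautifulSoup 4 that splitting step might look roughly like the sketch below. This is a hedged rewrite, not the original code - converted.html is a placeholder filename, and it assumes the converted HTML keeps everything under a single <body>.

from bs4 import BeautifulSoup

with open('converted.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

pages = [[]]
for el in soup.body.children:            # walk the top-level nodes in order
    if getattr(el, 'name', None) == 'hr':
        pages.append([])                 # each <hr> starts a new "pdf page"
    else:
        pages[-1].append(str(el))
pages = [''.join(p) for p in pages]      # one HTML string per "pdf page"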

The psalms that I wanted to extract potentially spanned multiple pages of the PDF.
Each psalm had a header like PSALM 13, which is what the regex r would match.
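
As a quick illustration, r (copied verbatim from the gist) captures that number as a string, which is also why the filename concatenation at the end of the script works:

import re

r = re.compile(r'PSALM ([0-9]+)', re.I)
m = r.search('<b>PSALM 13</b>')     # the gist runs this over str(tag)
print(m.group(1))                   # -> '13'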

The entire process in this gist went something like this:

  1. Split the HTML file into the individual PDF pages (delineated by the <hr /> tags).
  2. Beginning at page 10 of the PDF (determined by manual inspection), get the text for the next "pdf page".
    a. Construct a BeautifulSoup instance of the "pdf page".
    b. Iterate through the elements on this page until we find a tag with contents matching r (the "Psalm xxx" header).
      • If it is not found on this page, return None, which makes step 2 continue to the next page.
    c. Once that header is found, construct a new BeautifulSoup instance from the remaining elements on that page.
      • I think (I no longer have the files) there was one <p> per line of text that I wanted.
      • Some lines contained musical score rather than text - I wasn't interested in those; I believe that's what r2 filtered out.
    d. Once we've reached the end of that "pdf page", return the newly constructed BeautifulSoup and the psalm number it belongs to.
  3. If the page we've just parsed belongs to the same psalm as the previous page:
    a. Remove the first tag from the page, which is just the psalm number.
    b. Append the remaining contents of the new page to the BeautifulSoup instance containing the previous pages of the psalm.
    c. Repeat with the following "pdf pages" until we reach the next psalm.
  4. Write out the BeautifulSoup instance to a file (a condensed sketch of steps 2-4 follows this list).
  5. Continue the process for the next psalm.
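
Condensed into a sketch, steps 2-4 might look like this in modern Python. It reuses the pages list from the earlier snippet and deliberately skips the score-line filtering and repeated-title removal that the real code did, so treat it as an outline rather than a drop-in replacement:

import re

header = re.compile(r'PSALM ([0-9]+)', re.I)    # regex from the gist

def psalm_number(page_html):
    # Step 2b: return the psalm number on this page, or None if absent
    m = header.search(page_html)
    return m.group(1) if m else None

page_num = 10                                   # first psalm page (step 2)
while page_num < len(pages):
    psalm = psalm_number(pages[page_num])
    if psalm is None:
        page_num += 1                           # no header: move on
        continue
    chunks = [pages[page_num]]
    page_num += 1
    # Step 3: absorb continuation pages carrying the same header
    while page_num < len(pages) and psalm_number(pages[page_num]) == psalm:
        chunks.append(pages[page_num])
        page_num += 1
    with open('pages/psalm %s.html' % psalm, 'w') as out:   # step 4
        out.write(''.join(chunks))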

Some notes:

  • This was an unoptimized process for a one-time job.
  • The first page of each psalm would end up being parsed twice: while looking for the next page of a psalm, the code would fully parse each following page, then discard the results if that page did not belong to the same psalm.
  • This process was not perfect; I had to go through all the generated pages afterwards and do additional cleanup. That was in large part due to the way the PDF had originally been converted to HTML.

@smit245 Hope that gives some insight. It's been a long time since I wrote this code, and I don't think I have any of the documents I was processing, so I can't offer much more detail than this.
