from BeautifulSoup import BeautifulSoup, Tag
import re

# Read the raw file into raw
f = open("BoP - APV - taggeds.html")
raw = f.read()
print '[*] Read raw text in'

# Construct a BeautifulSoup out of the raw HTML
soup = BeautifulSoup(raw)
print '[*] Constructed BeautifulSoup'

# Holds all our individual pages
pages = []

# Pull out all the text (with tags) between HR tags, which
# delineate the pages in the document
tag = soup.find("hr")

def textTillNextPage(t):
    # Accumulate the markup of every sibling until the next <hr>
    # (or the end of the document); return the stopping tag and the text
    s = ""
    while t is not None and not (isinstance(t, Tag) and t.name == 'hr'):
        s += str(t)
        t = t.nextSibling
    return t, s

# Loop through the document until the end, and pull out each page
t = tag.nextSibling
while t is not None:
    t, s = textTillNextPage(t)
    pages.append(s)
    if t is not None:
        t = t.nextSibling
print '[*] Separated all the pages out'

# Create the relevant regular expressions
r = re.compile(r'PSALM ([0-9]+)', re.I)  # matches a psalm header
r2 = re.compile(ur"[#\xc5\xc3]", re.I)   # matches lines of musical score

def getTextOffPage(n):
    # Parse page n and return (soup of the psalm text, psalm number),
    # or (None, '') if no psalm header appears on the page
    s = BeautifulSoup(pages[n])
    t = s.first()
    while t is not None and not r.search(str(t)):
        t = t.nextSibling
    if t is None:
        return None, ''
    psalmNum = r.search(str(t)).groups()[0]
    p = BeautifulSoup()
    while t is not None:
        nxt = t.nextSibling
        if not r2.search(str(t)):
            p.append(t)
        else:
            # Skip over a run of musical-score lines and the <br> tags
            # interleaved with them
            t = nxt
            while t is not None and (r2.search(str(t)) or (isinstance(t, Tag) and t.name == 'br')):
                t = t.nextSibling
            nxt = t
        t = nxt
    return p, psalmNum

pageNum = 10
psalm = 1
while pageNum < 368:
    thisPsalm, psalm = getTextOffPage(pageNum)
    if thisPsalm is None:  # no psalm header on this page; move on
        pageNum = pageNum + 1
        continue
    pageNum = pageNum + 1
    nextPsalm = psalm
    while nextPsalm == psalm:
        nextPage, nextPsalm = getTextOffPage(pageNum)
        if nextPsalm == psalm:
            nextPage.first().extract()  # Removes the title from the top of the page
            thisPsalm.append(nextPage)
            pageNum = pageNum + 1
    f = open('pages/psalm ' + psalm + '.html', 'w')
    f.write(str(thisPsalm))
    f.close()
    print '[*] Psalm ' + psalm
How does it work? Does it split the html on the `<hr>` tag, or on the basis of page size?
This was written as a one-off for a very specific case.
I had a PDF file that I had converted to HTML, but I wanted to split it into multiple individual HTML files.
The original file contained 150 psalms, as well as other content which I was not concerned with for this process.
Each page of the original PDF was separated in this document by an `<hr />` tag.
Something like this:
PDF Page 1
<hr />
PDF Page 2
<hr />
PDF Page 3
<hr />
.
.
.
(etc.)
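That page structure can be reproduced in miniature and split with nothing but the standard library (the HTML snippet below is a made-up stand-in for the converted document; the gist itself walks siblings with BeautifulSoup instead):

```python
import re

# Hypothetical miniature of the converted document: three "PDF pages"
# separated by <hr /> tags.
html = ("<p>PSALM 1</p><p>Blessed is the man...</p>"
        "<hr /><p>...continuation of Psalm 1</p>"
        "<hr /><p>PSALM 2</p><p>Why do the heathen rage...</p>")

# Splitting on the separators yields one chunk of markup per PDF page.
pages = re.split(r'<hr\s*/?>', html)

print(len(pages))  # 3
```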
The psalms that I wanted to extract potentially spanned multiple pages of the PDF.
Each psalm had a header like PSALM 13, which is what the regex `r` would match.
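The header regex is easy to exercise in isolation; `re.I` is what lets it match `Psalm`, `PSALM`, and so on:

```python
import re

# The header regex from the gist; the capture group holds the psalm number.
r = re.compile(r'PSALM ([0-9]+)', re.I)

m = r.search('<b>Psalm 13</b>')  # re.I makes the match case-insensitive
print(m.group(1))  # '13'
```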
The entire process in this gist went something like this:
1. Split the html file into the individual PDF pages (delineated by the `<hr />` tags).
2. Beginning at page 10 of the PDF (determined by manual inspection), get the text for the next "pdf page":
   a. Construct a `BeautifulSoup` instance of the "pdf page".
   b. Iterate through the elements on this page until we find a tag with contents matching `r` (the "Psalm xxx" header).
      - If not found on this page, return `None`, which makes step 2 continue to the next page.
   c. Once that header is found, construct a new `BeautifulSoup` instance from the remaining elements on that page.
      - I think (I no longer have the files) there was one `<p>` per line of text that I wanted.
      - Some lines contained musical score rather than text; I wasn't interested in those, and I believe that's what `r2` filtered out.
   d. Once we've reached the end of that "pdf page", return the newly constructed `BeautifulSoup` and the psalm number it belongs to.
3. If the page we've just parsed belongs to the same psalm as the previous page:
   a. Remove the first tag from the page, which is just the psalm number.
   b. Append the remaining contents of the new page to the `BeautifulSoup` instance containing the previous pages of the psalm.
   c. Repeat with the following "pdf pages" until we reach the next psalm.
4. Write out the `BeautifulSoup` instance to a file.
5. Continue the process for the next psalm.
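The grouping in those steps can be sketched with plain strings and the standard library alone, with no BeautifulSoup; `group_pages_by_psalm` and its sample pages are made up for illustration, and real pages would of course carry full markup:

```python
import re

HEADER = re.compile(r'PSALM ([0-9]+)', re.I)

def group_pages_by_psalm(pages):
    # Maps psalm number -> accumulated markup, built up page by page.
    psalms = {}
    current = None
    for page in pages:
        m = HEADER.search(page)
        if m:
            num = m.group(1)
            if num == current:
                # Continuation page: drop the repeated title, mirroring
                # nextPage.first().extract() in the gist.
                page = HEADER.sub('', page, count=1)
            else:
                current = num
                psalms[current] = ''
        if current is not None:
            psalms[current] += page
    return psalms

# Hypothetical input: Psalm 1 spans two PDF pages, Psalm 2 fits on one.
pages = ['<p>PSALM 1</p><p>verse 1</p>',
         '<p>PSALM 1</p><p>verse 2</p>',
         '<p>PSALM 2</p><p>verse 1</p>']
result = group_pages_by_psalm(pages)
print(sorted(result))  # ['1', '2']
```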
Some notes:
- This was an unoptimized process for a one-time job.
- The first page of each psalm would end up being parsed twice, because while looking for the next page of a psalm, this would fully parse each next page, but then discard the results if that page did not belong to the same psalm.
- This process was not perfect; I had to go through all the generated pages afterwards and do additional cleanup. This was in large part due to the way the PDF had originally been converted to HTML.
@smit245 Hope that gives some insight. It's been a long time since I wrote this code, and I don't think I have any of the documents that I was processing, so I can't give much more insight than this.