I had to extract the text from a PDF, so I parsed it with Tika. Since Tika could not do the parsing page by page, I used Beautiful Soup to split the output into pages (code snippet below). Now I want to remove the header and footer from the HTML that Tika outputs. I have found that the header and footer appear as the last two lines of each div. Can anyone tell me how to extract all the data from a div except the last two lines, given HTML like this:
<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Third line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>line required 3
</p>
<p></p>
<p>line required 4
</p>
<p>line required 5
</p>
<p>line required 6
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>
My existing code is below:
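from bs4 import BeautifulSoup
from tika import parser

# xmlContent=True returns XHTML, here with one <div class="page"> per page
raw = parser.from_file('pdfpath', xmlContent=True)
file_content = raw["content"]
soup = BeautifulSoup(file_content, 'html.parser')
for num, page in enumerate(soup.select('.page'), 1):
    # Flatten each page to a single line of text
    content = page.get_text(strip=True, separator=' ').replace("\n", " ")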
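
One possible direction, as a minimal sketch rather than a verified solution: collect the text of each `<p>` inside a page div, skip the empty ones, and drop the last two that remain. The helper names `drop_trailing_lines` and `page_bodies` are hypothetical, and the sketch assumes the header and footer are always the last two non-empty paragraphs of each page, as in the sample above.

```python
def drop_trailing_lines(lines, n=2):
    """Return the non-empty lines, minus the last n.

    Assumption: the last n non-empty lines are the header and footer.
    """
    kept = [line.strip() for line in lines if line.strip()]
    return kept[:-n] if n > 0 else kept


def page_bodies(html, n=2):
    """Return one string per <div class="page">, without its last n lines."""
    # Imported here so drop_trailing_lines stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    bodies = []
    for page in soup.select(".page"):
        # One entry per <p>; empty paragraphs are filtered out by the helper.
        lines = [p.get_text(separator=" ", strip=True) for p in page.find_all("p")]
        bodies.append(" ".join(drop_trailing_lines(lines, n)))
    return bodies
```

With this, `page_bodies(file_content)` would replace the loop over `soup.select('.page')`. If a page has fewer than three non-empty paragraphs, the sketch returns an empty string for it, which may or may not be what is wanted.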