Channel: Active questions tagged html - Stack Overflow

Python HTML parsing of div data using bs4

I had a PDF from which I needed to extract the text, so I used Tika to parse it. Since Tika could not split the output page by page, I used Beautiful Soup for that (code snippet below). Now I want to remove the header and footer from each page of the HTML that Tika outputs. I have worked out that the header and footer appear as the last two lines of each div. Can anyone tell me how to extract all the data from a div except the last two lines, as in the sample below?

<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Third line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>line required 3
</p>
<p></p>
<p>line required 4
</p>
<p>line required 5
</p>
<p>line required 6
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>

My existing code is below:

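from bs4 import BeautifulSoup
from tika import parser

# xmlContent=True makes Tika return XHTML with one <div class="page"> per PDF page
raw = parser.from_file('pdfpath', xmlContent=True)
file_content = raw["content"]
soup = BeautifulSoup(file_content, 'html.parser')
for num, page in enumerate(soup.select('.page'), 1):
    # Flatten the whole page, header and footer included, into one string
    content = page.get_text(strip=True, separator=' ').replace("\n", " ")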
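
One possible approach, sketched under the assumption stated in the question (the header and footer are always the last two non-empty paragraphs of each page div): collect the text of every <p> inside a page, discard the empty ones, and slice off the final two entries. The function name page_texts and the parameter trailing_to_drop are hypothetical names chosen for illustration; they are not part of Tika or Beautiful Soup.

```python
from bs4 import BeautifulSoup


def page_texts(html, trailing_to_drop=2):
    """Return one string per div.page, omitting the last non-empty paragraphs.

    Assumes the header and footer are the final `trailing_to_drop`
    non-empty <p> elements of every page, as described in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for page in soup.select("div.page"):
        # Collapse internal whitespace (including newlines) in each paragraph
        lines = [" ".join(p.get_text().split()) for p in page.find_all("p")]
        # Ignore the empty <p></p> and <p /> elements
        lines = [line for line in lines if line]
        # Drop the trailing header/footer lines; keep everything if asked to drop 0
        kept = lines[:-trailing_to_drop] if trailing_to_drop else lines
        pages.append(" ".join(kept))
    return pages
```

With the Tika output from above this could be called as page_texts(file_content). A page with two or fewer non-empty paragraphs yields an empty string, so it may be worth checking how short pages (for example, a title page) look in the real document before relying on the fixed count.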
