Channel: Active questions tagged html - Stack Overflow

Extracting HTML from outside the tag

I'm trying to extract the HTML that sits above and below a <table> tag. For example, given the HTML below:

sample_html = """<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
<table class="softwares" border="1" cellpadding="0" width="99%">
    <thead style="background-color: #ededed">
        <tr>
            <td colspan="5"><b>Windows</b></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><b>Type</b></td>
            <td><b>Issue</b></td>
            <td><b>Restart</b></td>
            <td><b>Severity</b></td>  
            <td><b>Impact</b></td>  
        </tr>
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Critical<br></td>
            <td>stuff<br></td>
        </tr>    
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Important<br></td>
            <td>stuff<br></td>    
        </tr>
    </tbody>
</table>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""

I would like to obtain something like:

top_html = """<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
</html>
"""

bottom_html = """<html>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""

Or already in text format, like:

top_html = 'Main Title more stuff in here!'

bottom_html = 'AGAIN more stuff down here!'

So far I've been able to extract the <table> part from the whole HTML and do my processing (I separate the rows <tr> and columns <td> so I can extract the values I need) with the following code:

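from bs4 import BeautifulSoup

soup = BeautifulSoup(sample_html, "html.parser")  # sample_html is the string defined above
table = soup.find('table')  # first <table> element in the document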
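
For reference, here is a minimal sketch of one possible way to get the two parts, under some loud assumptions. It does not use BeautifulSoup: it cuts the raw string around the first <table>...</table> with a regular expression, which only holds if there is a single, non-nested table. The helpers `split_around_table` and `strip_tags` are hypothetical names made up for this illustration, and `demo_html` is a shortened stand-in for `sample_html`.

```python
import re


def split_around_table(html):
    """Return (before, after): the markup on each side of the first table.

    Assumption: one non-nested <table>. A regex is not a general HTML parser.
    """
    match = re.search(r"<table\b.*?</table\s*>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        # No table found: everything counts as the top part.
        return html, ""
    return html[:match.start()], html[match.end():]


def strip_tags(fragment):
    """Crudely drop tags and collapse whitespace to get plain text."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(text.split())


# Shortened stand-in for sample_html (illustrative only).
demo_html = """<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
<table class="softwares"><tr><td>some item</td></tr></table>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""

top_part, bottom_part = split_around_table(demo_html)

# Re-add the <html> wrapper that the cut removed from each half.
top_html = top_part + "</html>\n"
bottom_html = "<html>" + bottom_part

top_text = strip_tags(top_part)
bottom_text = strip_tags(bottom_part)

print(top_text)     # Main Title more stuff in here!
print(bottom_text)  # AGAIN more stuff down here!
```

If the pieces should go back through a parser afterwards, each half could be passed to BeautifulSoup(top_html, "html.parser").get_text() instead of the crude `strip_tags` helper. For pages with several or nested tables, a parser-based approach would probably be safer than this string cut.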
