I am trying to come up with an algorithm that identifies whether a string is part of an element's text content or part of its attributes.
For example:
<a class="tag tag-red-dark" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
If you run a regex for TEXTCONTENT or PARTOFATTRIBUTE, you can use this algorithm to check whether each match is part of the text or part of the attributes:
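using System.Text.RegularExpressions;

// html is assumed to be a string holding the markup shown above.
MatchCollection matches = Regex.Matches(html, @"(?i)TEXTCONTENT");
for (int i = matches.Count - 1; i >= 0; i--)
{
    Match m = matches[i];
    int currentIndex = m.Index;
    bool isTextContent = false;
    // Walk backwards from the match: reaching '>' first means text content,
    // reaching '<' first means we are inside a tag (attribute).
    // The currentIndex > 0 guard prevents running off the start of the string.
    while (currentIndex > 0 && html[currentIndex] != '<')
    {
        currentIndex--;
        if (html[currentIndex] == '>')
        {
            isTextContent = true;
            break;
        }
    }
    if (isTextContent)
    {
        // do something with text content
    }
    else
    {
        // do something with attribute
    }
}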
But the algorithm is fragile. If the HTML looks like this:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
PARTOFATTRIBUTE will be recognized as text, which it is not, because the backward scan hits the > inside title="a>b" before it reaches the opening <.
Moreover, the text itself could contain a <, which makes the algorithm think it has found an attribute:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> < Found TEXTCONTENT </a>
Placing < in text without escaping it is invalid HTML, but I would like to handle it anyway. Placing > in an attribute value, on the other hand, is valid. Is it possible to determine whether the matched string is part of the attributes or of the text content based solely on the context in which it appears?
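
To make the question concrete, here is a minimal sketch of the kind of context-based check I have in mind. It is not a verified solution. Instead of scanning backwards from the match, it scans forward from the start of the string and tracks two pieces of state: whether we are inside a tag, and whether we are inside a quoted attribute value. The names HtmlContext and IsInsideTag are made up for illustration. It rests on two assumptions: a < only opens a tag when it is immediately followed by an ASCII letter, /, ! or ?, and attribute values containing > are always quoted. Comments, script and style blocks, and CDATA are not handled.

```csharp
using System;

public static class HtmlContext
{
    // Hypothetical helper: returns true if the character at `index` lies inside
    // a tag (tag name, attribute name or attribute value), false if it lies in text.
    public static bool IsInsideTag(string html, int index)
    {
        bool inTag = false;  // between a tag-opening '<' and its closing '>'
        char quote = '\0';   // quote character of the current attribute value, or '\0'

        // Replay every character before `index` to learn the state at `index`.
        for (int i = 0; i < index && i < html.Length; i++)
        {
            char c = html[i];
            if (!inTag)
            {
                // Assumption: a stray '<' in text is not followed by a tag-start character.
                if (c == '<' && i + 1 < html.Length && IsTagStart(html[i + 1]))
                {
                    inTag = true;
                }
            }
            else if (quote != '\0')
            {
                // Inside a quoted value: only the matching quote ends it, so '>' is ignored.
                if (c == quote)
                {
                    quote = '\0';
                }
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;
            }
            else if (c == '>')
            {
                inTag = false;
            }
        }
        return inTag;
    }

    private static bool IsTagStart(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || c == '/' || c == '!' || c == '?';
    }
}
```

With the last example above, it would be called with the index of each regex match (m.Index in the loop earlier):

```csharp
string html = @"<a class=""tag tag-red-dark"" title=""a>b"" href=""/keywords?q=PARTOFATTRIBUTE""> < Found TEXTCONTENT </a>";
Console.WriteLine(HtmlContext.IsInsideTag(html, html.IndexOf("PARTOFATTRIBUTE"))); // True
Console.WriteLine(HtmlContext.IsInsideTag(html, html.IndexOf("TEXTCONTENT")));     // False
```

The cost is that every lookup rescans from the beginning of the string, so for many matches it would probably be better to do one pass and record the tag ranges.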