I'm currently working on a text-mining/semantic web application, developed in Java and focused (for the moment) on biomedical information.
We use external tools for the text-mining analysis and unfortunately these tools don't handle HTML well.
If we send raw HTML to the text-mining service, it simply breaks.
So we must convert the HTML to plain text before processing it. Because the tools report each identified word by its position, we must then translate these positions (or indexes) back to find the corresponding word in the original HTML.
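To make the idea concrete, here is a minimal sketch of one way to do such a translation. It is not the 'IndexTranslator' class from the gist: the class name `HtmlOffsetMap` and its methods are hypothetical, and it assumes a deliberately naive notion of HTML where everything between '<' and '>' is a tag. Entities, comments, and script/style content are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not the gist's IndexTranslator): strips tags and
// remembers, for every plain-text character, where it came from in the HTML.
final class HtmlOffsetMap {
    private final StringBuilder plain = new StringBuilder();
    private final List<Integer> offsets = new ArrayList<>();

    HtmlOffsetMap(String html) {
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // assumption: '<' always opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;           // end of the current tag
            } else if (!inTag) {
                plain.append(c);         // keep the visible character...
                offsets.add(i);          // ...and record its HTML position
            }
        }
    }

    // Plain text to send to the text-mining service.
    String plainText() {
        return plain.toString();
    }

    // Translates a plain-text index back to the index in the original HTML.
    int toHtmlIndex(int plainIndex) {
        return offsets.get(plainIndex);
    }
}
```

For example, with the HTML `<p>Take <b>aspirin</b> daily</p>`, the plain text is `Take aspirin daily`. If the service reports a word starting at plain-text index 5, `toHtmlIndex(5)` gives 11, which is where `aspirin` starts in the HTML. Storing one offset per character costs memory, but it keeps the lookup trivial and easy to maintain.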
Bring Your Code
I've searched the web but didn't find any code that I could reuse, so I created a simple 'IndexTranslator' class.
I'm sure that my implementation of this algorithm isn't the best (I was a bit under pressure due to a production deployment), so I'm looking for a better one.
Keep in mind that this code is used in an enterprise application and must be relatively easy to maintain (even if that makes it a bit less efficient).
I created a gist with the implementation and the corresponding JUnit test case, so you can fork the code and check that it works as I expect.
Let's fork!
By the way, if you know of a library that implements this, I'm ready to throw this code away and use the library instead!