Are you sure you are getting HTML from the XML and not wikitext?
Assuming you are working with wikitext and want everything up to the first heading (and handwaving away things like a section created by a template), you could break at the first line matching /^(={2,6})[ \t]*(.+?)[ \t]*\1\s*$/m (see the function Parser::doHeadings below).
In practice, splitting at "\n==" will give you the right result on 99% of articles.
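As a rough sketch of the wikitext case, assuming Python and that you already have the page's wikitext in a string, something like the following should do. The helper name lead_section is made up for illustration, and the pattern only approximates what the MediaWiki parser does (it ignores headings produced by templates, for instance).

```python
import re

# A heading line: 2 to 6 '=' signs, the title, then the same run of '=' again.
# Simplified approximation of MediaWiki heading syntax, not the parser's exact logic.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1\s*$", re.MULTILINE)


def lead_section(wikitext: str) -> str:
    """Return everything before the first heading (hypothetical helper)."""
    match = HEADING_RE.search(wikitext)
    if match is None:
        # No heading at all: treat the whole page as the lead section.
        return wikitext
    return wikitext[:match.start()]
```

For example, lead_section("Intro text.\n\n== History ==\nMore.\n") returns "Intro text.\n\n".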
If the library is really giving you HTML, it's even easier: split the HTML at the first <h[1-6]>.
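A minimal sketch of the HTML case, again assuming Python and a hypothetical helper name (lead_html). It also accepts heading tags that carry attributes, such as <h2 id="...">, which is my assumption about what the rendered output may contain.

```python
import re

# Opening heading tag <h1> to <h6>, with or without attributes.
# The digit requirement keeps this from matching <html>, <head> or <hr>.
H_TAG_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)


def lead_html(html: str) -> str:
    """Return the HTML that precedes the first heading tag (hypothetical helper)."""
    match = H_TAG_RE.search(html)
    if match is None:
        # No heading tag found: return the input unchanged.
        return html
    return html[:match.start()]
```

This is plain string splitting, so the result may end in the middle of an enclosing element; if you need well-formed output, a real HTML parser would be the safer choice.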
Note that the wikitext will contain a lot of non-textual markup (templates, tables, wikitext formatting, references...) that you would need to clean up before applying your models.
However, other projects have done this in the past (sorry, I have no links to them), so I would either do only a very basic cleanup or reuse what others have made.
Best regards