There are different aspects of this to consider. Most natural language “engines” (called “dialog systems”) are text-based. Adding audio input and audio output each brings a further level of complexity, and adding an animated avatar system is yet another layer.
The short answer is that I know of no do-it-yourself kit for doing this on the web. Traditionally such bundled “agents” were primarily for desktop machines (leveraging the Microsoft Speech API, for instance). Generally, both video and audio support are browser-dependent, and the browser wars made it difficult to make things work across all platforms (much like the issues with making cross-platform mobile apps today). It was only fairly recently that Google’s initially undocumented Chrome-based “Web Speech API” became available to developers.
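To make the layering concrete, here is a rough TypeScript sketch of wrapping a text-based dialog engine with audio-in and audio-out through the Web Speech API. It is only an illustration under assumptions: it assumes the browser exposes `SpeechRecognition` (or the prefixed `webkitSpeechRecognition`), `speechSynthesis`, and `SpeechSynthesisUtterance`; the names `findRecognition`, `listenAndReply`, and the `reply` callback are made up for this example, with `reply` standing in for whatever dialog engine you use. It says nothing about how CallMom or Pannous do it.

```typescript
// Minimal structural type for the parts of the recognition object used here,
// so the sketch compiles without DOM typings.
interface RecognitionLike {
  lang: string;
  interimResults: boolean;
  onresult: ((event: any) => void) | null;
  onerror: ((event: any) => void) | null;
  start(): void;
}

type RecognitionCtor = new () => RecognitionLike;

// Feature detection: return the recognition constructor, or null if the
// browser offers none. Chrome has exposed it under the webkit prefix.
function findRecognition(scope: any): RecognitionCtor | null {
  return scope.SpeechRecognition ?? scope.webkitSpeechRecognition ?? null;
}

// Listen for one utterance, hand the transcript to a text-based dialog
// engine (the hypothetical `reply` callback), and speak the answer.
// Returns false when speech recognition is unavailable.
function listenAndReply(scope: any, reply: (text: string) => string): boolean {
  const Ctor = findRecognition(scope);
  if (Ctor === null) {
    return false;
  }
  const rec = new Ctor();
  rec.lang = "en-US";
  rec.interimResults = false; // deliver final results only
  rec.onresult = (event) => {
    // First alternative of the first result holds the recognized text.
    const transcript: string = event.results[0][0].transcript;
    const answer = reply(transcript);
    // Audio-out is a separate layer and may be missing on its own.
    if (scope.speechSynthesis && scope.SpeechSynthesisUtterance) {
      scope.speechSynthesis.speak(new scope.SpeechSynthesisUtterance(answer));
    }
  };
  rec.start();
  return true;
}
```

In a browser you would call something like `listenAndReply(window, text => "You said: " + text)` from a click handler, since browsers generally ask the user for microphone permission. Passing the global object in as a parameter keeps the feature detection explicit, which matters given how uneven browser support is. An avatar layer would sit on top of this and is not sketched here.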
Pandorabots CallMom™ is one DIY system that leverages the Google Speech API. I believe CallMom also makes use of Pannous “Voice Actions” (aka Jeannie Voice Actions); both the Voice Actions API (billed as “Siri as a service”) and the Jeannie API are available via Mashape. See also Pannous on GitHub.