Marcus L Endicott: XML

04 September 2013

Dissecting the Summarization Process

This is in effect a mid-2013 progress update. As with many of my blog posts, this is as much a status update for me to get a better handle on where I'm at as it is to broadcast my progress.

mendicott.com is a blog reflecting on my journey with the overall project. This blog started seven years ago, in 2006, with my inquiry into The difference between a web page and a blog.... I had then returned from something like five years of world travel to find the digerati fawning over the blogosphere. At first, I failed to see the difference between a blog and a content management system (CMS) for stock standard web pages. Upon closer examination, I began to realize that the real difference lay in the XML syndication of blog feeds into the real-time web.

meta-guide.com is an attempt to blueprint, or tutorialize, the process. My original Meta Guide 1.0 development in ASP attempted to create automated, or robotic, web pages based on XML feeds from the real-time web. Meta Guide 2.0 development was based on similar feed bots, or Twitter bots, in an attempt to automate, or at least semi-automate, the rapid development of large knowledgebases from social media via knowledge silos. Basically, I use knowledge templates to automatically create the knowledge silos, or large knowledgebases. The knowledge templates are based on my own, proprietary "taxonomies", or more precisely faceted classifications, painstakingly developed over many years.

gaiapassage.com aims to be an automated, or semi-automated, summarization of the knowledge aggregated from social media by feed bots via the proprietary faceted classifications, or knowledge templates. Right now, I'm doing a semi-automated summarization process with Gaia Passage: the research is automated, in the form of knowledge silos being "massaged" in different ways, but ultimately I write the summarization manually in natural language. This is allowing me to analyze and attempt to dissect the processes involved in order to gradually prototype automation. Summarization technologies, and in particular summarization APIs, are still in their infancy. Examples of currently available summarization technologies include automatedinsights.com and narrativescience.com. The overall field is often referred to as automatic summarization.

In the future, the Gaia Passage human-readable summarizations will need to be converted into a machine-readable dialog system knowledgebase format. The dialog system is basically a chatbot, or conversational user interface (CUI), into a specialized database called a knowledgebase. Most common chatbot knowledgebases are based on, or compatible with, XML; AIML is one example. Voice technologies, both output and input, are generally an additional layer on top of the text-based dialog system.

The two main bottlenecks I've come up against are, first, what I like to call artificial intelligence middleware, or frameworks, the "glue" to integrate the various processes, and second, adequate dialog system tools, in particular chatbot knowledgebase tools with both "frontend" and "backend" APIs (application programming interfaces). In other words, I need a dialog system API on the frontend with a backend API into the knowledgebase for dynamic modification. My favorite cloud based "middleware" is Yahoo! Pipes, which is generally referred to as a mashup platform (aka mashup enabler) for feed based data; however, there are severe performance issues with Yahoo! Pipes, so I don't really consider it to be a production ready tool. Like Yahoo! Pipes, my ideal visual, cloud based AI middleware could or should be language agnostic, eliminating the need to decide on a single programming language for a project. I have also looked into scientific computing packages, such as LabVIEW, Mathematica, and MATLAB, for use as potential AI middleware. Additionally, there are a variety of both natural language and intelligent agent frameworks available. Business oriented cloud based integration, including visual cloud based middleware, is often referred to as iPaaS (integration Platform as a Service), integration PaaS or "Integration as a Service".

Twitter's recent closure of its previously open API behind OAuth has set my feed bot, or "smart feed", development back by years. Right now, I'm stuck trying to figure out the best way, if any, to use the new Twitter OAuth with Yahoo! Pipes, for instance via YQL. And if that were not enough, verbotsonline.com, the affordable and user-friendly dialog system API that I was using, went out of business. There are a number of dialog system API alternatives, even cloud based dialog systems, but they are neither free nor cheap, especially for significant throughput volumes. Still to do: 1) complete the Gaia Passage summarizations, 2) make Twitter OAuth work, use a commercial third party data source (such as datasift.com, gnip.com or topsy.com), or abandon Twitter as a primary source (for instance, concentrating on other social media APIs instead, such as Facebook), 3) continue the search for a new and better dialog system API provider.

Most basically, the Gaia Passage project is a network of robots that will not only monitor social media buzz about both the environment and tourism but also interpret the inter-relations, causes and effects, between environment and tourism -- such as how climate change affects the tourism industry, negatively or positively, or even what effects the weather has on crime trends for a particular destination -- and will also allow these interpreted inter-relations, or "conclusions", to be queried via natural language. If this can be accomplished with any degree of satisfaction, either fully automated or semi-automated, then the system could just as easily be applied to any other vertical. Proposals from potential sponsors, investors, or technology partners are welcomed, and may be sent to mendicot [at] yahoo.com.

20 January 2008

Corpus linguistics & Concgramming in Verbots and Pandorabots

One of the definitions of semantic, as in Semantic Web or Web 3.0, is the property of language pertaining to meaning, meaning being significance arising from relation, for instance the relation of words. I don’t recall hearing about corpus linguistics before deciding to animate my book and make it interactive. Apparently there has been a long history of corpus linguistics trying to derive rules from natural language, such as the work of George Kingsley Zipf. As someone with a degree in psychology, I do know something of cognitive linguistics and its reaction to the machine mind paradigm.

The man who coined the term called artificial intelligence “the science and engineering of making intelligent machines,” which today is referred to as “the study and design of intelligent agents.” Wikipedia defines intelligence as “a property of the mind that encompasses… the capacities to reason, to plan, to solve problems, to think abstractly, to comprehend ideas, to use language, and to learn.” Computational linguistics has emerged as an interdisciplinary field involved with “the statistical and/or rule-based modeling of natural language.”

In publishing, a concordance is an alphabetical list of the main words used in a text, along with their immediate contexts or relations. Concordances are frequently used in linguistics to study the body of a text. A concordancer is the program that constructs a concordance. In corpus linguistics, concordancers are used to retrieve sorted lists from a corpus or text. Concordancers that I looked at included AntConc, ConcordanceSoftware, WordSmith Tools and ConcApp. I found ConcApp and in particular the additional program ConcGram to be most interesting. (Examples of web based concordancers include KWICFinder.com and the WebAsCorpus.org Web Concordancer.)
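To make the idea concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how any of the tools named above work internally; the function name `kwic`, the `width` parameter and the sample text are all illustrative assumptions, and it lists hits in text order instead of building a full alphabetical concordance.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    words = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

# Hypothetical sample text, not taken from any real corpus.
sample = "The dog chased the cat. The cat ignored the dog entirely."
for line in kwic(sample, "cat", width=2):
    print(line)
```

Sorting the hits by keyword, or by the word to the left or right, would give the sorted lists that a concordancer typically returns.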

Concgramming is a new computer-based method for categorising word relations and deriving the phraseological profile or ‘aboutness’ of a text or corpus. A concgram constitutes all of the permutations generated by the association of two or more words, revealing all of the word association patterns that exist in a corpus. Concgrams are used by language learners and teachers to study the importance of the phraseological tendency in language.

I was in fact successful in stripping out all the sentences from my latest book, VAGABOND GLOBETROTTING 3, by simply reformatting them as paragraphs with MSWord. I then saved them as a CSV file, actually just a text file with one sentence per line. I was able to make a little utility which ran all those sentences through the Yahoo! Term Extraction API, extracting key terms and associating those terms with their sentences in the form of XML output, with the terms as the title and the sentences as the description. Using the great XSLT editor xsl-easy.com, I could convert that XML output quickly and easily into AIML with a simple template.
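The last step, going from term and sentence pairs to AIML, can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's `xml.etree.ElementTree` in place of the XSLT template, it does not call the Yahoo! Term Extraction API at all, and the function name `to_aiml` and the sample pairs are hypothetical, not taken from the book.

```python
import xml.etree.ElementTree as ET

def to_aiml(pairs):
    """Build an AIML document from (term, sentence) pairs.

    Each term becomes an upper-cased <pattern>; its sentence becomes the <template>.
    """
    root = ET.Element("aiml", version="1.0.1")
    for term, sentence in pairs:
        category = ET.SubElement(root, "category")
        ET.SubElement(category, "pattern").text = term.upper()
        ET.SubElement(category, "template").text = sentence
    return ET.tostring(root, encoding="unicode")

# Hypothetical extracted terms paired with made-up source sentences.
pairs = [
    ("travel insurance", "Buy travel insurance before you leave."),
    ("youth hostels", "Youth hostels are the cheapest beds in most cities."),
]
print(to_aiml(pairs))
```

Each pair maps to exactly one category, so a term only triggers its sentence when the user types that exact term.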

The problem I encountered was that all those key terms extracted from my book sentences, when tested, formed something like second-level knowledge that you couldn’t get out of the chatbot unless you already knew the subject matter…. So I then decided to try adding the concgrams to see if that could bridge the gap. I had to get someone to create a special program to marry the two-word concgrams from the entire book (minus the 100 most common words in English) to their sentences in a form I could use.
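A rough Python sketch of that kind of pairing is below. It is not the special program mentioned above, and it simplifies concgrams to unordered two-word co-occurrences within a single sentence. The tiny `STOP_WORDS` set stands in for the 100 most common English words, and the function name `concgram_index` and the sample sentences are assumptions for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Tiny stand-in stop list; a real run would use the 100 most common English words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "it"}

def concgram_index(sentences):
    """Map each unordered two-word co-occurrence to the sentences containing it."""
    index = defaultdict(list)
    for sentence in sentences:
        words = sorted({w for w in re.findall(r"[a-z']+", sentence.lower())
                        if w not in STOP_WORDS})
        for pair in combinations(words, 2):
            index[pair].append(sentence)
    return index

# Hypothetical sample sentences.
sentences = ["The dog chased the cat.", "A cat is not a dog."]
index = concgram_index(sentences)
print(index[("cat", "dog")])
```

Because each pair is stored in alphabetical order, "dog ... cat" and "cat ... dog" land under the same key, which is the point of treating the association as a concgram instead of a fixed phrase.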

It was only then that I began to discover some underlying differences between the verbotsonline.com and pandorabots.com chatbot engine platforms. I've been using verbotsonline because it seemed easier and cheaper than adding a mediasemantics.com character to the pandorabot. However, there is a 2.5 Meg limit with verbotsonline knowledgebases, which I've reached three times already. Also, verbotsonline.com does not seem to accept multiple identical patterns with different templates, or at least the AIML-Verbot Converter apparently removes the “duplicate” categories.

In verbots, spaces automatically match to zero or more words, so wildcards are only necessary to match partial words. This means in verbots words are automatically wildcarded, which makes it much easier to achieve matches with verbots. So far, I have been unable to replicate this simple system with AIML, which makes AIML more precise or controllable, but perhaps less versatile, at least in this case. Even with the AIML knowledgebase replicated eight times with the following patterns, I could not reproduce in pandorabots the results that verbots achieves with one file, wildcarding on all words in a phrase or term.

dog cat
dog cat *
_ dog cat
_ dog cat *

dog * cat
dog * cat *
_ dog * cat
_ dog * cat *
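The eight variants listed above follow a regular scheme, so they can be generated for any term. Here is a minimal Python sketch; the function name `wildcard_variants` is an assumption, and the output only reproduces the pattern text, not any engine's matching behavior.

```python
def wildcard_variants(words):
    """Return the eight pattern variants for a list of words, in the order listed above."""
    variants = []
    # First the plain phrase, then the phrase with * between every word.
    for core in (" ".join(words), " * ".join(words)):
        variants.extend([core, f"{core} *", f"_ {core}", f"_ {core} *"])
    return variants

for pattern in wildcard_variants(["dog", "cat"]):
    print(pattern)
```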

The problem I encountered with AIML when trying to “star” all words was that a star at the beginning of a pattern accepted only one word, not several, and replacing it with the underscore apparently affects pattern prioritization. So there I am at the moment, stuck between verbots and pandorabots, not being able to do what I want with either: verbotsonline for lack of capacity and inability to convert “duplicate” categories into VKB, and pandorabots for inability to conform to my fully wildcarded spectral word association strategy….
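For comparison, the verbots behavior described earlier, where every space can match zero or more words, can be approximated with a regular expression. This is only my reading of that description expressed in Python, not the Verbot engine's actual algorithm; the function name `verbot_style_regex` is hypothetical, and allowing extra words before and after the phrase is an assumption.

```python
import re

def verbot_style_regex(phrase):
    """Compile a phrase so each space allows zero or more intervening words."""
    gap = r"\s+(?:\S+\s+)*?"  # whitespace, then optionally any number of whole words
    body = gap.join(re.escape(word) for word in phrase.split())
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

matcher = verbot_style_regex("dog cat")
print(bool(matcher.search("my dog chased the neighbour's cat today")))
print(bool(matcher.search("cat and dog")))
```

Word order still matters in this sketch: "dog" must come before "cat" for the phrase to match.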

08 January 2008

Books, metadata and chatbots… in search of the XML Rosetta Stone

I am an author and I build chatbots (aka chatterbots). A chatbot is a conversational agent, driven by a knowledgebase. I am currently trying to understand the best way to convert a book into a chatbot knowledgebase.

A knowledgebase is a form of database, and the chatbot is actually a type of search… an anthropomorphic form of search and therefore an ergonomic form of search. This simple fact is usually shrouded by the jargon of “natural language processing”, which may or may not be actual voice input or output.

According to the ruling precepts of the “Turing test”, chatbots must be as close as possible to conversational, and this is what differentiates them from pure “search”…. With chatbots there is a significant element of “smoke and mirrors” involved, which introduces the human psychological element into the machine in the form of cultural, linguistic and thematic assumptions and expectations, so becoming in a sense a sort of “mind game”.

I’m actually approaching this from two directions. I would also like to be able to feed RSS into a chatbot knowledgebase. There is currently no working example of this available. Parsing RSS into AIML (Artificial Intelligence Markup Language), the most common chatbot dialect, is problematic and yet to be cracked effectively. So, my thinking arrived at somehow breaking a book into a form that resembles RSS. The Wikipedia List of XML markup languages revealed a number of attempts to add metadata to books.

Dr. Wallace, the originator of AIML, recently responded on the pandorabots-general group that using RSS title fields would usually be too specific to make them useful as chatbot concept triggers. However, I believe utilities such as the Yahoo! Term Extraction API could be used to create tags for feed items, which might then prove more useful when mapped to AIML patterns….
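A toy version of that idea might look like the Python sketch below. It is an untested-in-practice illustration, not a working RSS-to-AIML converter: the crude word-frequency function `naive_tags` merely stands in for a real term extraction service, and the feed content, stop list and function names are all assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Small illustrative stop list, not a standard one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def naive_tags(text, limit=2):
    """Return the most frequent non-stop words as stand-in tags."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(limit)]

def feed_to_aiml(rss_text):
    """Emit one AIML category per (tag, item) pair found in an RSS 2.0 feed."""
    root = ET.Element("aiml", version="1.0.1")
    for item in ET.fromstring(rss_text).iter("item"):
        title = item.findtext("title", "")
        description = item.findtext("description", "")
        for tag in naive_tags(f"{title} {description}"):
            category = ET.SubElement(root, "category")
            ET.SubElement(category, "pattern").text = tag.upper()
            ET.SubElement(category, "template").text = description
    return ET.tostring(root, encoding="unicode")

# Hypothetical one-item feed.
RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Reef tourism</title>
<description>Reef tourism grows as reef tours expand.</description></item>
</channel></rss>"""
print(feed_to_aiml(RSS))
```

The tags are shorter and more general than the item title, which is the property that might make them usable as concept triggers.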

My supposition is that a *good* book index is in effect a “taxonomy” of that book. Paragraphs would generally be too large to meet the specialized “conversational” needs of a chatbot. The results of a conventional concordance would be too general to be useful in a chatbot…. If RSS as we know it is currently too specific to function effectively in a chatbot, what if that index were mapped back to the referring sentences as “tags”, somewhat like RSS?

I figure that if you can relatively quickly break a book down into a sentence “concordance”, you could then point that at something like the Yahoo! Term Extraction API to quickly generate relevant keywords (or “tags”) for each sentence, which could then be used in AIML as triggers for those sentences in a chatbot…. Is there such a beast as a “sentence parser” for a corpus such as a common book? All I want to do at this point is strip out all the sentences and line them up, as a conventional concordance does with individual words.
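As a starting point, a very naive sentence splitter can be written with one regular expression. This Python sketch assumes sentences end in a period, question mark or exclamation mark followed by whitespace, so it will wrongly split after abbreviations such as "Dr."; the function name `split_sentences` and the sample text are illustrative only.

```python
import re

def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical sample text.
text = "I build chatbots. Can a book become one? Perhaps!"
for sentence in split_sentences(text):
    print(sentence)
```

Printing one sentence per line gives the lined-up "sentence concordance" form described above.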

There are a number of examples of desktop chatbots using proprietary Windows speech recognition today; however, to my knowledge there are currently no chatbots available online or via VoIP that accept voice input (*not* IM or IRC bots)…. So, I’ve also spent some time lately looking into voiceXML (VXML), ccXML and the Voxeo callXML, as well as the Speech Recognition Grammar Specification (SRGS) and the mythical voice browser…. The only thing I could find that actually accepts voice input online for processing is Midomi.com, which accepts voice input in the form of a hummed tune for tune recognition…. Apparently goog411, which is basically interactive voice response (IVR) rather than true speech recognition, is as close as it gets to a practical hybrid online/offline voice search application at this time. So, what if Google could talk?

22 December 2007

I'm dreaming of RSS in => AIML out

I am still trying to get my head around the relationship between chatbots and the Semantic Web, or Web 3.0.... Any thoughts or comments on the precise nature of this relationship are welcome.

Converting from VKB back into AIML was my first crash course in working with XML dialects.... Since then the old lightbulb has gone off, or rather "on" I should say, and it suddenly dawned on me that the whole hullabaloo about Web 2.0 largely centers on the exchange of metadata, most often in the form of RSS, another XML dialect.

I was really stoked to learn of the work of Eric Freese, apparently processing logic using the Jena framework then manually(?) converting that RDF into AIML; however, I continue to wait for word of his "Semetag/AIMEE" example at http://www.semetag.com .

My understanding is that it is quite do-able, as in off the shelf, to pull RSS into a database and accumulate it there.... Could such a database of RSS not be used as a potential knowledgebase for a chatbot?
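A minimal sketch of that accumulation step, using only Python's standard library, is shown below. The table layout, the function names `store_items` and `lookup`, and the sample feed are assumptions for illustration; a real setup would fetch feeds over HTTP and would need a better key than the item title.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical two-item feed.
RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Coral reefs</title><description>Reefs draw divers.</description></item>
<item><title>Rail passes</title><description>Passes cut travel costs.</description></item>
</channel></rss>"""

def store_items(conn, rss_text):
    """Insert each RSS <item> title and description into the items table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (title TEXT PRIMARY KEY, description TEXT)"
    )
    for item in ET.fromstring(rss_text).iter("item"):
        conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?, ?)",
            (item.findtext("title", ""), item.findtext("description", "")),
        )
    conn.commit()

def lookup(conn, keyword):
    """Return descriptions of stored items whose title contains the keyword."""
    rows = conn.execute(
        "SELECT description FROM items WHERE title LIKE ?", (f"%{keyword}%",)
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
store_items(conn, RSS)
store_items(conn, RSS)  # re-reading the same feed adds no duplicates
print(lookup(conn, "reef"))
```

The `lookup` function is the crudest possible "knowledgebase" query: a keyword in the user's input selects a stored description as the reply.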

The missing element seems to be the processing, or DL Reasoner(?).... I have been unable to find any reference to such a web-based, modular DL Reasoner yet....

http://www.knoodl.com seems to be the closest thing to a "Web 2.0-style" collaborative ontology editor, which is fine for creating ontologies collectively, but it falls short of meeting the processing requirement.

In short, I'm dreaming of RSS in => AIML out. At this point I would be happy with a "toy" or abbreviated system just to begin playing around with all this affordably (not least time-wise). So it seems what's still needed is a simple, plug and play "Web 2.0-style" (or is that "Web 3.0" style?) web-based DL Reasoner that accepts common OWL ontologies, then automagically goes from RDF into AIML....