This section gives an overview of the basic features of the
PetaMem LangSuite system.
Basic Features of the NLP-Core
The NLP-Core is the central component of our natural language processing
system. It provides the basic functionality and is required in every configuration.
The system is built as a distributed client/server architecture. It
runs on all relevant UNIX derivatives (Solaris, AIX, HP-UX, Linux) and scales
from a single workstation through Beowulf clusters up to modern
mainframe architectures. The internal encoding and all system-relevant parameters
are designed for processing arbitrary natural languages (including Arabic
and Asian writing systems).
Identification of the Language of a Text
A central feature of NLP systems is the robust classification of input
data. This includes correctly identifying the language of the text itself.
PetaMem uses no fewer than four methods to identify the language of
a given text. They range from fast statistical methods with broad language
coverage, through robust dictionary-based identification, up to deep
semantic analysis of the text.
These methods are applied as a cascade to ensure fast and
reliable operation. With deep semantic analysis, a German text that contains an
English citation much longer than the German text itself, along with
a short German excerpt of that citation, is classified as German and
not as English.
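The cascade idea can be sketched as follows. This is a minimal illustration and not PetaMem's implementation: the word lists, the `Guess` type, the `vote` and `identify` functions and the confidence threshold are all assumptions made for the example. Only two cheap stages are shown; a deep semantic stage would simply be a further, more expensive entry in the list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

# Tiny illustrative word lists (assumptions); real resources would be far larger.
STOPWORDS: Dict[str, Set[str]] = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
    "en": {"the", "and", "is", "not", "a", "of", "to", "in"},
}
LEXICON: Dict[str, Set[str]] = {
    "de": {"haus", "sprache", "zitat", "wetter"},
    "en": {"house", "language", "quote", "weather"},
}


@dataclass
class Guess:
    language: str
    confidence: float  # share of all word-list hits won by `language`, in 0..1


def vote(text: str, wordlists: Dict[str, Set[str]]) -> Optional[Guess]:
    """Count word-list hits per language; return None if nothing matched."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    hits = {lang: sum(w in wl for w in words) for lang, wl in wordlists.items()}
    total = sum(hits.values())
    if total == 0:
        return None
    best = max(hits, key=lambda lang: hits[lang])
    return Guess(best, hits[best] / total)


Stage = Callable[[str], Optional[Guess]]

# Stages ordered from cheapest to most expensive.
STAGES: List[Stage] = [
    lambda text: vote(text, STOPWORDS),  # fast stage with broad coverage
    lambda text: vote(text, LEXICON),    # slower dictionary-based stage
]


def identify(text: str, threshold: float = 0.8) -> Optional[Guess]:
    """Run the stages in order; stop at the first sufficiently confident guess."""
    fallback = None
    for stage in STAGES:
        guess = stage(text)
        if guess is None:
            continue
        if guess.confidence >= threshold:
            return guess
        fallback = guess  # remember the latest low-confidence guess
    return fallback
```

Calling `identify("Der Text ist nicht lang und die Sprache ist klar.")` is settled by the first stage alone, while an input without any stopwords falls through to the dictionary stage. Later stages only run when earlier ones are not confident enough, which is what keeps the common case fast.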
Communication Interfaces & Data Formats
The functionality of the central NLP components can be made accessible through
various interfaces. The following list gives an overview of the
currently available interfaces and the resulting application areas
of the whole system:
- Mail
- The mail interface connects to any SMTP-capable mail system
and manages an arbitrary number of mail peers
(folders). Incoming mail is accepted in all formats conforming to RFC 822, RFC 2822 and
RFC 1341. Beyond that, the system tries to process malformed emails as
robustly as possible. Accepted data formats include ASCII, DOC, RTF, UTF-8,
HTML, PDF, PS and others.
Reply mails are generated strictly according to the RFCs with the help
of the mail system installed on the host.
In addition, the mail interface provides efficient archiving of and access to
previous correspondence by various criteria, which allows
for a continuous discourse based on the previous dialogue history.
- Web
- The web interface is likewise designed for bidirectional data streams.
Input takes place via classical HTML forms, JavaScript or Java; outgoing
data is served as compliant HTML by a web server.
- GUI/CLI
- Detailed documentation of the APIs allows access to the system functionality
from the command line as well as from graphical user interfaces. This makes
a broad spectrum of applications possible, from the integration of
powerful spell checkers and thesauri up to human-quality machine translation.
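The mail entry in the list above mentions tolerant handling of malformed emails. As a rough illustration of what that can look like, the sketch below uses Python's standard `email` package, whose parser records problems in a `defects` list instead of raising. The helper name `extract_text` and the sample messages are assumptions for this example; nothing here reflects how LangSuite itself parses mail.

```python
from email import message_from_string, policy
from typing import Tuple


def extract_text(raw: str) -> Tuple[str, list]:
    """Parse a raw message and return (body text, defects noted by the parser)."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer a plain-text body, fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return text.strip(), list(msg.defects)


WELL_FORMED = (
    "From: alice@example.org\n"
    "To: nlp@example.org\n"
    "Subject: Frage\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Wie ist das Wetter?\n"
)

# No headers at all: not a valid RFC 2822 message, but the text is still usable.
MALFORMED = "Wie ist das Wetter?\n"

if __name__ == "__main__":
    for raw in (WELL_FORMED, MALFORMED):
        text, defects = extract_text(raw)
        print(repr(text), [type(d).__name__ for d in defects])
```

Both inputs yield the same body text, so downstream language processing can proceed either way, while the defect list remains available for logging or diagnostics.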
For all available interfaces, the underlying control logic allows
a matrix-like assignment of the data streams, so all combinations are possible:
for example, web input generating an appropriate reply email, or incoming emails
updating web content accordingly.
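One way to picture such a matrix-like assignment is a table keyed by (input interface, output interface). The sketch below is purely illustrative: the `StreamRouter` class, its method names and the interface labels `"web"` and `"mail"` are assumptions for the example, not part of the LangSuite API.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str], str]


class StreamRouter:
    """Maps (source interface, sink interface) pairs to handlers."""

    def __init__(self) -> None:
        self._routes: Dict[Tuple[str, str], Handler] = {}

    def connect(self, source: str, sink: str, handler: Handler) -> None:
        """Fill one cell of the routing matrix."""
        self._routes[(source, sink)] = handler

    def dispatch(self, source: str, payload: str) -> Dict[str, str]:
        """Send the payload to every sink connected to `source`."""
        return {
            sink: handler(payload)
            for (src, sink), handler in self._routes.items()
            if src == source
        }


router = StreamRouter()
# Web input that produces a reply email.
router.connect("web", "mail", lambda text: f"Reply mail for: {text}")
# Incoming email that updates web content.
router.connect("mail", "web", lambda text: f"<p>{text}</p>")
# Web input that is also answered directly on the web.
router.connect("web", "web", lambda text: f"<p>Received: {text}</p>")
```

With these routes, `router.dispatch("web", "hello")` yields one result for the mail sink and one for the web sink, and `router.dispatch("mail", "hello")` yields a single web result. Any cell of the matrix can be filled or left empty independently of the others.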
The system uses Unicode natively and is thus compatible with all national character
encoding systems. Conversions to and from specific encodings are performed as
required.
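The general pattern of converting at the boundaries and working in Unicode internally can be sketched in a few lines. The function names are hypothetical, and the encodings shown (ISO-8859-1, Windows-1256, Shift_JIS) are only examples of national encodings; the source does not say which ones LangSuite supports.

```python
def to_internal(data: bytes, encoding: str) -> str:
    """Decode bytes in a national encoding into a Unicode string."""
    return data.decode(encoding)


def from_internal(text: str, encoding: str) -> bytes:
    """Encode a Unicode string for a consumer that expects `encoding`."""
    return text.encode(encoding)


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two encodings by going through Unicode."""
    return from_internal(to_internal(data, source), target)


if __name__ == "__main__":
    latin1_bytes = b"Gr\xf6\xdfe"  # "Größe" in ISO-8859-1
    print(to_internal(latin1_bytes, "iso-8859-1"))
    print(transcode(latin1_bytes, "iso-8859-1", "utf-8"))
```

Because every conversion passes through Unicode, supporting an additional encoding requires no changes to the processing logic itself.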
Summary Technical Data
Hardware & Architecture |
Operating System | Solaris, AIX, HP-UX, Linux |
Hardware | UltraSparc64, x86, PA-RISC, p-series |
Clustering | Beowulf-Cluster |
Architecture | Distributed Client/Server |