PetaMem Scripting Environment (PMSE) is a software suite that allows you to perform virtually
any task related to corpus linguistics.
PMSE is designed as a comprehensive toolchain that gives the user a generic way of working with
text corpora, from the acquisition of data through any modification (such as format conversion)
to thorough statistical analysis and data visualization.
The intended audience for PMSE is universities, scientific institutes and research organizations.
Because of its thorough support for data formats, it can also serve as middleware between the
various existing tools and software, often enabling their interoperability. PMSE is designed
to be language agnostic and will support the languages you use.
The software is thoroughly documented, including a reference and tutorials based on real use cases of PMSE scripts.
Furthermore, the mathematical characteristics of the applied distance metrics are included.
You can download the PMSE documentation in PDF format here.
Functions & Features
PMSE is a functionally rich corpus tool for web scraping, file conversion and text cleaning. It also
contains methods for advanced statistical and linguistic analysis, including a generic tokenizer,
full UTF-8 support, a concordancer, a co-occurrence extractor, a tool for keyword analysis and a tool for
computing distances between pairs of n-grams. PMSE can also perform text categorization
based on hierarchical cluster analysis. The focus has been on generic, comprehensive and versatile
functionality; as a result, MI-score and T-score can be computed for n-grams of unrestricted length.
Besides using GraphViz, a deeper use of a bridge from Perl to
R is currently being tested.
For an example of use, take a look at the brief walkthrough.
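To make the concordancer feature more concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not PMSE code: the function name `kwic`, the `width` parameter and the simple word tokenizer are assumptions made for illustration only.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Hypothetical helper, not part of PMSE. Tokens are lowercased
    Unicode word sequences; context is `width` tokens on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat too."
    for left, word, right in kwic(sample, "sat", width=2):
        print(f"{left:>10} [{word}] {right}")
```

A real concordancer would typically work on an indexed corpus rather than a single string, but the windowing idea stays the same.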
The most effective tool for text cleanup is P_rer (Regular Expression Replacer), which
lets the user quickly edit texts from the CLI. If you know what you need to remove or replace,
you can specify the task as a Perl regular expression
and P_rer will apply it to the file. Several tasks may be specified via an INI file.
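The idea of running several regex replacement tasks from an INI file can be sketched as follows. This is an illustration in Python, not P_rer itself: the INI layout (one section per task with `pattern` and `replacement` keys) is an assumption, and Python's `re` syntax differs from Perl's in some details.

```python
import configparser
import re

# Hypothetical INI layout, NOT the actual P_rer format:
# one section per task, applied in file order.
EXAMPLE_INI = r"""
[strip_tags]
pattern = <[^>]+>
replacement =

[iso_dates]
pattern = (\d{2})/(\d{2})/(\d{4})
replacement = \3-\2-\1
"""


def load_tasks(ini_text):
    """Parse the INI text into a list of (name, compiled pattern, replacement)."""
    # interpolation=None keeps '%' and backslashes in patterns untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    tasks = []
    for name in parser.sections():
        pattern = re.compile(parser[name]["pattern"])
        replacement = parser[name].get("replacement", "")
        tasks.append((name, pattern, replacement))
    return tasks


def apply_tasks(text, tasks):
    """Apply every replacement task to the text, in order."""
    for _name, pattern, replacement in tasks:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    tasks = load_tasks(EXAMPLE_INI)
    print(apply_tasks("<p>Published 24/12/2009</p>", tasks))
```

Keeping the tasks in a file rather than on the command line makes a cleanup run repeatable across many corpus files.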
When the text is prepared, P_csp (Comprehensive Statistics Processor) and P_gnp (Generic N-grams Processor)
extract the data. P_csp extracts frequencies of occurrence,
probabilities of occurrence, counts of occurrence and histograms of these values for unigrams. P_gnp extracts similar information
for N-grams. Furthermore, P_gnp computes MI-score and
T-score for arbitrarily long N-grams. P_gnp also computes a rank, and
P_dmp computes the distance between two N-grams. Around 20 distance metrics
(including the well-known Euclidean and
Mahalanobis distances) are currently implemented.
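As a rough illustration of these measures, the sketch below computes MI-score and T-score for n-grams of any length using the common corpus-linguistics definitions (observed frequency against the frequency expected if the words were independent), plus a plain Euclidean distance. The exact formulas used by P_gnp and the feature vectors compared by P_dmp are not stated here, so both are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def association_scores(tokens, n):
    """Map each n-gram to (MI-score, T-score).

    Assumed definitions (not confirmed as PMSE's):
      expected = N * product(f(w_i) / N)
      MI = log2(observed / expected)
      T  = (observed - expected) / sqrt(observed)
    """
    total = len(tokens)
    unigram_freq = Counter(tokens)
    ngram_freq = Counter(ngrams(tokens, n))
    scores = {}
    for gram, observed in ngram_freq.items():
        expected = total
        for word in gram:
            expected *= unigram_freq[word] / total
        mi = math.log2(observed / expected)
        t = (observed - expected) / math.sqrt(observed)
        scores[gram] = (mi, t)
    return scores


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    tokens = "a b a b c".split()
    for gram, (mi, t) in association_scores(tokens, 2).items():
        print(gram, round(mi, 3), round(t, 3))
    print(euclidean((0, 0), (3, 4)))
```

Because the expected frequency is a product over all words of the n-gram, the same function works for bigrams, trigrams and longer sequences without change.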
Acquired data is processed by P_dvf (Data Visualization Framework). P_dvf can convert data to various formats
(YAML, printed Perl data structure, Storable,
'pure text'). Data may be easily sorted, filtered and inspected.
The developers are currently working on automated visualization of the main data types. The idea is to build a complete toolchain of actions
that leads from the initial command to the graphical result.
Management of processes
P_ici (Intelligent Command Iterator) provides several options for the management of processes, including
The P_ici script comes in handy if you need to run a series of operations that differ in some specific input argument. The
strategy "Handle smaller files in parallel processes" is the key to saving time.
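The split-and-merge idea behind this strategy can be sketched as below. This is not how P_ici is implemented: the function names are hypothetical, and a thread pool stands in for separate processes to keep the example portable (for CPU-bound work in CPython, `ProcessPoolExecutor` would be the closer match).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Cut a list of lines into consecutive chunks of at most chunk_size."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_tokens(chunk):
    """Count whitespace-separated tokens in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def parallel_token_counts(lines, chunk_size=2, workers=4):
    """Process each chunk in its own worker, then merge the partial results."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_tokens, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    corpus = ["a b", "b c", "c c", "a"]
    print(parallel_token_counts(corpus))
```

The approach pays off when the per-chunk results can be merged cheaply, as with frequency counts; statistics that need the whole corpus at once require a final pass over the merged data.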
PMSE comes with very modest licensing fees and a liberal licensing practice. It allows institution-wide use
without the hassle of tracking the number of concurrent or named users. There is also a per-workstation licensing model,
which is even more affordable if you have fewer than 15 users. A free update service and support are, of course, included.