PetaMem Scripting Environment

PetaMem Scripting Environment (PMSE) is a software suite that allows you to perform virtually any task related to corpus linguistics. PMSE is designed as a comprehensive toolchain that gives the user a generic way of working with text corpora: data acquisition, modification (such as format conversion), thorough statistical analysis and data visualization.


The intended audience for PMSE is universities, scientific institutes and research organizations. Because of its thorough support for data formats, it can also serve as middleware between various existing tools and software, often enabling their interoperability. PMSE is designed to be language agnostic and will support the languages you use.


The software is thoroughly documented, including a reference and tutorials based on real use cases of PMSE scripts. The documentation also covers the mathematical characteristics of the applied distance metrics.

You can download the PMSE documentation in PDF format here.

Functions & Features

PMSE is a functionally rich corpus tool for web scraping, file conversion and text cleaning. It also contains methods for advanced statistical and linguistic analysis, including a generic tokenizer, full UTF-8 support, a concordancer, a co-occurrence extractor, a keyword analysis tool and a tool for computing distances between pairs of n-grams. PMSE can also perform text categorization based on hierarchical cluster analysis. The focus has been on generic, comprehensive and versatile functionality, which is why MI-score and T-score can be computed for n-grams of unrestricted length.

In addition to GraphViz, deeper use of a bridge from Perl to R is currently being tested. For a usage example, take a look at the brief walkthrough.

Text Cleaning

The most effective tool for text cleanup is P_rer (Regular Expression Replacer), which lets the user quickly edit texts from the command line. If you know what you need to remove or replace, you can specify the task as a Perl regular expression and P_rer will apply it to the file. Several tasks may be specified via an INI file.
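
The idea of applying several replacement tasks in sequence can be sketched in a few lines of Python. This is only an illustration of the technique, not P_rer itself: the function name apply_rules, the RULES list and the sample text are assumptions made for this example, and Python's re syntax differs from Perl regular expressions in some details.

```python
import re

def apply_rules(text, rules):
    """Apply (pattern, replacement) pairs to the text, in the given order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# Hypothetical cleanup tasks: remove HTML-like tags, then collapse whitespace runs.
RULES = [
    (r"<[^>]+>", ""),
    (r"\s+", " "),
]

cleaned = apply_rules("<p>Hello,   <b>world</b>!</p>", RULES).strip()
print(cleaned)
```

The order of the rules matters: each task operates on the output of the previous one.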

Data Mining

Once the text is prepared, P_csp (Comprehensive Statistics Processor) and P_gnp (Generic N-grams Processor) extract the data. P_csp extracts frequencies of occurrence, probabilities of occurrence, counts of occurrence and histograms of these values for unigrams. P_gnp extracts similar information for n-grams. Furthermore, P_gnp computes MI-score and T-score for arbitrarily long n-grams, as well as ranks and keywords.
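
To make the two association scores concrete, here is a minimal Python sketch for the bigram case only. It assumes the definitions commonly used in corpus linguistics, MI = log2(O / E) and T = (O - E) / sqrt(O), where O is the observed bigram count and E = f(w1) * f(w2) / N is the count expected under independence. The exact formulas P_gnp uses, and how it generalizes them to longer n-grams, are not stated here; the function name bigram_scores and the toy token list are made up for illustration.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {(w1, w2): (mi_score, t_score)} for all adjacent word pairs."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected count if w1 and w2 occurred independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        mi = math.log2(observed / expected)
        t = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (mi, t)
    return scores

tokens = "the cat sat on the mat the cat ran".split()
scores = bigram_scores(tokens)
mi, t = scores[("the", "cat")]
print(f"MI = {mi:.3f}, T = {t:.3f}")
```

MI tends to favour rare but exclusive combinations, while the T-score favours frequent ones, which is why the two are usually read together.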

P_dmp computes the distance between two n-grams. Around 20 distance metrics (including the well-known Euclidean and Mahalanobis distances) are currently implemented.
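
As a small illustration of what a distance metric computes, the Python sketch below implements the Euclidean distance and, for comparison, the Manhattan distance between two numeric vectors. It assumes that each n-gram is represented by a vector of numeric features (for example, frequencies in several texts); how P_dmp actually represents n-grams is not described here, and the function names and sample vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Square root of the summed squared differences between u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of the absolute differences between u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical feature vectors for two n-grams.
profile_a = [1, 2, 3]
profile_b = [4, 6, 3]

print(euclidean(profile_a, profile_b))
print(manhattan(profile_a, profile_b))
```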


Acquired data is processed by P_dvf (Data Visualization Framework). P_dvf can convert data between various formats (YAML, printed Perl data structure, Storable, plain text). Data can be easily sorted, filtered and inspected.

The developers are currently working on automated visualization of the main data types. The idea is to build a complete toolchain of actions that leads from the initial command to the graphical result.

Management of Processes

P_ici (Intelligent Command Iterator) provides several options for the management of processes, including parallelization. The P_ici script comes in handy if you need to run a series of operations that differ only in a specific input argument. The strategy "Handle smaller files in parallel processes" is the key to saving time.
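
The pattern of running the same command several times, varying one argument and letting the runs overlap, can be sketched in Python as follows. This is not how P_ici is implemented or invoked: the stand-in command (a child Python process that upper-cases its argument), the function names and the file names are assumptions chosen so that the example runs anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_one(arg):
    """Run the stand-in command for one argument and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", arg],
        capture_output=True,
        text=True,
        check=True,  # raise if the child process exits with an error
    )
    return result.stdout.strip()

def run_all(args, max_workers=4):
    """Start one child process per argument, up to max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the order of the input arguments.
        return list(pool.map(run_one, args))

# Hypothetical file names standing in for the pieces of a split corpus.
outputs = run_all(["part1.txt", "part2.txt", "part3.txt"])
print(outputs)
```

Each worker thread only waits for its child process, so the actual work happens in separate processes that the operating system can schedule in parallel.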


PMSE comes with very modest licensing fees and a liberal licensing practice. It allows institution-wide use without the hassle of keeping track of the number of concurrent or named users. There is also a per-workstation licensing model, which is even more affordable if you have fewer than 15 users. A free update service and support are of course included.


