Tuesday, 21.08.2012 17 c.t.

Burstiness and long-range correlation in natural language

by Dr. Eduardo G. Altmann
from Dynamical Systems and Social Dynamics, Max Planck Institute for the Physics of Complex Systems, Dresden

Ludwig Prandtl lecture hall


Recent temporal analyses of different large-scale databases of human activities show that two ubiquitous patterns are the intermittency in the occurrence of events (burstiness) and correlations on arbitrarily long time scales. Natural language is a prominent human activity that not only creates these temporal patterns but also reproduces the patterns of external events. In this talk I'll discuss how these two phenomena relate to each other on different linguistic scales. In particular, we explain the correlations observed in different low-level encodings (ASCII, letters, vowels, etc.) of texts by tracing their origin to the burstiness of specific words. We discuss how this burstiness depends on the semantics of the words and on the authors of the texts, and how it can be used in practical applications such as document classification and authorship recognition. Beyond this analysis, which is based mostly on literary texts, I'll also report on investigations of online discussion groups that take into account the heterogeneity of the language at the level of users and topics. An important part of the information that modern databases reveal about human activities originates from communication between people. An understanding of the dynamical properties of the language used in this communication is therefore also essential for many of the recent applications of statistical physics and dynamical systems methods to "big data".

