Satya Majumdar 1, Sergei K. Nechaev 1, 2
Physical Review E: Statistical, Nonlinear, and Soft Matter Physics 72 (2005) 020901
Determining analytically the statistics of the longest common subsequence (LCS) of a pair of random sequences drawn from an alphabet of c letters is a challenging problem in computational evolutionary biology. We present exact asymptotic results for the distribution of the LCS in a simpler, yet nontrivial, variant of the original model, called the Bernoulli matching (BM) model, which reduces to the original model in the large c limit. We show that in the BM model, for all c, the distribution of the asymptotic length of the LCS, suitably scaled, is identical to the Tracy-Widom distribution of the largest eigenvalue of a random matrix drawn from the Gaussian unitary ensemble. In particular, in the large c limit, this provides an exact expression for the asymptotic length distribution in the original LCS problem.
- 1. Laboratoire de Physique Théorique et Modèles Statistiques (LPTMS), CNRS : UMR8626, Université Paris XI, Paris Sud
- 2. L.D. Landau Institute for Theoretical Physics
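
To make the two models in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes the standard dynamic-programming recursion for the LCS length, L(i,j) = max(L(i-1,j), L(i,j-1), L(i-1,j-1) + eta(i,j)), where eta(i,j) = 1 if the i-th letter of the first sequence matches the j-th letter of the second and 0 otherwise. It further assumes that the BM model replaces these correlated match indicators with independent Bernoulli variables of success probability 1/c; the abstract does not spell this definition out. The function names `lcs_length` and `bm_length` are hypothetical.

```python
import random


def lcs_length(a, b):
    """LCS length of sequences a and b via the standard row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            eta = 1 if x == y else 0  # match indicator for this pair of letters
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + eta))
        prev = cur
    return prev[-1]


def bm_length(n, m, c, rng):
    """Same recursion, but with independent Bernoulli(1/c) match indicators.

    Assumption: this is how the Bernoulli matching model is defined;
    the independence of eta(i, j) is what distinguishes it from the true LCS.
    """
    prev = [0] * (m + 1)
    for _ in range(n):
        cur = [0]
        for j in range(1, m + 1):
            eta = 1 if rng.random() < 1.0 / c else 0
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + eta))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    rng = random.Random(0)
    n, c = 200, 4
    # One sample of each model at the same sequence length and alphabet size.
    s1 = [rng.randrange(c) for _ in range(n)]
    s2 = [rng.randrange(c) for _ in range(n)]
    print("LCS length:", lcs_length(s1, s2))
    print("BM length: ", bm_length(n, n, c, rng))
```

Repeating both samplers many times and comparing histograms of the resulting lengths is one way to explore numerically how the two models relate as c grows; the sketch makes no claim about the scaling constants, which are derived in the paper itself.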