THESEUS(1)		       Likelihood Rocks 		    THESEUS(1)



NAME
       theseus - Maximum likelihood, multiple simultaneous superpositions with
       statistical analysis

SYNOPSIS
       theseus	[-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]   pdbfile1
       [pdbfile2 ...]

       and

       theseus_align   [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]  -f
       pdbfile1 [pdbfile2 ...]

       Default usage is equivalent to:

       theseus -a0 -e2 -g1 -i200 -k-1 -p1e-7 -r theseus -v -P0 your.pdb

DESCRIPTION
       Theseus superpositions a set of	macromolecular	structures  simultane-
       ously using the method of maximum likelihood (ML), rather than the con-
       ventional least-squares criterion.  Theseus assumes that the structures
       are  distributed  according  to a matrix Gaussian distribution and that
       the eigenvalues of the atomic covariance matrix are hierarchically dis-
       tributed according to an inverse gamma distribution. This ML superposi-
       tioning model produces much more accurate results by essentially  down-
       weighting variable regions of the structures and by correcting for cor-
       relations among atoms.

       Theseus operates in two main modes, a mode for superimposing structures
       with  identical	sequences  and	a  mode  for structures with different
       sequences but similar structures:

	      (1) A mode for superpositioning  macromolecules  with  identical
	      sequences and numbers of residues, for instance, multiple models
	      in an NMR family or multiple structures from  different  crystal
	      forms of the same protein. In this mode, Theseus will read every
	      model in every file on the command line and superposition  them.

	      Example:

	      theseus 1s40.pdb

	      In the above example, 1s40.pdb is a pdb file of 10 NMR models.

	      (2)  An  "alignment"  mode  for superpositioning structures with
	      different sequences, for example,  multiple  structures  of  the
	      cytochrome  c protein from different species or multiple mutated
	      structures of hen egg white lysozyme.  This  mode  requires  the
	      user to supply a sequence alignment file of the structures being
	      superpositioned (see option -A and "FILE FORMATS" below).  Addi-
	      tionally,  it  may  be  necessary to supply a mapfile that tells
	      theseus which PDB structure files correspond to which  sequences
	      in  the alignment (see option -M and "FILE FORMATS" below). When
	      superpositioning based on a sequence alignment, theseus  uses  a
	      novel maximum likelihood algorithm for superpositioning multiple
	      structures that include arbitrary gaps and  insertions  relative
	      to  each other.  Unlike other algorithms for simultaneous super-
	      positioning of multiple structures, our Expectation-Maximization
	      algorithm  uses  all  available  data  by including all residues
	      aligned with gaps in the calculations.  In this mode,  if  there
	      are multiple structural models in a PDB file, theseus only reads
	      the first model in each file  on	the  command  line.  In  other
	      words,  theseus treats the files on the command line as if there
	      were only one structure per file.

	      Example 1:

	      theseus -A  cytc.aln  -M	cytc.filemap  d1cih__.pdb  d1csu__.pdb
	      d1kyow_.pdb

	      In  the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb
	      are pdb files of cytochrome c domains from the SCOP database.

	      Example 2:

	      theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

	      In this example, the theseus_align script is called  to  do  the
	      hard  work  for you.  It will calculate a sequence alignment and
	      then superimpose based  on  that	alignment.   The  script  the-
	      seus_align takes the same options as the theseus program.  Note,
	      the first few lines of this script must  be  modified  for  your
	      system,  since  it calls an external multiple sequence alignment
	      program to do the alignment.  See the  examples/	directory  for
	      more details, including example files.

OPTIONS
   Algorithmic options, defaults in {brackets}:
       --amber
	      Do special processing for AMBER8 formatted PDB files

	      Most users will never need this long option; it is intended  for
	      processing  MD  traces  from AMBER, which puts the atom names in
	      the wrong column of the PDB file.


       -a [selection]
	      Atoms  to  include  in the superposition.  This option takes two
	      types of arguments, either (1) a number specifying a preselected
	      set  of atom types, or (2) an explicit PDB-style, colon-delimited
	      list of the atoms to include.

	      For the preselected atom type  subsets,  the  following  integer
	      options are available:

	       o 0, alpha carbons for proteins, C1' atoms for nucleic acids
	       o 1, backbone
	       o 2, all
	       o 3, alpha and beta carbons
	       o 4, all heavy atoms (no hydrogens)

	      Note,  only  the	-a0  option is available when superpositioning
	      structures with different sequences.

	      To custom select an explicit set of atom types, the  atom  types
	      must  be	specified  exactly  as	given  in  the PDB file field,
	      including spaces, and the atom types must be enclosed in
	      quotation marks.  Multiple atom types must be delimited by a colon.
	      For example,

	      -a' N  : CA : C  : O  '

	      would specify the atom types in the peptide backbone.



       -c     Use ML  atomic  covariance  weighting  (fit  correlations,  much
	      slower)

	      Unless  you  have  many  different structures with few residues,
	      fitting the correlation matrix is likely	unwarranted  statisti-
	      cally due to a plethora of parameters and a paucity of data.


       -e [n] Embedding algorithm for initializing the average structure
	       o 0 = none; use randomly chosen model
	       o {2} = {ML embedded structure}


       -f     Only read the first model of a multi-model PDB file


       -g [n] Hierarchical model for variances
	       o 0 = none (may not converge)
	       o {1} = inverse gamma distribution


       -h     Help/usage


       -i [nnn]
	      Maximum iterations, {200}


       -k [n] Constant minimum variance, {-1}.  If set to a negative value, the
	      minimum variance is determined empirically.


       -p [precision]
	      Requested relative precision for convergence, {1e-7}


       -r [root name]
	      Root name to be used in naming the output files, {theseus}


       -s [n-n:...]
	      Residue selection (e.g. -s15-45:50-55), {all}


       -S [n-n:...]
	      Residues to exclude (e.g. -S15-45:50-55) {none}

	      The previous two options	have  the  same  format.  Residue  (or
	      alignment column) ranges are indicated by beginning and end sep-
	      arated by a dash.  Multiple ranges, in any arbitrary order,  are
	      separated by a colon.  Chains may also be selected by giving the
	      chain ID immediately preceding the residue range.  For  example,
	      -sA1-20:A40-71  will  only  include residues 1 through 20 and 40
	      through 71 in chain A. Chains cannot be specified when
	      superpositioning structures with different sequences.
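
	      The selection format described above can be illustrated with a
	      small shell sketch.  The helper describe_selection is
	      hypothetical and not part of theseus; it assumes single-letter
	      chain IDs and non-negative residue numbers, and simply prints
	      the chain, beginning, and end of each colon-separated range.

```shell
# Hypothetical helper (not part of theseus): print one line per range in a
# -s/-S style selection string as "chain start end" ("-" means no chain ID).
describe_selection() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r range; do
        case $range in
            [A-Za-z]*) chain=${range%"${range#?}"}; range=${range#?} ;;
            *)         chain=- ;;
        esac
        # Text before the first dash is the start, text after it is the end.
        printf '%s %s %s\n' "$chain" "${range%%-*}" "${range#*-}"
    done
}

describe_selection 'A1-20:A40-71'
# prints:
# A 1 20
# A 40 71
```

	      Checking a selection string this way before passing it to -s
	      or -S can catch a mistyped range early.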


       -v     Use ML variance weighting (no correlations) {default}


   Input/output options:
       -A [sequence alignment file]
	      Sequence	alignment  file to use as a guide (CLUSTAL or A2M for-
	      mat)

	      For  use	when  superpositioning	 structures   with   different
	      sequences.  See "FILE FORMATS" below.


       -E     Print expert options


       -F     Print FASTA files of the sequences in PDB files and quit

	      A  useful option when superpositioning structures with different
	      sequences.  The files output with this  option  can  be  aligned
	      with  a  multiple  sequence alignment program such as CLUSTAL or
	      MUSCLE, and the resulting output alignment file used as  theseus
	      input with the -A option.


       -h     Help/usage


       -I     Just calculate statistics for input file; don't superposition


       -M [mapfile]
	      File that maps PDB files to sequences in the alignment.

	      A  simple  two-column  formatted file; see "FILE FORMATS" below.
	      Used with mode 2.


       -n     Don't write transformed pdb file


       -o [reference structure]
	      Reference file to superposition on, all rotations  are  relative
	      to the first model in this file

	      For   example,   'theseus   -o   cytc1.pdb  cytc1.pdb  cytc2.pdb
	      cytc3.pdb' will superposition  the  structures  and  rotate  the
	      entire  final superposition so that the structure from cytc1.pdb
	      is in the same orientation as  the  structure  in  the  original
	      cytc1.pdb PDB file.


       -O     Olve's segID file

	      Useful  output  when  superpositioning structures with different
	      sequences (mode  2).   In  'theseus_sup.pdb',  the  main	output
	      superposition  PDB file, the segID field now holds the number of
	      the sequence alignment column that it belongs to.  This  number,
	      divided  by  100,  is  also  echoed in the B-factor field.  When
	      using O (or any other capable molecular visualization  program),
	      one can then color by B-factor ranges and immediately see in the
	      superposition which regions of the structure are aligned in  the
	      sequence	alignment  file.   An  additional file is also output,
	      called 'theseus_olve.pdb' which only  contains  the  very  atoms
	      that  were  included  in the ML superposition calculation.  That
	      is, it will only contain alpha carbons or phosphorus atoms,  and
	      it will only contain atoms from the columns selected with the -s
	      or -S options.  Requested by Olve Peersen  of  Colorado  State
	      University.


       -V     Version


   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)


       -P [nnn]
	      Number of principal components to calculate {0}


	      In  both	of the above, the corresponding principal component is
	      written in the B-factor field of the output  PDB	file.  Usually
	      only the first few PCs are of any interest (maybe up to six).

EXAMPLES
       theseus 2sdf.pdb


       theseus -l -r new2sdf 2sdf.pdb


       theseus -s15-45 -P3 2sdf.pdb


       theseus	-A  cytc.aln  -M  cytc.mapfile	-o  cytc1.pdb -s1-40 cytc1.pdb
       cytc2.pdb cytc3.pdb cytc4.pdb

ENVIRONMENT
       You can set the environment variable 'PDBDIR' to your PDB  file	direc-
       tory  and  theseus will look there after the present working directory.
       For example, in the C shell (tcsh or csh), you can put  something  akin
       to this in your .cshrc file:

       setenv PDBDIR '/usr/share/pdbs/'
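
       For Bourne-style shells (sh, bash, and similar), a roughly equivalent
       setting is sketched below.  Which startup file it belongs in (for
       example .profile or .bashrc) depends on your system and is an
       assumption here; the directory is the same illustrative path as in
       the csh line.

```shell
# Bourne-style counterpart of the csh setenv line; the path is illustrative.
PDBDIR='/usr/share/pdbs/'
export PDBDIR
```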


FILE FORMATS
       Theseus	  will	  read	  standard    PDB    formatted	  files   (see
       <http://www.rcsb.org/pdb/>).  Every effort has been made for  the  pro-
       gram to accept nonstandard CNS and X-PLOR file formats also.

       Two  other  files deserve mention, a sequence alignment file and a map-
       file.


   Sequence alignment file
       When superpositioning  structures  with	different  residue  identities
       (where  the  lengths  of  the  macromolecules in terms of residues are
       not necessarily equal), a sequence alignment file must be included  for
       theseus	to  use  as  a	guide  (specified  by the -A option).  Theseus
       accepts both CLUSTAL and A2M (FASTA) formatted multiple sequence align-
       ment files.


       NOTE  1:  The  residue sequence in the alignment must match exactly the
       residue sequence given in the coordinates of the  PDB  file.  That  is,
       there can be no missing or extra residues that do not correspond to the
       sequence in the PDB file. An easy way to  ensure  that  your  sequences
       exactly match the PDB files is to generate the sequences using theseus'
       -F option, which writes out a FASTA  formatted  sequence  file  of  the
       chain(s)  in  the PDB files. The files output with this option can then
       be aligned with a multiple sequence alignment program such  as  CLUSTAL
       or  MUSCLE,  and  the  resulting  output alignment file used as theseus
       input with the -A option.


       NOTE 2: Every PDB file must have a corresponding sequence in the align-
       ment.   However,  not  every  sequence in the alignment needs to have a
       corresponding PDB file. That is, there can be extra  sequences  in  the
       alignment that are not used for guiding the superposition.


   PDB -> Sequence mapfile
       If  the	names  of  the	PDB  files  and the names of the corresponding
       sequences in the alignment are identical, the mapfile may  be  omitted.
       Otherwise,  Theseus needs to know which sequences in the alignment file
       correspond to which PDB structure files. This information  is  included
       in  a mapfile with a very simple format (specified with the -M option).
       There are only two columns separated by whitespace:  the  first	column
       lists  the  names  of  the PDB structure files, while the second column
       lists the corresponding sequence names exactly as given in the multiple
       sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3
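
       A mapfile in this two-column format can also be written from the
       shell.  The sketch below uses a hypothetical helper, make_mapfile
       (not part of theseus), which takes alternating PDB file names and
       sequence names and prints one whitespace-separated pair per line.

```shell
# Hypothetical helper (not part of theseus): print a two-column mapfile from
# alternating arguments: pdbfile1 seqname1 pdbfile2 seqname2 ...
make_mapfile() {
    while [ "$#" -ge 2 ]; do
        printf '%s\t%s\n' "$1" "$2"
        shift 2
    done
}

# Reproduces the example mapfile shown above (tab-separated).
make_mapfile cytc1.pdb seq1 cytc2.pdb seq2 cytc3.pdb seq3
```

       Redirecting the output to a file (for instance cytc.mapfile) gives
       a file that can be passed to theseus with the -M option.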


SCREEN OUTPUT
       Theseus	provides output describing both the progress of the superposi-
       tioning and several statistics for the final result:


       Least-squares <sigma>:
	      The standard deviation for the superposition, based on the  con-
	      ventional  assumption  of  no  correlation  and equal variances.
	      Basically equal to the RMSD from the average structure.


       Classical LS pairwise <RMSD>:
	      The conventional RMSD for the superposition,  the  average  RMSD
	      for all pairwise combinations of structures in the ensemble.


       Maximum Likelihood <sigma>:
	      The  ML  analog of the standard deviation for the superposition.
	      When assuming that the correlations are zero (a diagonal covari-
	      ance  matrix),  this is equal to the square root of the harmonic
	      average of the variances for each atom. In contrast, the 'Least-
	      squares  <sigma>'  given	above  reports	the square root of the
	      arithmetic average of the variances.  The  harmonic  average  is
	      never greater than the arithmetic average (the two are equal only
	      when all variances are equal), and it downweights large values in
	      proportion to their magnitude.
	      This  makes  sense  statistically, because when combining values
	      one should weight them  by  the  reciprocal  of  their  variance
	      (which is in fact what the ML superpositioning method does).
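
	      The difference between the two averages can be seen with a
	      short awk sketch.  It assumes a diagonal covariance matrix and
	      a hypothetical input of one positive variance per line; the
	      helper name sigma_summary is illustrative and not part of
	      theseus, and the numbers are made up.

```shell
# Hypothetical helper: read one variance per line and print the square root
# of the arithmetic mean (LS-style) and of the harmonic mean (ML-style).
sigma_summary() {
    awk '{ sum += $1; inv += 1 / $1; n++ }
         END { printf "LS %.4f ML %.4f\n", sqrt(sum / n), sqrt(n / inv) }'
}

# Two well-determined atoms (variance 1) and two variable atoms (variance 4).
printf '%s\n' 1 1 4 4 | sigma_summary
# prints: LS 1.5811 ML 1.2649
```

	      The two large variances pull the arithmetic average up far
	      more than the harmonic one, which is why the ML value is the
	      smaller of the two.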


       Log Likelihood:
	      The  final  log  likelihood  of  the superposition, assuming the
	      matrix Gaussian distribution of the structures and  the  hierar-
	      chical  inverse  gamma  distribution  of	the eigenvalues of the
	      covariance matrix.


       AIC:   The Akaike Information Criterion for  the  final	superposition.
	      This  is an important statistic in likelihood analysis and model
	      selection theory. It allows an objective comparison of  multiple
	      theoretical models with different numbers of parameters. In this
	      case, the higher the number the  better.	There  is  a  tradeoff
	      between  fit to the data and the number of parameters being fit.
	      Increasing the number of parameters in a model will always  give
	      a  better fit to the data, but it also increases the uncertainty
	      of the estimated values.	The AIC criterion finds the best  com-
	      bination	by  (1) maximizing the fit to the data while (2) mini-
	      mizing the uncertainty due to the number of parameters.  In  the
	      superposition case, one can compare the least squares superposi-
	      tion to the maximum likelihood  superposition.  The  method  (or
	      model) with the higher AIC is preferred. A difference in the AIC
	      of 2 or more is considered strong statistical evidence  for  the
	      better model.
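
	      As a sketch of this comparison, the hypothetical helper below
	      (compare_aic, not part of theseus) takes the AIC reported for
	      a least-squares run and for an ML run, follows the convention
	      stated above that the higher value is preferred, and flags a
	      difference of 2 or more.  The numbers are invented for
	      illustration.

```shell
# Hypothetical helper: compare two AIC values in the convention used here
# (higher is better). Arguments: AIC of the LS fit, AIC of the ML fit.
compare_aic() {
    awk -v ls="$1" -v ml="$2" 'BEGIN {
        diff = ml - ls
        better = (diff > 0) ? "ML" : ((diff < 0) ? "LS" : "neither")
        if (diff < 0) diff = -diff
        strong = (diff >= 2) ? " (strong)" : ""
        printf "%s preferred, |dAIC| = %.2f%s\n", better, diff, strong
    }'
}

compare_aic -1250.4 -1100.9
# prints: ML preferred, |dAIC| = 149.50 (strong)
```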


       BIC:   The Bayesian Information Criterion. Similar to the AIC, but with
	      a Bayesian emphasis.


       Rotational, translational, covar chi^2:
	      The reduced chi-squared statistic for the fit of the  structures
	      to  the model.  With a good fit it should be close to 1.0, which
	      indicates a perfect fit of the data to  the  statistical	model.
	      In  the  case  of  least-squares,  the assumed model is a matrix
	      Gaussian distribution of the structures with equal variances and
	      no correlations.	For the ML fits, the assumed models can either
	      be (1) unequal variances and no correlations, as calculated with
	      the  -v  option  [default] or (2) unequal variances and correla-
	      tions, as calculated with the -c option.	This statistic is  for
	      the  superposition  only,  and  does  not include the fit of the
	      covariance matrix eigenvalues to an inverse gamma  distribution.
	      See 'Omnibus chi^2' below.


       Hierarchical minimum var:
	      The  hierarchical  fit  of  the  inverse gamma distribution con-
	      strains the variances of the atoms by making large ones  smaller
	      and  small ones larger.  This statistic reports the minimum pos-
	      sible variance given the inferred inverse gamma parameters.


       Hierarchical var (alpha, gamma) chi^2:
	      The reduced chi-squared for the inverse gamma fit of the covari-
	      ance  matrix  eigenvalues. As before, it should ideally be close
	      to 1.0.  The two values in the parentheses are the ML  estimates
	      of the scale and shape parameters, respectively, for the inverse
	      gamma distribution.


       Omnibus chi^2:
	      The overall reduced chi-squared statistic for  the  entire  fit,
	      including  the  rotations,  translations,  covariances,  and the
	      inverse gamma parameters. This is probably  the  most  important
	      statistic  for  the  superposition.  In  some cases, the inverse
	      gamma fit may be poor, yet the overall fit is still  very  good.
	      Again, it should ideally be close to 1.0, which would indicate a
	      perfect fit. However, if you think it is too large, make sure to
	      compare it to the chi^2 for the least-squares fit; it's probably
	      not that bad after all.  A large chi^2 often indicates a	viola-
	      tion of the assumptions of the model.  The most common violation
	      is when superpositioning two or more  independent  domains  that
	      can  rotate  relative  to  each other. If this is the case, then
	      there will likely be not just  one  Gaussian  distribution,  but
	      several mixed Gaussians, one for each domain.  Then, it would be
	      better to superposition each domain independently.


       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
	      The skewness and kurtosis of the residuals. Both should  be  0.0
	      if  the  residuals  fit a Gaussian distribution perfectly.  They
	      are followed by the P-value for the statistics. This is  a  very
	      stringent  test;	residuals can be very non-Gaussian and yet the
	      estimated rotations, translations,  and  covariance  matrix  may
	      still be rather accurate.


       FP error in transformed coordinates:
	      The  empirically	determined floating point error in the coordi-
	      nates after rotation and translation.


       Minimum RMSD error per atom:
	      The empirically determined minimum RMSD error per atom, based on
	      the floating point error of the computer.


       Data pts, Free params, D/P:
	      The  total  number of data points given all observed structures,
	      the number of parameters being fit in the model, and  the  data-
	      to-parameter ratio.


       Median structure:
	      The structure that is overall most similar to the average struc-
	      ture. This can be considered to be the most "typical"  structure
	      in the ensemble.


       Total rounds:
	      The number of iterations that the algorithm took to converge.


       Fractional precision:
	      The actual precision that the algorithm converged to.


OUTPUT FILES
       Theseus writes out the following files:


       theseus_sup.pdb
	      The  final  superposition,  rotated to the principal axes of the
	      mean structure.


       theseus_ave.pdb
	      The estimate of the mean structure.


       theseus_cor.mat, theseus_cov.mat
	      The atomic correlation and covariance matrices, based on
	      the  final  superposition.  The  format is suitable for input to
	      GNU's octave.  These are the matrices used in the Principal Com-
	      ponents Analysis.


       theseus_embed_ave.pdb
	      The  average structure as calculated by S. Lele's EDMA embedding
	      algorithm, used as the starting point for the maximum likelihood
	      iterations.


       theseus_residuals.txt
	      The normalized residuals of the superposition. These can be ana-
	      lyzed for deviations from normality (whether they fit a standard
	      Gaussian	distribution). E.g., the chi^2, skewness, and kurtosis
	      statistics are based on these values.
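
	      A rough normality check of such values can be sketched with
	      awk.  The helper residual_moments is hypothetical and not part
	      of theseus; it assumes one residual per line, which may not
	      match the actual layout of theseus_residuals.txt, and it uses
	      simple moment estimators that may differ from those theseus
	      reports.  Kurtosis is given as excess kurtosis, so both
	      numbers are 0.0 for a perfect Gaussian.

```shell
# Hypothetical helper: read one value per line and print the sample skewness
# and excess kurtosis from the central moments (no bias correction).
residual_moments() {
    awk '{ x[NR] = $1; sum += $1 }
         END {
             n = NR; mean = sum / n
             for (i = 1; i <= n; i++) {
                 d = x[i] - mean
                 m2 += d * d; m3 += d * d * d; m4 += d * d * d * d
             }
             m2 /= n; m3 /= n; m4 /= n
             printf "skewness %.4f kurtosis %.4f\n", m3 / (m2 ^ 1.5), m4 / (m2 * m2) - 3
         }'
}

# A tiny symmetric sample: skewness is zero, tails are lighter than Gaussian.
printf '%s\n' -1 0 1 | residual_moments
# prints: skewness 0.0000 kurtosis -1.5000
```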


       theseus_transf.txt
	      The final transformation rotation matrices and translation  vec-
	      tors.


       theseus_variances.txt
	      The vector of estimated variances for each atom.


       When Principal Components are calculated (with the -P option), the fol-
       lowing files are also produced:


       theseus_pcvecs.txt
	      The principal component vectors.


       theseus_pcstats.txt
	      Simple statistics for each principal component (loadings,
	      variance explained, etc.).


       theseus_pcN_ave.pdb
	      The  average  structure with the Nth principal component written
	      in the temperature factor field.


       theseus_pcN.pdb
	      The final superposition with the Nth principal component written
	      in  the  temperature  factor  field.   This file is omitted when
	      superpositioning	molecules  with  different  residue  sequences
	      (mode 2).



BUGS
       Please send me (DLT) reports of all problems.


RESTRICTIONS
       Theseus	is  not  a  structural alignment program.  The structure-based
       alignment problem is completely different from the structural  superpo-
       sition  problem.  In order to do a structural superposition, there must
       be a 1-to-1 mapping that associates the atoms in one structure with the
       atoms  in  the other structures.  In the simplest case, this means that
       structures must have equivalent numbers of atoms, such as the models in
       an   NMR   PDB	file.	 For  structures  with	different  numbers  of
       residues/atoms, superpositioning is only possible  when	the  sequences
       have  been  aligned  previously.   Finding  the best sequence alignment
       based on only structural information is a difficult  problem,  and  one
       for which there is currently no maximum likelihood approach.  Extending
       theseus to address the  structural  alignment  problem  is  an  ongoing
       research project.


AUTHOR
       Douglas L. Theobald
       dtheobald@brandeis.edu


CITATION
       When using theseus in publications please cite:


       Douglas L. Theobald and Phillip A. Steindel (2012)
       "Optimal  simultaneous  superpositioning  of  multiple  structures with
       missing data."
       Bioinformatics 28(15):1972-1979

       The following papers also report theseus developments:


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "Empirical Bayes models for regularizing maximum likelihood  estimation
       in the matrix Gaussian Procrustes problem."
       PNAS 103(49):18521-18527


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "THESEUS:  Maximum  likelihood  superpositioning and analysis of macro-
       molecular structures."
       Bioinformatics 22(17):2171-2172


       Douglas L. Theobald and Deborah S. Wuttke (2008)
       "Accurate structural correlations from  maximum	likelihood  superposi-
       tions."
       PLoS Computational Biology 4(2):e43


HISTORY
       Long, tedious, and sordid.



Brandeis University		11 October 2012 		    THESEUS(1)
