StupidBeauty
Read times: 1917. Posted at: Mon Apr 4 20:06:06 2011

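Translation of the PHYLIP 3.69 documentation: main document

Only selected parts of the document are translated here. The original of this article can be found on Kate dNA's blog: http://stupidbeauty.com/KNA/2011/04/ phylip3-69文档翻译:主文档/

The document is long, so it is being translated a little at a time. Current progress: page 15 of 52.

PHYLIP

Phylogeny Inference Package

Version 3.69

September 2009

by Joseph Felsenstein

Department of Genome Sciences and Department of Biology
University of Washington
Box 355065
Seattle, WA 98195-5065
USA

E-mail address: joe (at) gs.washington.edu

Contents of this document

Table of contents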

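A Brief Description of the Programs

The Documentation Files and How to Read Them

What The Programs Do

Running the Programs

A word about input files

Running the programs on a Unix or Linux system

Running the programs in background or under control of a command file

Preparing Input Files

Input and output files

Data file format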

The Menu

The Output File

The Tree File

The Options and How To Invoke Them

Common options in the menu

The Algorithm for Constructing Trees

Local rearrangements

Global rearrangements

Multiple jumbles

Saving multiple tied trees

Strategy for finding the best tree

A Warning on Interpreting Results

General Comments on Adapting

Compiling the programs

Unix and Linux

Parallel computers

Other computer systems

Frequently Asked Questions

Problems that are encountered

How to make it do various things

Background information needed:

Questions about distribution and citation:

Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...

New Features in This Version

Coming Attractions, Future Plans

Other Phylogeny Programs Available Elsewhere

How You Can Help Me

In Case of Trouble

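A Brief Description of the Programs

PHYLIP, the Phylogeny Inference Package, is a package of programs for inferring phylogenies (evolutionary trees). It has been distributed since 1980 and has over 20,000 registered users, making it the most widely distributed package of phylogeny programs. It is available free from its web site:

http://evolution.gs.washington.edu/phylip.html

PHYLIP is distributed as C source code, and also as executables for some common computer systems. It can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood. It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices. It can handle data that are nucleotide sequences, protein sequences, gene frequencies, restriction sites, restriction fragments, distances, discrete characters, and continuous characters.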

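The Documentation Files and How to Read Them

PHYLIP comes with extensive documentation. It includes the main document (the one you are reading), which you should read fairly completely. In addition there are documents for groups of programs: the molecular sequence programs, the distance matrix programs, the gene frequency and continuous characters programs, the discrete characters programs, and the tree drawing programs. Finally, each program has its own documentation. References for all of the documentation are gathered in this main document. A good way to read them is:

  1. Read this main document.

  2. Tentatively decide which programs are of interest to you.

  3. Read the documentation for the groups of programs that contain them.

  4. Read the documentation for those individual programs.

There is also an excellent guide to using PHYLIP 3.6. It was written by Jarno Tuimala of the Center for Scientific Computing in Espoo, Finland, and is available for download as a PDF document.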

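What The Programs Do

Here is a short description of each program. For fuller detail you should of course read the documentation for the individual program and for the group of programs it belongs to. In the original list, the name of each program is a link that takes you to the documentation for that program. Note that there is no program in the PHYLIP package called PHYLIP.

Protpars

Estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant that counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.

Dnapars

Estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state. It can also do transversion parsimony. Can cope with multifurcations, reconstruct ancestral states, use 0/1 character weights, and infer branch lengths.

Dnamove

Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases. This can be used to find parsimony or compatibility estimates by hand.

Dnapenny

Finds all of the most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search. Depending on the data, this may not be practical for more than 10-11 species.

Dnacomp

Estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites that could have all states (nucleotides) uniquely evolved on the same tree. Compatibility is particularly appropriate when sites vary greatly in their rates of evolution, but we do not know in advance which are the less reliable ones.

Dnainvar

For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns. Lake's invariants are the method which he calls "evolutionary parsimony".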

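Dnaml

Estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites. It can also use a Hidden Markov model of rates, with the program inferring which sites have which rates. This also allows gamma-distribution and gamma-plus-invariant sites distributions of rates across sites.

Dnamlk

Same as Dnaml but assumes a molecular clock. Using the two programs together permits a likelihood ratio test of the molecular clock hypothesis.

Proml

Estimates phylogenies from protein amino acid sequences by maximum likelihood. The PAM, JTT, or PMB models can be employed. It can also use a Hidden Markov model of rates, with the program inferring which sites have which rates. This also allows gamma-distribution and gamma-plus-invariant sites distributions of rates across sites. It also allows different rates of change at known sites.

Promlk

Same as Proml but assumes a molecular clock. Using the two programs together permits a likelihood ratio test of the molecular clock hypothesis.

Dnadist

Computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs. The distances are the Jukes-Cantor formula, one based on Kimura's 2-parameter method, the F84 model used in Dnaml, and the LogDet distance. The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model. The program can also make a table of similarity among sequences.

Protdist

Computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, the JTT matrix model, the PMB model, Kimura's 1983 approximation to these, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model. The program can also make a table of similarity among sequences. The distances can be used in the distance matrix programs.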

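Restdist

Computes distances from restriction sites data or restriction fragments data. The restriction sites option is also the one to use to make distances for RAPDs or AFLPs.

Restml

Estimates phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites). It employs the Jukes-Cantor symmetrical model of nucleotide change, which does not allow for differences of rate between transitions and transversions. This program is slow.

Seqboot

Reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program Consense to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permuting species within characters. It can also rewrite a data set to convert it between the PHYLIP Interleaved and Sequential forms, and into a preliminary version of a new XML sequence alignment format which is under development and which is described in the Seqboot documentation web page.

Fitch

Estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. Uses the Fitch-Margoliash criterion and some related least squares criteria, or the Minimum Evolution distance matrix method. Does not assume an evolutionary clock. This program is useful with distances computed from molecular sequences, with restriction sites or fragments distances, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

Kitsch

Estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. The Fitch-Margoliash criterion and other least squares criteria, or the Minimum Evolution criterion, are possible. This program is useful with distances computed from molecular sequences, with restriction sites or fragments distances, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

Neighbor

An implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method" and of the UPGMA (Average Linkage clustering) method. Neighbor Joining is a distance matrix method that produces an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion, but the methods are very fast and thus can handle very large data sets.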

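Contml

Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative way of analyzing such data is to compute Nei's genetic distance and use one of the distance matrix programs. This program can also do maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, but it assumes that the characters evolve at equal rates and in an uncorrelated fashion, so it does not take into account the usual correlations of characters.

Gendist

Computes one of three different genetic distance formulas from gene frequency data. The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The first is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.

Contrast

Reads a tree from a tree file, and a data set with continuous characters data, and produces the independent contrasts for those characters, for use in any multivariate statistics package. Will also produce covariances, regressions and correlations between characters for those contrasts. Can also correct for within-species sampling variation when individual phenotypes are available within a population.

Pars

Multistate discrete-characters parsimony method. Up to 8 states (as well as "?") are allowed. Cannot do Camin-Sokal or Dollo parsimony. Can cope with multifurcations, reconstruct ancestral states, use character weights, and infer branch lengths.

Mix

Estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Allows use of the Wagner parsimony method, the Camin-Sokal parsimony method, and arbitrary mixtures of these. Also reconstructs ancestral states and allows weighting of characters (does not infer branch lengths).

Move

Interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

Penny

Finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria, using the branch-and-bound method of exact search. Depending on the data, this may be impractical for more than 10-11 species.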

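Dollop

Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters. Dollo parsimony is particularly appropriate for restriction sites data; with ancestor states specified as unknown it may be appropriate for restriction fragments data.

Dolmove

Interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

Dolpenny

Finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria, using the branch-and-bound method of exact search. Depending on the data, this may be impractical for more than 10-11 species.

Clique

Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states. The largest clique (or all cliques within a given size range of the largest one) is found by a very fast branch and bound search method. The method does not allow for missing data. For such cases the T (Threshold) option of Pars or Mix may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but it is not known in advance which ones are which.

Factor

Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham. This program was formerly used to accommodate multistate characters in Mix, but this is no longer necessary now that Pars is available.

Drawgram

Plots rooted phylogenies, cladograms, circular trees and phenograms in a wide variety of user-controllable formats. The program is interactive and allows previewing of the tree on PC, Macintosh, or X Windows screens, or on Tektronix or Digital graphics terminals. Final output can be a file formatted for one of the drawing programs, for a ray-tracing or VRML browser, or for sending to a laser printer (such as Postscript or PCL-compatible printers), a graphics screen or terminal, a pen plotter, or a dot matrix printer capable of graphics.

Drawtree

Similar to Drawgram but plots unrooted phylogenies.

Treedist

Computes the Branch Score distance between trees, which allows for differences in tree topology and which also makes use of branch lengths. Also computes the Robinson-Foulds symmetric difference distance between trees, which allows for differences in tree topology but does not use branch lengths.

Consense

Computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree. It is not able to compute the Adams consensus tree. Trees are input from a tree file in standard nested-parenthesis notation, which is produced by many of the tree estimation programs in the package. This program can be used as the final step in doing bootstrap analyses for many of the methods in the package.

Retree

Reads in a tree (with branch lengths if necessary) and allows you to reroot the tree, to flip branches, to change species names and branch lengths, and then to write the result out. Can be used to convert between rooted and unrooted trees, and to write the tree into a preliminary version of a new XML tree file format which is under development and which is described in the Retree documentation web page.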

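Running the Programs

This section assumes that you have obtained PHYLIP as compiled executables (for Windows, Mac OS X, or Linux), or have obtained the source code and compiled it yourself (for Linux, Unix, Mac OS X, Windows, or OpenVMS). For machines for which compiled executables are available, there will usually be no need for you to have a compiler or to compile the programs yourself. This section describes how to run the programs. Later in this document we discuss how to download and install PHYLIP (in case you are somehow reading this without yet having done that). Normally you will only read this document after downloading and installing PHYLIP.

A word about input files

For all of these types of machines, it is important to have the input files for the programs (typically data files) prepared in advance. They can be prepared in any editor, but be careful to save them in Text Only ("flat ASCII") format, not in the format that word processors such as Microsoft Word want to write (in Microsoft Word, make sure that the data encoding is "US ASCII", as use of any Unicode encoding may cause trouble). It is up to you to read the PHYLIP documentation that describes the file formats the programs use. There is a partial description in the next section of this document. The input files can also be obtained by running a program that produces output in PHYLIP format (some of the programs in PHYLIP do this, and so do programs by others, such as the sequence alignment program ClustalW and the sequence format conversion program Readseq). No program in PHYLIP provides an input file editor (do not expect to start one of the programs and then click the mouse somewhere to create a data file).

When they start running, the programs look first for input files with particular names (such as infile, treefile, intree, or fontfile). Exactly which file names they look for varies from program to program, and you should read the documentation for the particular program to find out. If you have files with those names, the programs will use them and not ask you for a file name. If they do not find files with those names, the programs will say that they cannot find a file of that name, and ask you to type in the file name. For example, if Dnaml looks for the file infile and does not find it, it prints the message: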

dnaml: can't find input file "infile"
Please enter a new file name>

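This does not mean that an error has occurred. All you need to do is type in the name of the file.

The program looks for the input files in the same folder that the program is in (a folder is the same thing as a "directory"). In Windows, Mac OS X, Linux, or Unix, when a program asks you for a file name you can type in the path to the file as part of the file name (thus, if the file is in the folder above the current folder, you can type in ../myfile.dna as the file name). If you do not know what a "folder" is, or what "above" means, then you are a member of the new generation who just clicks the mouse and hopes that a list of file names will magically appear. (Typically such people have no idea where the files are on their system, and their file systems accumulate great clutter as a result.) In that case you should ask someone to explain folders to you.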

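Running the programs on a Unix or Linux system

Type the name of the program in lower-case letters (such as dnaml). To interrupt the program while it is running, type Control-C (hold down the Ctrl key and press C).

On some systems you may need to type ./ before the program name, so that the above example would be ./dnaml. This is mainly because some users do not have the current directory in their PATH, usually for security reasons.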

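Running the programs in background or under control of a command file

In running the programs, you may sometimes want to put them in background so that you can get on with other work. On systems with a windowing environment they can be put in their own window, and commands like the Unix and Linux nice command can be used to give them lower priority so that they do not interfere with interactive applications in other windows. This part of the discussion assumes either a Windows system or a Unix or Linux system. I will note when a command works on one of these systems but not the other. Mac OS X is actually Unix (it really is), so the same things can be done there, if need be using a Terminal window.

If there is no windowing environment, on a Unix or Linux system you can put an ampersand (&) after the command when invoking it, which puts the job in the background. You will have to put all the responses to the interactive menu of the program into a file and tell the background job to read its input from that file.

On Windows systems there is no & or nice command, but input and output redirection and command files work fine in a Command window. A command file can be invoked by clicking on its icon, or by typing its name in a Command window. The command file must have a name ending in .bat, such as foofile.bat. You can run the batch file by typing its name (such as foofile) in a Command window, without the .bat.

Here is an example for Windows, Linux, or Mac OS X using a Terminal window. Suppose you want to run Dnapars in the background, taking its input data from a file called sequences.dat, writing its interactive output to a file called screenout, and using a file called input to hold the interactive input. The file input need only contain two lines:

sequences.dat
Y

These are what you would have typed to run the program interactively: the first line answers the program's request for a file name when it cannot find a file named infile, and the second line answers the menu.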

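To run the program in background, in Unix or Linux you would simply give the command:

dnapars < input > screenout &

This runs the program with its input responses coming from the file input and its interactive output going to the file screenout. The usual output file and tree file will also be created by this run (keep in mind that if you start any other PHYLIP program in the same directory while this one is running in background, the output file from one program may overwrite that from the other).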
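The same redirection can be set up from a script. What follows is only a rough Python sketch, not part of PHYLIP: the helper names `response_text` and `start_in_background` are made up for illustration, and it assumes that a `dnapars` executable can be found on the PATH.

```python
import subprocess


def response_text(responses):
    """Join menu responses into the text of a response file, one per line."""
    return "".join(f"{answer}\n" for answer in responses)


def start_in_background(program, response_path, screen_path):
    """Start `program` with redirected input and output, without waiting.

    Roughly equivalent to:  program < response_path > screen_path &
    """
    with open(response_path, "rb") as stdin, open(screen_path, "wb") as stdout:
        # The child process gets its own copies of both file handles,
        # so they can be closed here as soon as it has been started.
        return subprocess.Popen([program], stdin=stdin, stdout=stdout)


# Hypothetical usage, assuming dnapars is installed:
#   with open("input", "w") as handle:
#       handle.write(response_text(["sequences.dat", "Y"]))
#   job = start_in_background("dnapars", "input", "screenout")
#   job.wait()   # only if you later want to block until it has finished
```

Keeping the responses in a list makes it easy to see which line answers which prompt before the file is written.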

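If you want to give the program lower priority, so that it does not interfere with other work, and you have Berkeley Unix type job control facilities in your Unix or Linux system (and you usually do), you can use the nice command:

nice +10 dnapars < input > screenout &

which lowers the priority of the run. To also time the run and put the timing information at the end of the file screenout, you can do this:

nice +10 ( time dnapars < input ) >& screenout &

which I will not attempt to explain.

On Unix or Linux you may also want to try redirecting the interactive output to the null file /dev/null so as not to be bothered with it (but then you cannot look at it to see what went wrong). If you find that the output files are too large to create, you may want to try turning off some options in the programs you run.

If you are doing several runs in one, as for example when you do a bootstrap analysis using Seqboot, Dnapars (say), and Consense, you can use an editor to create a command file containing these commands:

seqboot < input1 > screenout
mv outfile infile
dnapars < input2 >> screenout
mv outtree intree
consense < input3 >> screenout

This is the Unix or Linux version. In the Windows version, renaming files and appending output to the file screenout are done differently.

On Unix or Linux systems the command file can have a name such as foofile; on Windows systems its name would be something like foofile.bat.

On Unix or Linux the command file must be given execute permission using the command chmod +x foofile followed by the command rehash. The job that foofile describes can then be run in background on a Unix or Linux system by giving the command

foofile &

On Windows systems, it can be run by clicking on the icon for the command file. Its icon will have a little gear symbol.

Note that you must also have prepared the interactive input commands for Seqboot (including the random number seed), Dnapars, and Consense in the separate files input1, input2, and input3. Note also that when PHYLIP programs attempt to open a new output file (such as outfile, outtree, or plotfile) and find that a file of that name already exists, they will ask you whether to overwrite it, write to another file, append the output to the end of that file, or quit without writing anything. This means that in writing command files it is important to know whether a prompt of this sort will appear. You must know in advance whether the file will exist. You may want to put into your command file a command that tests for the existence of an output file and removes it if it is there. On Unix, Linux or Mac OS X this could be: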

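if test -e fubarfile
then
rm fubarfile
fi

You might even want to put in a command that creates a file of that name, so that you can be sure it is there! Either way, you will then know whether to put into your file of responses an answer to the question about overwriting the existing output file.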
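The same housekeeping can be done from a script before a run. This is a minimal Python sketch under stated assumptions: the function name `clear_outputs` is hypothetical, and the default list holds only the three output file names mentioned above, so check the documentation of the program you run for the names it really uses.

```python
from pathlib import Path


def clear_outputs(directory, names=("outfile", "outtree", "plotfile")):
    """Delete leftover output files so that no overwrite prompt can appear.

    Returns the names that were actually removed.
    """
    removed = []
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

After calling it, the response file for that run does not need an answer to the overwrite question.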

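Preparing Input Files

The input files for PHYLIP programs must be prepared separately: there is no data editor within PHYLIP. You can use a word processor (or text editor) to prepare them yourself, or you can use a program that produces output in PHYLIP format.

Sequence alignment programs such as ClustalW commonly have an option to produce PHYLIP files as output, and some other phylogeny programs, such as MacClade and TreeView, are capable of producing a PHYLIP-format file.

It is very important that the input files be in "Text Only" or "ASCII" format. This means that they contain only printable ASCII/ISO characters, and no unprintable characters. Many word processors, such as Microsoft Word, save their files in a format that contains unprintable characters unless you tell them not to. In Microsoft Word and similar word processors, if you are editing a file for the first time, the Save command in the File menu will actually carry out a Save As command and ask you what format you want the file saved in.

  • If you are using Microsoft Word, choose Plain Text. A dialog box will appear (or, in the Mac OS X version of Word, an Option button) where you can select the US-ASCII option. Options starting with Western European should also work. Other encodings may not work.

  • If you are using WordPad, choose Text Document (*.txt), not Unicode Text Document.

  • If you are using Notepad, choose Text Document and also choose ANSI encoding, not Unicode or UTF8.

The next time you edit the file, the Save command should use these settings without asking you. If the programs in this package cannot read the input file you have produced, check that you have made these choices correctly by going to the Save As command in the File menu and making the right settings.

Text editors such as the vi and emacs editors on Unix and Linux, SimpleText on Mac OS, or the pico editor that comes with the pine mail program save their files in Text Only format and should not cause any trouble.

The format of the input files is discussed below, and you should also read the other PHYLIP documentation relevant to the particular type of data and the particular programs you want to use, as there will be more details there.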
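One way to check a file before handing it to the programs is to scan it for bytes outside the printable ASCII range. The following is a minimal Python sketch, not a PHYLIP tool; the name `find_unprintable` is made up, and the check is deliberately stricter than the text above (it also reports tabs and any non-ASCII ISO characters).

```python
def find_unprintable(data: bytes):
    """Return (line, column, byte value) for every byte that is not printable ASCII.

    Line ends are not reported; lines and columns are counted from 1.
    """
    problems = []
    for line_number, line in enumerate(data.splitlines(), start=1):
        for column, byte in enumerate(line, start=1):
            if not 32 <= byte <= 126:
                problems.append((line_number, column, byte))
    return problems


# Hypothetical usage:
#   with open("infile", "rb") as handle:
#       for line, column, byte in find_unprintable(handle.read()):
#           print(f"line {line}, column {column}: byte {byte:#04x}")
```

A file saved with a Unicode encoding typically shows up at once, because the byte order mark at the start of the file is reported on line 1.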

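Input and output files

For most of the PHYLIP programs, information comes from a series of input files and ends up in a series of output files (the diagram below was drawn in plain text by the original author and needs a monospaced font to display properly):

                  -------------------
                  |                 |
infile ---------> |                 |
                  |                 |
intree ---------> |                 | -----------> outfile
                  |                 |
weights --------> |     program     | -----------> outtree
                  |                 |
categories -----> |                 | -----------> plotfile
                  |                 |
fontfile -------> |                 |
                  |                 |
                  -------------------

The programs interact with the user by presenting a menu. Aside from the user's choices from the menu, they read all other input from files. These files have default names. The program will try to find a file of that name; if it does not, it will ask the user to supply a file name. Input data such as DNA sequences come from a file whose default name is infile. If the user supplies a tree, it is in a file whose default name is intree. Weights for the characters are in the file weights, and the tree drawing programs need digitized fonts, which are supplied in the file fontfile (all of these are default names).

For example, if Dnaml needs the file infile and does not find it, it prints the message:

dnaml: can't find input file "infile"
Please enter a new file name>

This simply asks you to type in the name of the input file.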

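Data file format

I have tried to adhere to a rather stereotyped input and output format. For the parsimony, compatibility and maximum likelihood programs, excluding the distance matrix methods, the simplest version of the input data file looks like this:

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the number of characters (in this case sites). These two fields are in free format, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include blanks and punctuation marks) and continuing with the characters for that species. The name must be on the same line as the first character of the data for that species. (I will use the term "species" for the tips of the trees, recognizing that in some cases these will actually be populations or individual gene sequences.)

The name should be ten characters long, filled out with blanks if it is shorter. Any printable ASCII/ISO character is allowed in the name, except for parentheses ("(" and ")"), square brackets ("[" and "]"), colon (":"), semicolon (";") and comma (","). If you forget to extend a name to ten characters with blanks, the program will get out of synchronization with the contents of the data file, and an error message will result.

Note that in these names a tab character counts as only one character. Including tabs can therefore cause trouble: the name may look to you as if it is ten characters long, but to the program it is not. If you use a word processor such as Word to prepare the data file, you are strongly advised to check that it contains no tabs. You can check by moving the cursor through the name with the arrow keys and seeing whether it suddenly jumps forward by two or more character positions. It is best to fill out names with blanks, not tabs.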
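The naming rules above are easy to enforce when a data file is generated by a script. This is a minimal Python sketch, assuming a made-up helper called `pad_name`; it rejects tabs and the punctuation listed above and pads with blanks.

```python
FORBIDDEN = "()[]:;,"


def pad_name(name, width=10):
    """Return `name` filled out with blanks to exactly `width` characters."""
    if "\t" in name:
        raise ValueError("species names must not contain tab characters")
    bad = sorted(set(name) & set(FORBIDDEN))
    if bad:
        raise ValueError(f"characters not allowed in a species name: {bad}")
    if len(name) > width:
        raise ValueError(f"species name is longer than {width} characters")
    return name.ljust(width)


# Hypothetical usage: build one line of a data file.
#   line = pad_name("Archaeopt") + "CGATGCTTAC CGC"
```

Raising an error for names that are too long, rather than silently truncating them, is a choice made for this sketch; truncation could make two species names collide.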

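In the discrete-character programs, DNA sequence programs and protein sequence programs, the characters are each a single letter or digit, sometimes separated by blanks. In the continuous-characters programs they are real numbers with decimal points, separated by blanks:

Latimeria 2.03 3.457 100.2 0.0 -3.7

The conventions for continuing the data beyond one line per species differ between the molecular sequence programs and the others. The molecular sequence programs can take the data in "aligned" or "interleaved" format, in which we first have some lines giving the first part of each of the sequences, then some lines giving the next part of each, and so on. Thus the sequences might look like this:

    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

Note that in these sequences there is a blank every ten sites to make them easier to read: any number of such blanks is allowed. The blank line that separates the two groups of lines (the ones containing sites 1-20 and the ones containing sites 21-39) may or may not be present. It is important that the number of sites in each group be the same for all species (that is, the programs will not run successfully if the line for the first species contains 20 bases but the line for the second species contains 21 bases).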
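To make the interleaved layout concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by PHYLIP: the name `read_interleaved` is hypothetical, blank lines are simply ignored, and only the two counts on the first line are used.

```python
def read_interleaved(text, name_width=10):
    """Read interleaved data into a dict mapping species name to sequence."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_species, n_sites = (int(field) for field in lines[0].split()[:2])
    body = lines[1:]
    if not body or len(body) % n_species != 0:
        raise ValueError("every group must have one line per species")
    # Only the first group of lines carries the ten-character names.
    names = [line[:name_width].strip() for line in body[:n_species]]
    sequences = [""] * n_species
    for start in range(0, len(body), n_species):
        group = body[start:start + n_species]
        if start == 0:
            group = [line[name_width:] for line in group]
        chunks = ["".join(line.split()) for line in group]   # drop the blanks
        if len({len(chunk) for chunk in chunks}) != 1:
            raise ValueError("species differ in the number of sites in a group")
        for index, chunk in enumerate(chunks):
            sequences[index] += chunk
    if len(sequences[0]) != n_sites:
        raise ValueError("number of sites does not match the first line")
    return dict(zip(names, sequences))
```

Applied to the 6-species example above, it would return the full 39-site sequence for each name, for instance the 20 sites of the first Archaeopt line followed by the 19 sites of the first line of the second group.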

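Alternatively, an option can be selected in the menu to have the program read the data in "sequential" format, with all of the data for the first species, then all of the characters for the second species, and so on. This is also the way that the discrete characters programs and the gene frequencies and quantitative characters programs read their data. They do not accept the interleaved format.

In the sequential format, the character data can run on to a new line at any point (except in the middle of a species name or, in the continuous character and distance matrix programs, in the middle of a real number). Thus it is legal to have:

Archaeopt 001100
1101

or even:

Archaeopt 
0011001101

Note, though, that the full ten characters of the species name must still be present: in the example above there must be a blank after the "t". In all cases it is possible to put blanks between any of the character values, so that

Archaeopt 0011001101 0111011100

is allowed.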
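The sequential rules can be sketched in the same way. This Python reader is only an illustration under stated assumptions (the name `read_sequential` is made up, and it handles single-character data, not real numbers): it takes the first ten characters of a species' first line as the name and then keeps reading lines until the announced number of characters has been collected.

```python
def read_sequential(text, name_width=10):
    """Read sequential-format character data into a dict of name -> characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_species, n_chars = (int(field) for field in lines[0].split()[:2])
    position = 1
    records = {}
    for _ in range(n_species):
        if position >= len(lines):
            raise ValueError("file ended before all species were read")
        first = lines[position]
        position += 1
        name = first[:name_width].strip()
        chars = "".join(first[name_width:].split())
        # Data may continue on following lines until n_chars have been read.
        while len(chars) < n_chars:
            if position >= len(lines):
                raise ValueError(f"too few characters for species {name!r}")
            chars += "".join(lines[position].split())
            position += 1
        if len(chars) != n_chars:
            raise ValueError(f"too many characters for species {name!r}")
        records[name] = chars
    return records
```

Because the reader counts characters rather than lines, the two wrapped layouts shown above give the same result as a single unbroken line.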

Note that you can convert molecular sequence data between the interleaved and the sequential data formats by using the Rewrite option of the J menu item in Seqboot.

If you make an error in the format of the input file, the programs can sometimes detect that they have been fed an illegal character or illegal numerical value and issue an error message such as BAD CHARACTER STATE:, often printing out the bad value, and sometimes the number of the species and character in which it occurred. The program will then stop shortly after. One of the things which can lead to a bad value is the omission of something earlier in the file, or the insertion of something superfluous, which cause the reading of the file to get out of synchronization. The program then starts reading things it didn't expect, and concludes that they are in error. So if you see this error message, you may also want to look for the earlier problem that may have led to the program becoming confused about what it is reading.

Some options are described below, but you should also read the documentation for the groups of the programs and for the individual programs.

The Menu

The menu is straightforward. It typically looks like this (this one is for Dnapars):

DNA parsimony algorithm, version 3.6

Setting for this run:
  U                 Search for best tree?  Yes
  S                        Search option?  More thorough search
  V              Number of trees to save?  10000
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  N           Use Transversion parsimony?  No, count all steps
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change

If you want to accept the default settings (they are shown in the above case) you can simply type Y followed by pressing on the Enter key. If you want to change any of the options, you should type the letter shown to the left of its entry in the menu. For example, to set a threshold type T. Lower-case letters will also work. For many of the options the program will ask for supplementary information, such as the value of the threshold.

Note the Terminal type entry, which you will find on all menus. It allows you to specify which type of terminal your screen is. The options are an IBM PC screen, an ANSI standard terminal, or none. Choosing zero (0) toggles among these three options in cyclical order, changing each time the 0 option is chosen. If one of them is right for your terminal the screen will be cleared before the menu is displayed. If none works, the none option should probably be chosen. The programs should start with a terminal option appropriate for your computer, but if they do not, you can change the terminal type manually. This is particularly important in program Retree where a tree is displayed on the screen - if the terminal type is set to the wrong value, the tree can look very strange.

The other numbered options control which information the program will display on your screen or on the output files. The option to Print indications of progress of run will show information such as the names of the species as they are successively added to the tree, and the progress of rearrangements. You will usually want to see these as reassurance that the program is running and to help you estimate how long it will take. But if you are running the program "in background" as can be done on multitasking and multiuser systems, and do not have the program running in its own window, you may want to turn this option off so that it does not disturb your use of the computer while the program is running. Note also menu option 3, "Print out tree". This can be useful when you are running many data sets, and will be using the resulting trees from the output tree file. It may be helpful to turn off the printing out of the trees in that case, particularly if those files would be too big.

The Output File

Most of the programs write their output onto a file called (usually) outfile, and a representation of the trees found onto a file called outtree.

The exact contents of the output file vary from program to program and also depend on which menu options you have selected. For many programs, if you select all possible output information, the output will consist of (1) the name of the program and its version number, (2) some of the input information printed out, and (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree. A typical rooted tree looks like this:

                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine

The interpretation of the tree is fairly straightforward: it "grows" from left to right. The numbers at the forks are arbitrary and are used (if present) merely to identify the forks. For many of the programs the tree produced is unrooted. Rooted and unrooted trees are printed in nearly the same form, but the unrooted ones are accompanied by the warning message:

remember: this is an unrooted tree!

to indicate that this is an unrooted tree and to warn against taking the position of its root too seriously. (Mathematicians still call an unrooted tree a tree, though some systematists unfortunately use the term "network" for an unrooted tree. This conflicts with standard mathematical usage, which reserves the name "network" for a completely different kind of graph). The root of this tree could be anywhere, say on the line leading immediately to Mouse. As an exercise, see if you can tell whether the following tree is or is not a different one from the above:

             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine

remember: this is an unrooted tree!

(it is not different). It is important also to realize that the lengths of the segments of the printed tree may not be significant: some may actually represent branches of zero length, in the sense that there is no evidence that those branches are nonzero in length. Some of the diagrams of trees attempt to print branches approximately proportional to estimated branch lengths, while in others the lengths are purely conventional and are presented just to make the topology visible. You will have to look closely at the documentation that accompanies each program to see what it presents and what is known about the lengths of the branches on the tree. The above tree attempts to represent branch lengths approximately in the diagram. But even in those cases, some of the smaller branches are likely to be artificially lengthened to make the tree topology clearer. Here is what a tree from Dnapars looks like, when no attempt is made to make the lengths of branches in the diagram proportional to estimated branch lengths:

                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine

remember: this is an unrooted tree!

When a tree has branch lengths, it will be accompanied by a table showing for each branch the numbers (or names) of the nodes at each end of the branch, and the length of that branch. For the first tree shown above, the corresponding table is:

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------

     1          Bovine            0.90216     (  0.50346,     1.30086) **
     1          Mouse             0.79240     (  0.42191,     1.16297) **
     1             2              0.48553     (  0.16602,     0.80496) **
     2             3              0.12113     (     zero,     0.24676) *
     3             4              0.04895     (     zero,     0.12668)
     4             5              0.07459     (  0.00735,     0.14180) **
     5          Human             0.10563     (  0.04234,     0.16889) **
     5          Chimp             0.17158     (  0.09765,     0.24553) **
     4          Gorilla           0.15266     (  0.07468,     0.23069) **
     3          Orang             0.30368     (  0.18735,     0.41999) **
     2          Gibbon            0.33636     (  0.19264,     0.48009) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01

Ignoring the asterisks and the approximate confidence limits, which will be described in the documentation file for Dnaml, we can see that the table gives a more precise idea of what the lengths of all the branches are. Similar tables exist in distance matrix and likelihood programs, as well as in the parsimony programs Dnapars and Pars.

Some of the parsimony programs in the package can print out a table of the number of steps that different characters (or sites) require on the tree. This table may not be obvious at first. A typical example looks like this:

 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1

The numbers across the top and down the side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 1 step in this case.
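The lookup rule can be written down in a few lines. This is only an illustrative Python sketch: the names `table_position`, `steps_for_site` and `EXAMPLE_ROWS` are made up, and the rows are typed in by hand from the example table above.

```python
def table_position(site):
    """Return (row label, column heading) of a site in the steps table."""
    tens, units = divmod(site, 10)
    return tens * 10, units


# Rows of the example table, keyed by row label; there is no site 0.
EXAMPLE_ROWS = {
    0:  [None, 2, 2, 2, 2, 1, 1, 2, 2, 1],
    10: [1, 2, 3, 1, 1, 1, 1, 1, 1, 2],
    20: [1, 2, 2, 1, 2, 2, 1, 1, 1, 2],
    30: [1, 2, 1, 1, 1, 2, 1, 3, 1, 1],
    40: [1],
}


def steps_for_site(rows, site):
    """Look up the number of steps that a site requires."""
    row_label, column = table_position(site)
    return rows[row_label][column]
```

For site 23 this picks row "20" and column "3", as described above.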

There are many other kinds of information that can appear in the output file. They vary from program to program, and we leave their description to the documentation files for the specific programs.

The Tree File

In output from most programs, a representation of the tree is also written into the tree file outtree. The tree is specified by nested pairs of parentheses, enclosing names and separated by commas. We will describe how this works below. If there are any blanks in the names, these must be replaced by the underscore character "_". Trailing blanks in the name may be omitted. The pattern of the parentheses indicates the pattern of the tree by having each pair of parentheses enclose all the members of a monophyletic group. The tree file could look like this:

((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));

In this tree the first fork separates the lineage leading to Mouse and Bovine from the lineage leading to the rest. Within the latter group there is a fork separating Gibbon from the rest, and so on. The entire tree is enclosed in an outermost pair of parentheses. The tree ends with a semicolon. In some programs such as Dnaml, Fitch, and Contml, the tree will be unrooted. An unrooted tree should have its bottommost fork have a three-way split, with three groups separated by two commas:

(A,(B,(C,D)),(E,F));

Here the three groups at the bottom node are A, (B,C,D), and (E,F). The single three-way split corresponds to one of the interior nodes of the unrooted tree (it can be any interior node of the tree). The remaining forks are encountered as you move out from that first node. Some of the newer programs are able to tolerate these other forks being multifurcations (multi-way splits). You should check the documentation files for the particular programs you are using to see which of these forms the user tree is expected to be in. Note that many of the programs that actually estimate an unrooted tree (such as Dnapars) produce trees in the treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward having all programs be able to read all trees, whether rooted or unrooted, multifurcating or bifurcating, and having them do the right thing with them. But this is a long-term goal and it is not yet achieved.
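Because the placement of the root is arbitrary, two tree files can look different and still describe the same unrooted tree. One way to check is to compare the sets of splits (bipartitions of the species) that the branches induce. The following Python sketch is an assumption-laden illustration, not PHYLIP code (Treedist is the package's own tool for comparing trees); all function names are made up, and branch lengths are read but ignored.

```python
def parse_topology(newick):
    """Parse a Newick string into nested tuples of leaf names."""
    s = "".join(newick.split())
    if not s.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0

    def token():
        nonlocal pos
        start = pos
        while s[pos] not in "(),:;":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            token()                      # optional internal label, ignored
            result = tuple(children)
        else:
            result = token()
        if s[pos] == ":":                # optional branch length, ignored
            pos += 1
            token()
        return result

    return node()


def leaf_names(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_names(child) for child in tree))


def splits(tree):
    """Nontrivial splits, each stored as the side without the reference leaf."""
    everything = leaf_names(tree)
    reference = min(everything)
    found = set()

    def walk(node):
        below = leaf_names(node)
        side = everything - below if reference in below else below
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        if not isinstance(node, str):
            for child in node:
                walk(child)

    walk(tree)
    return found


def same_unrooted_tree(newick_a, newick_b):
    a, b = parse_topology(newick_a), parse_topology(newick_b)
    return leaf_names(a) == leaf_names(b) and splits(a) == splits(b)
```

With these definitions, the unrooted example (A,(B,(C,D)),(E,F)); and a rooted version such as ((A,(B,(C,D))),(E,F)); compare as the same unrooted tree, since both give the splits {C,D}, {B,C,D} and {E,F}.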

For programs that infer branch lengths, these are given in the trees in the tree file as real numbers following a colon, and placed immediately after the group descended from that branch. Here is a typical tree with branch lengths:

((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);

Note that the tree may continue to a new line at any time except in the middle of a name or the middle of a branch length, although in trees written to the tree file this will only be done after a comma.
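As an illustration of how the names, colons and branch lengths fit together, here is a minimal Python sketch of a reader for such a tree. It is not the parser used by the programs: the names `parse_newick` and `leaf_lengths` are hypothetical, and the dictionary layout of a node is simply a convenient choice for this example.

```python
def parse_newick(text):
    """Parse a Newick tree into nested dicts with name, length and children."""
    s = "".join(text.split())      # line breaks between tokens carry no meaning
    if not s.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0

    def token():
        nonlocal pos
        start = pos
        while s[pos] not in "(),:;":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        name = token()
        length = None
        if s[pos] == ":":          # the branch length follows the group it belongs to
            pos += 1
            length = float(token())
        return {"name": name, "length": length, "children": children}

    root = node()
    if s[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_lengths(node, found=None):
    """Collect the branch length leading to each tip, keyed by species name."""
    found = {} if found is None else found
    if not node["children"]:
        found[node["name"]] = node["length"]
    for child in node["children"]:
        leaf_lengths(child, found)
    return found
```

Run on the tree above, `leaf_lengths` would map each of the eight species to the length of the branch leading to it, for example bear to 6.80041.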

These representations of trees are a subset of the standard adopted on 24 June 1986 at the annual meetings of the Society for the Study of Evolution by an informal committee (its final session in Newick's lobster restaurant - hence its name, the Newick standard) consisting of Wayne Maddison (author of MacClade), David Swofford (PAUP), F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and the original PHYLIP tree drawing programs), James Archie, William H.E. Day, and me. This standard is a generalization of PHYLIP's format, itself based on a well-known representation of trees in terms of parenthesis patterns which is due to the famous mathematician Arthur Cayley, and which has been around for over a century. The standard is now employed by most phylogeny computer programs but unfortunately has yet to be described in a formal published description. Other descriptions by me and by Gary Olsen can be accessed using the Web at:

http://evolution.gs.washington.edu/phylip/newicktree.html

The Options and How To Invoke Them

Most of the programs allow various options that alter the amount of information the program is provided or what is done with the information. Options are selected in the menu.

Common options in the menu

A number of the options from the menu, the U (User tree), G (Global), J (Jumble), O (Outgroup), W (Weights), T (Threshold), M (multiple data sets), and the tree output options, are used so widely that it is best to discuss them in this document.

The U (User tree) option. This option toggles between the default setting, which allows the program to search for the best tree, and the User tree setting, which reads a tree or trees ("user trees") from the input tree file and evaluates them. The input tree file's default name is intree. In many cases the programs will also tolerate having the trees be preceded by a line giving the number of trees:

((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));

((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));

((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));

An initial line with the number of trees was formerly required, but this now can be omitted. Some programs require rooted trees, some unrooted trees, and some can handle multifurcating trees. You should read the documentation for the particular program to find out which it requires. Program Retree can be used to convert trees among these forms (on saving a tree from Retree, you are asked whether you want it to be rooted or unrooted).

In using the user tree option, check the pattern of parentheses carefully. The programs do not always detect whether the tree makes sense, and if it does not there will probably be a crash (hopefully, but not inevitably, with an error message indicating the nature of the problem). Trees written out by programs are typically in the proper form.

The G (Global) option. In the programs which construct trees (except for Neighbor, the "...penny" programs and Clique, and of course the "...move" programs where you construct the trees yourself), after all species have been added to the tree a rearrangements phase ensues. In most of these programs the rearrangements are automatically global, which in this case means that subtrees will be removed from the tree and put back on in all possible ways so as to have a better chance of finding a better tree. Since this can be time consuming (it roughly triples the time taken for a run) it is left as an option in some of the programs, specifically Contml, Fitch, and Dnaml. In these programs the G menu option toggles between the default of local rearrangement and global rearrangement. The rearrangements are explained more below.

The J (Jumble) option. In most of the tree construction programs (except for the "...penny" programs and Clique), the exact details of the search of different trees depend on the order of input of species. In these programs the J option enables you to tell the program to use a random number generator to choose the input order of species. This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. The seed should be an integer between 1 and 2^32-3 (which is 4,294,967,293), and should be of form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number (for example, in the upper limit given above, the last two digits are 93, which is of form 4n+1). Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the programs one can look for other, and better trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.
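The rule for seeds is simple arithmetic and can be checked before a run. This is a small Python sketch with made-up names (`MAX_SEED`, `is_valid_seed`, `next_valid_seed`); it tests the recommended 4n+1 form, which is stricter than the oddness check that the programs are described as making.

```python
MAX_SEED = 2**32 - 3          # 4,294,967,293, the upper limit quoted above


def is_valid_seed(seed):
    """True if the seed is in range and leaves remainder 1 when divided by 4."""
    return 1 <= seed <= MAX_SEED and seed % 4 == 1


def last_two_digits_rule(seed):
    """The shortcut from the text: 100 is a multiple of 4, so only the last
    two digits decide the remainder on division by 4."""
    return (seed % 100) % 4 == 1


def next_valid_seed(seed):
    """Smallest seed of form 4n+1 that is not below `seed`."""
    candidate = max(seed, 1)
    candidate += (1 - candidate) % 4
    if candidate > MAX_SEED:
        raise ValueError("no valid seed at or above this value")
    return candidate
```

For the upper limit itself, the last two digits are 93 = 4 x 23 + 1, so it passes both checks.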

The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run).

Some people have asked what are good values of the random number seed. The random number seed is used to start a process of choosing "random" (actually pseudorandom) numbers, which behave as if they were unpredictably randomly chosen between 0 and 2^32-1 (which is 4,294,967,295). You could put in the number 133 and find that the next random number was 221,381,825. As they are effectively unpredictable, there is no such thing as a choice that is better than any other, provided that the numbers are of the form 4n+1. However if you re-use a random number seed, the sequence of random numbers that result will be the same as before, resulting in exactly the same series of choices, which may not be what you want.

The O (Outgroup) option. This specifies which species is to have the root of the tree be on the line leading to it. For example, if the outgroup is a species "Mouse" then the root of the tree will be placed in the middle of the branch which is connected to this species, with Mouse branching off on one side of the root and the lineage leading to the rest of the tree on the other. This option is toggled on and off by choosing O in the menu (the alphabetic character O, not the digit 0). When it is on, the program will then prompt for the number of the outgroup (the species being taken in the numerical order that they occur in the input file). Responding by typing 6 and then an Enter character indicates that the sixth species in the data (the 6th in the first set of data if there are multiple data sets) is taken as the outgroup. Outgroup-rooting will not be attempted if the data have already established a root for the tree from some other consideration, and may not be if it is a user-defined tree, despite your invoking the option. Thus programs such as Dollop that produce only rooted trees do not allow the Outgroup option. It is also not available in Kitsch, Dnamlk, or Clique. When it is used, the tree as printed out is still listed as being an unrooted tree, though the outgroup is connected to the bottommost node so that it is easy to visually convert the tree into rooted form.

The T (Threshold) option. This sets a threshold for the parsimony programs such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (in which case the steps will simply be counted). The T menu option toggles on and off asking the user to supply a threshold. The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described in my 1981b paper. When the T option is in force, the program will prompt for the numerical threshold value. This will be a positive real number greater than 1. In programs Mix, Move, Penny, Protpars, Dnapars, Dnamove, and Dnapenny, do not use threshold values less than or equal to 1.0, as they have no meaning and lead to a tree which depends only on considerations such as the input order of species and not at all on the character state data! In programs Dollop, Dolmove, and Dolpenny the threshold should never be 0.0 or less, for the same reason. The T option is an important and underutilized one: it is, for example, the only way in this package (except for program Dnacomp) to do a compatibility analysis when there are missing data. It is a method of de-weighting characters that evolve rapidly. I wish more people were aware of its properties.
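The effect of the threshold on the score of a tree can be illustrated with a few lines of Python. This is only a sketch of the capping rule described above, not the code used in the parsimony programs; the function name thresholded_steps and the example step counts are invented for illustration.

```python
def thresholded_steps(steps_per_character, threshold=float("inf")):
    """Total steps on a tree, counting each character as at most `threshold` steps."""
    return sum(min(steps, threshold) for steps in steps_per_character)


if __name__ == "__main__":
    steps = [1, 2, 5, 9]  # hypothetical step counts for four characters
    print(thresholded_steps(steps))     # no threshold: plain parsimony count
    print(thresholded_steps(steps, 3))  # rapidly evolving characters are capped at 3
```

With the default (infinite) threshold the four characters contribute 17 steps; with a threshold of 3 the last two characters are capped and the total is 9, so the rapidly evolving characters carry less weight.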

The M (Multiple data sets) option. In menu programs there is an M menu option which allows one to toggle on the multiple data sets option. The program will ask you how many data sets it should expect. The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:

    5    6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
    5    6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT
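To make the layout of such a file concrete, here is a small Python sketch that reads a file holding several data sets one after another. It is not PHYLIP's own reader: the function name read_data_sets is hypothetical, and it assumes (as a simplification) that species names contain no blanks and are separated from the sequence by whitespace.

```python
EXAMPLE = """\
5 6
Alpha CCACCA
Beta CCAAAA
Gamma CAACCA
Delta AACAAC
Epsilon AACCCA
5 6
Alpha CACACA
Beta CCAACC
Gamma CAACAC
Delta GCCTGG
Epsilon TGCAAT
"""


def read_data_sets(text, n_sets):
    """Return a list of {name: sequence} dictionaries, one per data set."""
    lines = [line for line in text.splitlines() if line.strip()]
    position = 0
    data_sets = []
    for _ in range(n_sets):
        # Each data set starts with the number of species and of sites.
        n_species, n_sites = map(int, lines[position].split())
        position += 1
        data = {}
        for line in lines[position:position + n_species]:
            name, sequence = line.split(None, 1)
            sequence = "".join(sequence.split())  # drop blanks inside the sequence
            if len(sequence) != n_sites:
                raise ValueError(f"{name}: expected {n_sites} sites")
            data[name] = sequence
        position += n_species
        data_sets.append(data)
    return data_sets


if __name__ == "__main__":
    for number, data in enumerate(read_data_sets(EXAMPLE, 2), start=1):
        print(number, sorted(data))
```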

The main use of this option will be to allow all of the methods in these programs to be bootstrapped. Using the program Seqboot one can take any DNA, protein, restriction sites, gene frequency or binary character data set and make multiple data sets by bootstrapping. Trees can be produced for all of these using the M option. They will be written on the tree output file if that option is left in force. Then the program Consense can be used with that tree file as its input file. The result is a majority rule consensus tree which can be used to make confidence intervals. The present version of the package allows, with the use of Seqboot and Consense and the M option, bootstrapping of many of the methods in the package.

Programs Dnaml, Dnapars and Pars can also take multiple weights instead of multiple data sets. They can then do bootstrapping by reading in one data set, together with a file of weights that show how the characters (or sites) are reweighted in each bootstrap sample. Thus a site that is omitted in a bootstrap sample has effectively been given weight 0, while a site that has been duplicated has effectively been given weight 2. Seqboot has a menu selection to produce the file of weights information automatically, instead of producing a file of multiple data sets. It can be renamed and used as the input weights file.
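The relation between a bootstrap sample and a set of weights can be sketched in Python as follows. This is an illustration of the idea only, not Seqboot's algorithm or output format; the function name bootstrap_weights is hypothetical and Python's own random number generator is used.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement and return how often each site was drawn."""
    counts = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [counts[site] for site in range(n_sites)]


if __name__ == "__main__":
    weights = bootstrap_weights(20, random.Random(133))
    # A weight of 0 marks an omitted site, 2 a site that was drawn twice, and so on.
    print(weights, sum(weights))
```

Whatever the seed, the weights of one bootstrap sample always add up to the number of sites.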

The W (Weights) option. This signals the program that, in addition to the data set, you want to read in a series of weights that tell how many times each character is to be counted. If the weight for a character is zero (0) then that character is in effect to be omitted when the tree is evaluated. If it is one (1) the character is to be counted once. Some programs allow weights greater than 1 as well. These have the effect that the character is counted as if it were present that many times, so that a weight of 4 means that the character is counted 4 times. The values 0-9 give weights 0 through 9, and the values A-Z give weights 10 through 35. By use of the weights we can give overwhelming weight to some characters, and drop others from the analysis. In the molecular sequence programs only two values of the weights, 0 or 1, are allowed.

The weights are used to analyze subsets of the characters, and also can be used for resampling of the data as in bootstrap and jackknife resampling. For those programs that allow weights to be greater than 1, they can also be used to emphasize information from some characters more strongly than others. Of course, you must have some rationale for doing this.

The weights are provided as a sequence of digits. Thus they might be

10011111100010100011110001100

The weights are to be provided in an input file whose default name is weights. The weights in it are a simple string of digits. Blanks in the weightfile are skipped over and ignored, and the weights can continue to a new line. In programs such as Seqboot that can also output a file of weights, the input weights have a default file name of inweights, and the output file name has a default file name of outweights.
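The coding of the weights (0-9 for weights 0 through 9, A-Z for 10 through 35, blanks and line breaks ignored) can be made explicit with a short Python sketch. The function name parse_weights is invented for illustration, and this is not the routine the programs themselves use.

```python
import string

# Position in this string is the weight: "0" -> 0, ..., "9" -> 9, "A" -> 10, ..., "Z" -> 35.
WEIGHT_CHARS = string.digits + string.ascii_uppercase


def parse_weights(text):
    """Turn the contents of a weights file into a list of integer weights."""
    weights = []
    for character in text:
        if character.isspace():
            continue  # blanks and line breaks are skipped
        value = WEIGHT_CHARS.find(character)
        if value < 0:
            raise ValueError(f"not a valid weight character: {character!r}")
        weights.append(value)
    return weights


if __name__ == "__main__":
    print(parse_weights("10 0A\nZ"))
```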

Weights can be used to analyze different subsets of characters (by weighting the rest as zero). Alternatively, in the discrete characters programs they can be used to force a certain group to appear on the phylogeny (in effect confining consideration to only phylogenies containing that group). This is done by adding an imaginary character that has 1's for the members of the group, and 0's for all the other species. That imaginary character is then given the highest weight possible: the result will be that any phylogeny that does not contain that group will be penalized by such a heavy amount that it will not (except in the most unusual circumstances) be considered. Of course, the new character brings extra steps to the tree, but the number of these can be calculated in advance and subtracted out of the total when reporting the results. This use of weights is an important one, and one sadly ignored by many users who could profit from it. In the case of molecular sequences we cannot use weights this way, so that to force a given group to appear we have to add a large extra segment of sites to the molecule, with (say) A's for that group and C's for every other species.

The option to write out the trees into a tree file. This specifies that you want the program to write out the tree not only on its usual output, but also onto a file in nested-parenthesis notation (as described above). This option is sufficiently useful that it is turned on by default in all programs that allow it. You can optionally turn it off if you wish, by typing the appropriate number from the menu (it varies from program to program). This option is useful for creating tree files that can be directly read into the programs, including the consensus tree and tree distance programs, and the tree plotting programs.

The output tree file has a default name of outtree.

The 0 (terminal type) option. (This is the digit 0, not the alphabetic character O.) The program will default to one particular assumption about your terminal (ANSI in the case of Linux, Unix, or Mac OS X, and IBM PC in the case of Windows). You can alternatively select it to be either an IBM PC, or nothing. This affects the ability of the programs to clear the screen when they display their menus, and the graphics characters used to display trees in the programs Dnamove, Move, Dolmove, and Retree. In the case of Windows, the screen will clear properly with either the IBM PC or the ANSI settings, but the graphics characters needed by Move, Dnamove, Dolmove, or Retree will display correctly only with the IBM PC setting.

The Algorithm for Constructing Trees

All of the programs except Factor, Dnadist, Gendist, Dnainvar, Seqboot, Contrast, Retree, and the plotting and consensus tree programs act to construct an estimate of a phylogeny. Move, Dolmove, and Dnamove let you construct it yourself by hand. All of the rest but Neighbor, the "...penny" programs and Clique make use of a common approach involving additions and rearrangements. They are trying to minimize or maximize some quantity over the space of all possible evolutionary trees. Each program contains a part that, given the topology of the tree, evaluates the quantity that is being minimized or maximized. The straightforward approach would be to evaluate all possible tree topologies one after another and pick the one which, according to the criterion being used, is best. This would not be possible for more than a small number of species, since the number of possible tree topologies is enormous. A review of the literature on the counting of evolutionary trees will be found in one of my papers (Felsenstein, 1978a) and in my book (Felsenstein, 2004, chapter 3).

Since we cannot search all topologies, these programs are not guaranteed to always find the best tree, although they seem to do quite well in practice. The strategy they employ is as follows: the species are taken in the order in which they appear in the input file. The first two (in some programs the first three) are taken and a tree constructed containing only those. There is only one possible topology for this tree. Then the next species is taken, and we consider where it might be added to the tree. If the initial tree is (say) a rooted tree with two species and we want the resulting three-species tree to be a bifurcating tree, there are only three places where we could add the third species. Each of these is tried, and each time the resulting tree is evaluated according to the criterion. The best one is chosen to be the basis for further operations. Now we consider adding the fourth species, again at each of the five possible places that would result in a bifurcating tree. Again, the best of these is accepted. This is usually known as the Sequential Addition strategy.
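The counts mentioned above (three places for the third species, five for the fourth) follow a simple pattern for rooted bifurcating trees, and multiplying them shows why exhaustive search is hopeless. The Python sketch below illustrates this; the function names are invented for illustration.

```python
def places_to_add(tips):
    """Number of places where one more species can join a rooted bifurcating tree
    that already has `tips` species: every branch, plus the branch below the root."""
    return 2 * tips - 1


def rooted_topologies(n_species):
    """Number of distinct rooted bifurcating topologies for n_species species."""
    count = 1
    for tips in range(2, n_species):
        count *= places_to_add(tips)
    return count


if __name__ == "__main__":
    for n in (3, 4, 10, 20):
        print(n, rooted_topologies(n))
```

For 10 species there are already 34,459,425 rooted topologies, while sequential addition evaluates only 3 + 5 + ... + 17 candidate placements before any rearrangements are tried.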

Local rearrangements

The process continues in this manner, with one important exception. After each species is added, and before the next is added, a number of rearrangements of the tree are tried, in an effort to improve it. The algorithms move through the tree, making all possible local rearrangements of the tree. A local rearrangement involves an internal segment of the tree in the following manner. Each internal segment of the tree is of this form (where T1, T2, and T3 are subtrees - parts of the tree that can contain further forks and tips):

   T1      T2       T3
    \      /        /
     \    /        /
      \  /        /
       \/        /
        *       /
         *     /
          *   /
           * /
            *
            !
            !

the segment we are discussing being indicated by the asterisks. A local rearrangement consists of switching the subtrees T1 and T3 or T2 and T3, so as to obtain one of the following:

   T3      T2       T1         T1      T3       T2
    \      /        /           \      /        /
     \    /        /             \    /        /
      \  /        /               \  /        /
       \/        /                 \/        /
        \       /                   \       /
         \     /                     \     /
          \   /                       \   /
           \ /                         \ /
            !                           !
            !                           !
            !                           !

Each time a local rearrangement is successful in finding a better tree, the new arrangement is accepted. The phase of local rearrangements does not end until the program can traverse the entire tree, attempting local rearrangements, without finding any that improve the tree.
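The two alternatives produced by a local rearrangement can be written down very compactly if a tree is represented as nested pairs. The following Python sketch assumes that representation (it is not how the PHYLIP programs store trees), and the function name local_rearrangements is hypothetical.

```python
def local_rearrangements(segment):
    """Given ((T1, T2), T3), return the two trees obtained by switching
    T1 with T3 and T2 with T3 across the internal segment."""
    (t1, t2), t3 = segment
    return [((t3, t2), t1), ((t1, t3), t2)]


if __name__ == "__main__":
    tree = (("A", "B"), "C")  # T1 = A, T2 = B, T3 = C
    for alternative in local_rearrangements(tree):
        print(alternative)
```

The subtrees may themselves be nested pairs, so the same function applies at any internal segment of a larger tree.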

This strategy of adding species and making local rearrangements will look at about (n-1)x(2n-3) different topologies, though if rearrangements are frequently successful the number may be larger. I have been describing the strategy when rooted trees are being considered. For unrooted trees there is a precisely similar strategy, though the first tree constructed may be a three-species tree and the rearrangements may not start until after the addition of the fifth species.

These local rearrangements have come to be called Nearest Neighbor Interchanges (NNIs) in the phylogeny literature.

Though we are not guaranteed to have found the best tree topology, we are guaranteed that no nearby topology (i.e., none accessible by a single local rearrangement) is better. In this sense we have reached a local optimum of our criterion. Note that the whole process is dependent on the order in which the species are present in the input file. We can try to find a different and better solution by reordering the species in the input file and running the program again (or, more easily, by using the J option). If none of these attempts finds a better solution, then we have some indication that we may have found the best topology, though we can never be certain of this.

Note also that a new topology is never accepted unless it is better than the previous one, so that the rearrangement process can never fall into an endless loop. This is also the way ties in our criterion are resolved, namely by sticking with the tree found first. However, the tree construction programs other than Clique, Contml, Fitch, and Dnaml do keep a record of all trees found that are tied with the best one found. This gives you some immediate idea of which parts of the tree can be altered without affecting the quality of the result.

Global rearrangements

A feature of most of the programs, such as Protpars, Dnapars, Dnacomp, Dnaml, Dnamlk, Restml, Kitsch, Fitch, Contml, Mix, and Dollop, is "global" optimization of the tree. In four of these (Contml, Fitch, Dnaml and Dnamlk) this is an option, G. In the others it automatically applies. When it is present there is an additional stage to the search for the best tree. Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more species get "stuck" in a suboptimal region of the space of all possible trees. The use of global optimization results in approximately a tripling (3x) of the run-time, which is why I have left it as an option in some of the slower programs.

What PHYLIP calls "global" rearrangements are more properly called SPR (subtree pruning and regrafting) by Swofford et al. (1996) as distinct from the NNI (nearest neighbor interchange) rearrangements that PHYLIP also uses, and the TBR (tree bisection and reconnection) rearrangements that it does not use. My book (Felsenstein, 2004, chapter 4) contains a review of work on these and other rearrangements and search methods.

The programs doing global optimization print out a dot "." after each group is removed and re-added to the tree, to give the user some sign that the rearrangements are proceeding. A new line of dots is started whenever a new round of global rearrangements is started following an improvement in the tree. On the line before the dots are printed there is printed a bar of the form "!---------------!" to show how many dots to expect. The dots will not be printed out at a uniform rate, but the later dots, which represent removal of larger groups from the tree and trying them consequently in fewer places, will print out more quickly. With some compilers each row of dots may not be printed out until it is complete.

It should be noted that Penny, Dolpenny, Dnapenny and Clique use a more sophisticated strategy of "depth-first search" with a "branch and bound" search method that guarantees that all of the best trees will be found. In the case of Penny, Dolpenny and Dnapenny there can be a considerable sacrifice of computer time if the number of species is greater than about ten: it is a matter for you to consider whether it is worth it for you to guarantee finding all the most parsimonious trees, and that depends on how much free computer time you have! Clique finds all largest cliques, and does so without undue burning of computer time. Although all of these problems that have been investigated fall into the category of "NP-hard" problems that in effect do not have a rapid solution, the cases that cause this trouble for the largest-cliques algorithm in Clique apparently are not biologically realistic and do not occur in actual data.

Multiple jumbles

As just mentioned, for most of these programs the search depends on the order in which the species are entered into the tree. Using the J (Jumble) option you can supply a random number seed which will allow the program to put the species in a random order. Jumbling can be done multiple times. For example, if you tell the program to do it 10 times, it will go through the tree-building process 10 times, each with a different random order of adding species. It will keep a record of the trees tied for best over the whole process. In other words, it does not just record the best trees from each of the 10 runs, but records the best ones overall. Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees. In the terminology of Maddison (1991) it can find different "islands" of trees. The present algorithms do not guarantee us to find all trees in a given "island" from a single run, so multiple runs also help explore those "islands" that are found.
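The bookkeeping of multiple jumbles (keep the trees tied for best over all runs, not per run) can be sketched in Python. Everything here is an assumption made for illustration: jumble_search and search_once are hypothetical names, search_once stands for a single tree-building run that returns (score, tree) pairs with lower scores being better (as with parsimony), and Python's random module stands in for the programs' own random number generator.

```python
import random


def jumble_search(species, search_once, n_jumbles, seed):
    """Run search_once on n_jumbles random orders; return the best score and
    all distinct trees that tie for it across the whole set of runs."""
    rng = random.Random(seed)
    best_score = None
    best_trees = []
    for _ in range(n_jumbles):
        order = list(species)
        rng.shuffle(order)  # a new random input order for this run
        for score, tree in search_once(order):
            if best_score is None or score < best_score:
                best_score, best_trees = score, [tree]  # strictly better: start over
            elif score == best_score and tree not in best_trees:
                best_trees.append(tree)  # a new tree tied with the best
    return best_score, best_trees


if __name__ == "__main__":
    # Toy stand-in for a real search: two tied trees and one worse tree per run.
    def toy_search(order):
        return [(3, "x"), (3, "y"), (4, "z")]

    print(jumble_search(["A", "B", "C"], toy_search, 10, 133))
```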

Saving multiple tied trees

For the parsimony and compatibility programs, one can have a perfect tie between two or more trees. In these programs these trees are all saved. For the newer parsimony programs such as Dnapars and Pars, global rearrangement is carried out on all of these tied trees. This can be turned off in the menu.

For trees with criteria which are real numbers, such as the distance matrix programs Fitch and Kitsch, and the likelihood programs Dnaml, Dnamlk, Contml, and Restml, it is difficult to get an exact tie between trees. Consequently these programs save only the single best tree (even though the others may be only a tiny bit worse).

Strategy for finding the best tree

In practice, it is advisable to use the Jumble option and specify that it be done many times, so that many different orderings of the input species are evaluated. (This is usually not necessary when bootstrapping, though the programs will then default to doing it once to avoid artifacts caused by the order in which species are added to the tree.)

People who want a magic "black box" program whose results they do not have to question (or think about) often are upset that these programs give results that are dependent on the order in which the species are entered in the data. To me this property is an advantage, for it permits you to try different searches for better trees, simply by varying the input order of species. If you do not use the multiple Jumble option, but do multiple individual runs instead, you can easily decide which to pay most attention to - the one or ones that are best according to the criterion employed (for example, with parsimony, the one out of the runs that results in the tree with the fewest changes).

In practice, in a single run, it usually seems best to put species that are likely to be sources of confusion in the topology last, as by the time they are added the arrangement of the earlier species will have stabilized into a good configuration, and then the last few species will be fitted into that topology. There will be less chance this way of a poor initial topology that would affect all subsequent parts of the search. However, a variety of arrangements of the input order of species should be tried, as can be done if the J option is used, and no species should be kept in a fixed place in the order of input. Note that the results of the "...penny" programs and Clique are not sensitive to the input order of species, and Neighbor is only slightly sensitive to it, so that multiple Jumbling is not possible with those programs. Note also that with global search, which is standard in many programs and in others is an option, each group (including each individual species) will be removed and re-added in all possible positions, so that a species causing confusion will have more chance of moving to a new location than it would without global rearrangement.

A Warning on Interpreting Results

Probably the most important thing to keep in mind while running any of the parsimony or compatibility programs is not to overinterpret the result. Some users treat the set of most parsimonious trees as if it were a confidence interval. If a group appears in all of the most parsimonious trees then they treat it as well established. Unfortunately the confidence interval on phylogenies appears to be much larger than the set of all most parsimonious trees (Felsenstein, 1985b). Likewise, variation of result among different methods will not be a good indicator of the size of the confidence interval. Consider a simple data set in which, out of 100 binary characters, 51 recommend the unrooted tree ((A,B),(C,D)) and 49 the tree ((A,D),(B,C)). Many different methods will all give the same result on such a data set: they will estimate the tree as ((A,B),(C,D)). Nevertheless it is clear that the 51:49 margin by which this tree is favored is not statistically significantly different from 50:50. So consistency among different methods is a poor guide to statistical significance.
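The claim about the 51:49 margin can be checked with an exact binomial calculation. The Python sketch below makes the simplifying assumption that the 100 characters are independent and that each favors one of the two trees with probability 1/2 under the null hypothesis; the function name is invented for illustration.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for `successes` out of `trials` against a 50:50 split."""
    expected = trials / 2
    observed_deviation = abs(successes - expected)
    # Count all outcomes at least as far from 50:50 as the one observed.
    extreme = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - expected) >= observed_deviation
    )
    return extreme / 2**trials


if __name__ == "__main__":
    print(two_sided_binomial_p(51, 100))
```

For 51 out of 100 the p-value is roughly 0.92, nowhere near any conventional significance level, even though every method would report the same tree.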

General Comments on Adapting the Package to Different Computer Systems

In the sections following you will find instructions on how to adapt the programs to different computers and compilers. The programs should compile without alteration on most versions of C. They use the "malloc" library or "calloc" function to allocate memory so that the upper limits on how many species or how many sites or characters they can run is set by the system memory available to that memory-allocation function.

In the document file for each program, I have supplied a small input example, and the output it produces, to help you check whether the programs are running properly.

Compiling the programs

If you have not been able to get executables for PHYLIP, you should be able to make your own. This can be easy under Linux and Unix, but more difficult if you have a Macintosh or a Windows system. If you have the latter, we strongly recommend you download and use the Macintosh and Windows executables that we distribute. If you do that, you will not need to have any compiler or to do any compiling. I get a certain number of inquiries each year from confused users who are not sure what a compiler is but think they need one. After downloading the executables they contact me and complain that they did not find a compiler included in the package, and would I please e-mail them the compiler. What they really need to do is use the executables and forget about compiling them.

Some users may also need to compile the programs in order to modify them. The instructions below will help with this.

I will discuss how to compile PHYLIP using one of a number of widely-used compilers. After these I will comment on compiling PHYLIP on other, less widely-used systems.

Unix and Linux

For Unix and Linux (which is Unix in all important functional respects, if not in all legal respects) you must compile PHYLIP yourself. This is usually easy to do. Unix (and Linux) systems generally have a C compiler and have the make utility. We distribute with the PHYLIP source code a Unix-compatible Makefile. We use GNU's make utility, which might be installed on your system as "make" or as "gmake".

However, note that some popular Linux distributions do not include a C compiler in their default configuration. For example, in RedHat Linux version 8, the "Personal Workstation" installation that is the default does not include the C compiler or the X Windows libraries needed to compile PHYLIP. These are available, and can be loaded from the CDROMs in the distribution. The following instructions assume that you have the C compiler and X libraries. If you cannot easily configure your system to include them, you should look into using the RedHat RPM binary distribution, mentioned on the PHYLIP 3.6 web page.

As is mentioned below (under Macintoshes) the Mac OS X operating system is a Unix, and if the X windows windowing system is installed, these Unix instructions will work for it.

After you have finished unpacking the Documentation and Source Code archive, you will find that you have created a folder phylip-3.68 in which there are three folders, called exe, src, and doc. There is also an HTML web page, phylip.html. The exe folder will be empty, src contains the source code files, including the Makefile. Directory doc contains the documentation files.

Enter the src folder. Before you compile, you will want to look at the Makefile and see whether you want to alter the compilation command. We have the default C compiler flags set with no flags. If you have modified the programs, you might want to use the debugging flags "-g". On the other hand, if you are trying to make a fast executable using the GCC compiler, you may want to use the one which is "An optimized one for gcc". In either case, remove the "#" before that CFLAGS command, and place it before the CFLAGS command that was previously in use. There are careful instructions on this in the Makefile. Once you have set up the CFLAGS and DFLAGS statements to be the way you want, to compile all the programs just type:

make install

You will then see the compiling commands as they happen, with occasional warning messages. If these are warnings, rather than errors, they are not too serious. A typical warning would be like this:

dnaml.c:1204: warning: static declaration for re_move follows non-static

After a time the compiler will finish compiling. If you have done a make install the system will then move the executables into the exe folder and also save space by erasing all the relocatable object files that were produced in the process. You should be left with usable executables in the exe folder, and the src folder should be as before. To run the executables, go into the exe folder and type the program name (say dnaml, which you may or may not have to precede by a dot and a slash, ./). The names of the executables will be the same as the names of the C programs, but without the .c suffix. Thus dnaml.c compiles to make an executable called dnaml.

Our two tree-drawing programs, Drawgram and Drawtree, require an X Windows installation including the Athena Widgets. These are provided with most X Windows installations.

If you see messages that the compilation could not find "Xlib.h" and other, similar header files, this means that some parts of the X Windows development environment are not installed on your system, or are not installed in the default location. Similarly, if you get error messages saying that some files with "Xaw" in the name cannot be found, this means that the Athena Widgets are not installed on your system, or are not installed in the default location.

In either case, you will need to make sure that they are installed properly. If they are there but not found during the compile, change the DFLAGS and DLIBS variables in the Makefile to point to the locations of the header files and libraries, respectively.

Another possible issue is that the usual Linux C compiler is the Gnu GCC compiler. In some Linux systems it is not invoked by the command cc but by gcc. You would then need to edit the Makefile to reflect this (see below for comments on that process).

A typical Unix or Linux installation would put the directory phylip-3.68 in /usr/local. The name of the executables directory EXEDIR could be changed to be /usr/local/bin, so that the make install command puts the executables there. If the users have /usr/local/bin in their paths, the programs would be found when their names are typed. The font files font1 through font6 could also be placed there. A batch script containing the lines

ln -s /usr/local/bin/font1 font1
ln -s /usr/local/bin/font2 font2
ln -s /usr/local/bin/font3 font3
ln -s /usr/local/bin/font4 font4
ln -s /usr/local/bin/font5 font5
ln -s /usr/local/bin/font6 font6

could be used to establish links in the user's working directory so that Drawtree and Drawgram would find these font files when users type a name such as font1 when the program asks them for a font file name. The documentation web pages are in subdirectory doc of the main PHYLIP directory, except for one, phylip.html which is in the main PHYLIP directory. It has a table of all of the documentation pages, including this one. If users create a bookmark to that page it can be used to access all of the other documentation pages.

To compile just one program, such as Dnaml, type:

make dnaml

After this compilation, dnaml will be in the src subdirectory. So will some relocatable object code files that were used to create the executable. These have names ending in .o - they can safely be deleted.

If you have problems with the compilation command, you can edit the Makefile. It has careful explanations at its front of how you might want to do so. For example, you might want to change the C compiler name cc to the name of the Gnu C compiler, gcc. This can be done by removing the comment character # from the front of one line, and placing it at the front of a nearby line. How to do so should be clear from the material at the beginning of the Makefile. We have included sample lines for using the gcc compiler and for using the Cygwin Gnu C++ environment on Windows, as well as the default of cc.

We have encountered some problems with the Gnu C Compiler (gcc) on 64-bit Itanium processors when compiled with the -O3 optimization level, in our code for generating random numbers.

Some older C compilers (notably the Berkeley C compiler which is included free with some Sun systems) do not adhere to the ANSI C standard (because they were written before it was set down). They have trouble with the function prototypes which are in our programs. We have included an #ifndef preprocessor command to eliminate the problem, if you use the switch -DOLDC when compiling. Thus with these compilers you need only use this in your C flags (in the Makefile) and compilers such as Berkeley C will cause no trouble.

Parallel computers

As parallel computers become more common, the issue of how to compile PHYLIP for them has become more pressing. People have been compiling PHYLIP for vector machines and parallel machines for many years. We have not made a version for parallel machines because there is still no standard parallel programming environment on such machines (or rather, there are many standards, so that one cannot find one that makes a parallel execution version of PHYLIP widely distributable). However, symmetric multiprocessing using the MPI Message Passing Interface is spreading rapidly, and we will probably support it in future versions of PHYLIP.

Although the underlying algorithms of most programs, which treat sites independently, should be amenable to vector and parallel processors, there are details of the code which might best be changed. In certain of the programs (Dnaml, Dnamlk, Proml, Promlk) I have put a special comment statement next to the loops in the program where the program will spend most of its time, and which are the places most likely to benefit from parallelization. This comment statement is:

/* parallelize here */

In particular, within these innermost loops of the programs there are often scalar quantities that are used for temporary bookkeeping. These quantities (such as sum1, sum2, zz, z1, yy, y1, aa, bb, cc, sum, and denom in procedure makenewv of Dnaml, and similar quantities in procedure nuview) are there to minimize the number of array references. For vectorizing and parallelizing compilers it will be better to replace them by arrays so that processing can occur simultaneously.
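The idea of replacing scalar temporaries by arrays can be sketched as follows. This is not the code of makenewv or nuview: the function names, the arithmetic, and the caller-supplied scratch arrays are all hypothetical and serve only to show the shape of the change.

```c
#include <stddef.h>

/* Scalar temporaries: few array references, but every iteration
   reuses the same sum1 and sum2. */
double total_scalar(const double *a, const double *b,
                    const double *c, const double *d, size_t nsites)
{
  double sum1, sum2, total = 0.0;
  size_t i;

  for (i = 0; i < nsites; i++) {
    sum1 = a[i] * b[i];
    sum2 = c[i] * d[i];
    total += sum1 + sum2;
  }
  return total;
}

/* Array temporaries: each site writes its own slot of sum1 and sum2,
   so the first loop has no dependence between iterations. */
double total_arrays(const double *a, const double *b,
                    const double *c, const double *d,
                    double *sum1, double *sum2, size_t nsites)
{
  double total = 0.0;
  size_t i;

  /* parallelize here */
  for (i = 0; i < nsites; i++) {
    sum1[i] = a[i] * b[i];
    sum2[i] = c[i] * d[i];
  }
  /* The final reduction over sites is kept as a separate loop. */
  for (i = 0; i < nsites; i++)
    total += sum1[i] + sum2[i];
  return total;
}
```

Both functions return the same value. The second one costs extra memory for the two scratch arrays, which is the trade-off the paragraph above describes.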

If you succeed in making a parallel version of PHYLIP we would like to know how you did it. In particular, if you can prepare a web page which describes how to do it for your computer system, we would like to use material from it in our PHYLIP web pages. Please e-mail it to me. We hope to have a set of pages that give detailed instructions on how to make parallel versions of PHYLIP on various kinds of machines. Alternatively, if we were given your modified version of the program we might be able to figure out how to make modifications to our source code to allow users to compile the program in a way which makes those modifications.

Other computer systems

As you can see from the variety of different systems on which these programs have been successfully run, there are no serious incompatibility problems with most computer systems. PHYLIP in various past Pascal versions has also been compiled on 8080 and Z80 CP/M Systems, Apple II systems running UCSD Pascal, a variety of minicomputer systems such as DEC PDP-11's and HP 1000's, on 1970's era mainframes such as CDC Cyber systems, and so on. In a later era it was also compiled on IBM 370 mainframes, and of course on DOS and Windows systems and on Macintosh systems. We have gradually accumulated experience on a wider variety of C compilers. If you succeed in compiling the C version of PHYLIP on a different machine or a different compiler, I would like to hear the details so that I can consider including the instructions in a future version of this manual.

Frequently Asked Questions

This set of Frequently Asked Questions, and their answers, is from the PHYLIP web site. A more up-to-date version can be found there, at:

http://evolution.gs.washington.edu/phylip/faq.html

Problems that are encountered

"The program reads my data file and then says it has a memory allocation error!"

This is what tends to happen if there is a problem with the format of the data file, so that the programs get confused and think they need to set aside memory for 1,000,000 species or so. The result is a "memory allocation error" (the error message may say that "the function asked for an inappropriate amount of memory"). Check the data file format against the documentation: make sure that the data files have not been saved in the format of your word processor (such as Microsoft Word) but in a "flat ASCII" or "text only" mode. Note that adding memory to your computer is not the way to solve this problem -- you probably have plenty of memory to run the program once the data file is in the correct format.
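To see why a badly formatted file leads to an absurd memory request, consider a program that reads the two counts from the first line of the data file and allocates memory from them. The sketch below checks those counts before they are used. It is not PHYLIP's own input code: the function name and the sanity limit MAX_REASONABLE are hypothetical.

```c
#include <stdio.h>

#define MAX_REASONABLE 100000L  /* hypothetical sanity limit, not a PHYLIP constant */

/* Returns 1 and fills in the counts if the first line looks like a valid
   header (two positive integers within the limit), otherwise returns 0. */
int parse_header(const char *line, long *nspecies, long *nchars)
{
  long s, c;

  if (sscanf(line, "%ld %ld", &s, &c) != 2)
    return 0;                     /* not two numbers: probably not plain text */
  if (s < 1 || c < 1 || s > MAX_REASONABLE || c > MAX_REASONABLE)
    return 0;                     /* numbers present but implausible */
  *nspecies = s;
  *nchars = c;
  return 1;
}
```

A first line such as "   5   42" passes, while the opening bytes of a word processor file (for example a line starting with "{\rtf1") are rejected instead of being turned into a huge allocation.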

"One program makes an output file and then the next program crashes while reading it!"

Did you rename the file? If a program makes a file called outfile, and then the next program is told to use outfile as its input file, terrible things will happen. The second program first opens outfile as an output file, thus erasing it. When it then tries to read from this empty outfile a psychological crisis ensues. The solution is simply to rename outfile before trying to use it as an input file.

"Consense gives weird branch lengths! How do I get more reasonable ones?"

Consense gives branch lengths which are simply the numbers of replicates that support the branch. This is not a good reflection of how long those branches are estimated to be. The best way to put better branch lengths on a consensus tree is to use it as a User Tree in a program that will estimate branch lengths for it, such as Dnaml. You may need to convert it to being an unrooted tree, using Retree, first. If the original program you were using was a program that does not estimate branch lengths, you may instead have to use one that does. You can use a likelihood program, or make some distances between your species (using, for example, Dnadist) and use Fitch to put branch lengths on the user tree. Here is the sequence of steps you should go through:

  1. Take the tree and use Retree to make sure it is Unrooted (just read it into Retree and then save it, specifying Unrooted).

  2. Use the unrooted tree as a User Tree (option U) in one of our programs (such as Dnaml or Fitch). If you use Fitch, you also first need to use one of the distance programs such as Dnadist to compute a set of distances to serve as its input.

  3. Specify that the branch lengths of the tree are not to be used but should be re-estimated. This is actually the default.

"I looked at the tree printed in the output file outfile and it looked weird. Do I always need to look at it in Drawgram?"

It's possible you are using the wrong font for looking at the tree in the output file. The tree is drawn with dashes and exclamation points. If a proportional font such as Times Roman or Helvetica is used, the tree lines may not connect. Try selecting the whole tree and setting the font to a fixed-width one such as Courier. You may be astounded how much clearer the tree has become.

"DrawTree (or DrawGram) doesn't work: it can't find the font file!"

Six font files, called font1 through font6, are distributed with the executables (and with the source code too). The program looks for a copy of one of them called fontfile. If you haven't made such a copy called fontfile it then asks you for the name of the font file. If they are in the current folder, just type one of font1 through font6. The reason for having the program look for fontfile is so that you can copy your favorite font file, call the copy fontfile, and then it will be found automatically without you having to type the name of the font file each time.

"Can Drawgram draw a scale beside the tree? Print the branch lengths as numbers?"

It can't do either of these. Doing so would make the program more complex, and it is not obvious how to fit the branch length numbers into a tree that has many very short internal branches. If you want these scales or numbers, choose an output plot file format (such as Postscript, PICT or PCX) that can be read by a drawing program such as Adobe Illustrator, Freehand, Canvas, CorelDraw, or MacDraw. Then you can add the scales and branch length numbers yourself by hand. Note the menu option in Drawtree and Drawgram that specifies the tree size to be a given number of centimeters per unit branch length.

"How can I get Drawgram or Drawtree to print the bootstrap values next to the branches?"

When you do bootstrapping and use Consense, it prints the bootstrap values in its output file (both in a table of sets, and on the diagram of the tree which it makes). These are also in the output tree file of Consense. There they are in place of branch lengths. So to get them to be on the output of Drawgram or Drawtree, you must write the tree in the format of a drawing program and use it to put the values in by hand, as mentioned in the answer to the previous question.

"I have an HP laser printer and can't get DrawGram to print on it"

Drawgram and Drawtree produce a plot file (called plotfile): they do not send it to the printer. It is up to you to get the plot file to the printer. If you are running Windows this can probably be done with the Command tool and the command COPY/B PLOTFILE PRN:, unless your printer is a networked printer. The /B is important. If it is omitted the copy command will strip off the highest bit of each byte, which can cause the printing to fail or produce garbage.

"Dnaml won't read the treefile that is produced by Dnapars!"

That's because the Dnapars tree file is a rooted tree, and Dnaml wants an unrooted tree. Try using Retree to change the file to be an unrooted tree file. Our most recent versions of the programs usually automatically convert a rooted tree into an unrooted one as needed. But the programs such as Dnamlk or Dollop that need a rooted tree won't be able to use an unrooted tree.

"In bootstrapping, Seqboot makes too large a file"

If there are 1000 bootstrap replicates, it will make a file 1000 times as long as your original data set. But for many methods there is another way that uses much less file space. You can use Seqboot to make a file of multiple sets of weights, and use those together with the original data set to do bootstrapping.
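The reason a set of weights can stand in for a whole resampled data set is that a bootstrap replicate is fully described by how many times each original site was drawn. A minimal sketch of that idea follows. It is not Seqboot's code: the function name is hypothetical, it uses the C library's rand() rather than PHYLIP's own random number generator, and it does not write the weights in the format of a PHYLIP weights file.

```c
#include <stdlib.h>

/* Draw nsites sites with replacement and record in weights[i] how many
   times site i was chosen. Requires nsites > 0. */
void bootstrap_weights(int *weights, int nsites, unsigned int seed)
{
  int i;

  srand(seed);
  for (i = 0; i < nsites; i++)
    weights[i] = 0;
  for (i = 0; i < nsites; i++)
    weights[rand() % nsites]++;   /* slight modulo bias; acceptable in a sketch */
}
```

The weights of one replicate always add up to the number of sites, so each replicate needs only one small integer per site rather than a full copy of the data.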

"In bootstrapping, the output file gets too big."

When running a program such as Neighbor or Dnapars with multiple data sets (or multiple weights) for purposes of bootstrapping, the output file is usually not needed, as it is the output tree file that is used next. You can use the menu of the program to turn off the writing of trees into the output file. The trees will still be written into the output tree file.

"Why don't your programs correctly read the sequence alignment files produced by ClustalW?"

They do read them correctly if you make the right kind. Files from ClustalV or ClustalW whose names end in ".aln" are not in PHYLIP format, but in Clustal's own format which will not work in PHYLIP. You need to find the option to output PHYLIP format files, which ClustalW and ClustalV usually assign the extension .phy.

"Why doesn't Neighbor read my DNA sequences correctly?"

Because it wants to have as input a distance matrix, not sequences. You have to use Dnadist to make the distance matrix first.

How to make it do various things

"How do I bootstrap?"

The general method of bootstrapping involves running Seqboot to make multiple bootstrapped data sets out of your one data set, then running one of the tree-making programs with the Multiple data sets option to analyze them all, then running Consense to make a majority rule consensus tree from the resulting tree file. Read the documentation of Seqboot to get further information. Before, only parsimony methods could be bootstrapped. With this new system almost any of the tree-making methods in the package can be bootstrapped. It is somewhat more tedious but you will find it much more rewarding.

"How do I specify a multi-species outgroup with your parsimony programs?"

It's not a feature but is not too hard to do in many of the programs. In parsimony programs like Mix, for which the W (Weights) and A (Ancestral states) options are available, and weights can be larger than 1, all you need to do is:

(a) In Mix, make up an extra character with states 0 for all the outgroups and 1 for all the ingroups. If using Dnapars the ingroup can have (say) G and the outgroup A.

(b) Assign this character an enormous weight (such as Z for 35) using the W option, all other characters getting weight 1, or whatever weight they had before.

(c) If it is available, use the A (Ancestral states) option to designate that for that new character the state found in the outgroup is the ancestral state.

(d) In Mix do not use the O (Outgroup) option.

(e) After the tree is found, the designated ingroup should have been held together by the fake character. The tree will be rooted somewhere in the outgroup (the program may or may not have a preference for one place in the outgroup over another). Make sure that you subtract from the total number of steps on the tree all steps in the new character.

In programs like Dnapars, you cannot use this method as weights of sites cannot be greater than 1. But you can do an analogous trick, by adding a largish number of extra sites to the data, with one nucleotide state ("A") for the ingroup and another ("G") for the outgroup. You will then have to use Retree to manually reroot the tree in the desired place.
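The extra-sites trick for Dnapars can be sketched as a small helper that appends the marker sites to one sequence. The function name and its buffer-size argument are hypothetical; in a real data file the number of sites on the first line would also have to be increased by the same amount.

```c
#include <string.h>

/* Append nextra marker sites to the NUL-terminated sequence in buf:
   'A' for an ingroup species, 'G' for an outgroup species.
   Returns 1 on success, 0 if buf (of capacity bufsize) is too small. */
int append_marker_sites(char *buf, size_t bufsize, int is_ingroup, size_t nextra)
{
  size_t len = strlen(buf);
  char marker = is_ingroup ? 'A' : 'G';

  if (len + nextra + 1 > bufsize)
    return 0;
  memset(buf + len, marker, nextra);
  buf[len + nextra] = '\0';
  return 1;
}
```

Calling it once per species, with the same nextra each time, keeps all sequences the same length. How large nextra must be to hold the ingroup together depends on the data, so it is left to the caller.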

"How do I force certain groups to remain monophyletic in your parsimony programs?"

By the same method as in the previous question, using multiple fake characters, any number of groups of species can be forced to be monophyletic. In Move, Dolmove, and Dnamove you can specify whatever outgroups you want without going to this trouble.

"How can I reroot one of the trees written out by PHYLIP?"

Use the program Retree. But keep in mind whether the tree inferred by the original program was already rooted, or whether you are free to reroot it without changing its meaning.

"What do I do about deletions and insertions in my sequences?"

The molecular sequence programs will accept sequences that have gaps (the "-" character). They do various things with them, mostly not optimal. Programs such as Dnaml and Dnadist count gaps as equivalent to unknown nucleotides (or unknown amino acids) on the grounds that we don't know what would be there if something were there. This completely leaves out the information from the presence or absence of the gap itself, but does not bias the gapped sequence to be close to or far from other gapped or ungapped sequences. Sequences that share a gap at a site do not tend to cluster together on the tree. So it is not necessary to remove gapped regions from your sequences, unless the presence of gaps indicates that the region is badly aligned. An exception to this is Dnapars, which counts "gap" as if it were a fifth nucleotide state (in addition to A, C, G, and T). Each site counts one change when a gap arises or disappears. The disadvantage of this treatment is that a long gap will be overweighted, with one event per gapped site. So a gap of 10 nucleotides will count as being as much evidence as 10 single site nucleotide substitutions. If there are not overlapping gaps, one way to correct this is to recode the first site in the gap as "-" but make all the others be "?" so the gap only counts as one event.
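The recoding suggested at the end of the paragraph above can be sketched as follows. The function name is hypothetical, and the sketch handles a single sequence in place. It assumes, as the text does, that gaps in different sequences do not overlap.

```c
/* Recode each run of gap characters in place: the first '-' of a run is
   kept, and every following '-' in the same run becomes '?'. */
void recode_gaps(char *seq)
{
  int in_gap = 0;

  for (; *seq != '\0'; seq++) {
    if (*seq == '-') {
      if (in_gap)
        *seq = '?';   /* later sites of the gap become "unknown" */
      in_gap = 1;
    } else {
      in_gap = 0;
    }
  }
}
```

For example, the sequence AC---GT-A becomes AC-??GT-A, so that in Dnapars the three-site gap counts as one event rather than three.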

"How can I produce distances for my data set which has 0's and 1's?"

You can't do it in a simple and general way, for a straightforward reason. Distance methods must correct the distances for superimposed changes. Unless we know specifically how to do this for your particular characters, we cannot accomplish the correction. There are many formulas we could use, but we can't choose among them without much more information. There are issues of superimposed changes, as well as heterogeneity of rates of change in different characters. Thus we have not provided a distance program for 0/1 data. It is up to you to figure out what is an appropriate stochastic model for your data and to find the right distance formulas.

"I have RFLP fragment data: which programs should I use?"

This is a more difficult question than you may imagine. Here is a quick tour of the issues:

  • You can code fragments as 0 and 1 and use a parsimony program. It is not obvious in advance whether 0 or 1 is ancestral, though it is likely that change in one direction is more likely than change in the other for each fragment. One can use either Wagner parsimony (programs Pars, Mix, Penny or Move) or use Dollo parsimony (Dollop, Dolpenny or Dolmove) with the ancestral states all set as unknown ("?").

  • You can use a distance matrix method using the RFLP distance of Nei and Li (1979). Their restriction fragment distance is available in our program Restdist.

  • You should be very hesitant to bootstrap RFLP's. The individual fragments do not evolve independently: a single nucleotide substitution can eliminate one fragment and create two (or vice versa).

For restriction sites (rather than fragments) life is a bit easier: they evolve nearly independently so bootstrapping is possible and Restml can be used, as well as restriction sites distances computed in Restdist. Also directionality of change is less ambiguous when parsimony is used. A more complete tour of the issues for restriction sites and restriction fragments is given in chapter 15 of my book (Felsenstein, 2004).

"Why don't your parsimony programs print out branch lengths?"

Well, Dnapars and Pars can. The others have not yet been upgraded to the same level. The longer answer is that it is because there are problems defining the branch lengths. If you look closely at the reconstructions of the states of the hypothetical ancestral nodes for almost any data set and almost any parsimony method you will find some ambiguous states on those nodes. There is then usually an ambiguity as to which branch the change is actually on. Other parsimony programs resolve this in one or another arbitrary fashion, sometimes with the user specifying how (for example, methods that push the changes up the tree as far as possible or down it as far as possible). Our older programs leave it to the user to do this. In Dnapars and Pars we use an algorithm discovered by Hochbaum and Pathria (1997) (and independently by Wayne Maddison) to compute branch lengths that average over all possible placements of the changes. But these branch lengths, as nice as they are, do not correct for multiple superimposed changes. Few programs available from others currently correct the branch lengths for multiple changes of state that may have overlain each other. One possible way to get branch lengths with nucleotide sequence data is to take the tree topology that you got, use Retree to convert it to be unrooted, prepare a distance matrix from your data using Dnadist, and then use Fitch with that tree as User Tree and see what branch lengths it estimates.

"Why can't your programs handle unordered multistate characters?"

In this 3.6 release there is a program Pars which does parsimony for unordered multistate characters with up to 8 states, plus ?. The other discrete characters parsimony programs can only handle two states, 0 and 1. This is mostly because I have not yet had time to modify them to do so - the modifications would have to be extensive. Ultimately I hope to get these done. If you have four or fewer states and need a feature that is not in Pars, you could recode your states to look like nucleotides and use the parsimony programs in the molecular sequence section of PHYLIP, or you could use one of the excellent parsimony programs produced by others.

Background information needed:

"What file format do I use for the sequences?"
"How do I use the programs? I can't find any documentation!"

These are discussed in the documentation files. Do you have them? If you have a copy of this page you probably do. They may be in a separate archive from the executables (in which case they are in the Documentation and Sources archives, which you should definitely fetch). Input file formats are discussed in main.html, in sequence.html, distance.html, contchar.html, discrete.html, and the documentation files for the individual programs.

Questions about distribution and citation:

"If I copied PHYLIP from a friend without you knowing, should I try to keep you from finding out?"

No. It is to your advantage and mine for you to let me know. If you did not get PHYLIP "officially" from me or from someone authorized by me, but copied a friend's version, you are not in my database of users. You may also have an old version which has since been substantially improved. I don't mind you "bootlegging" PHYLIP (it's free anyway), but you should realize that you may have copied an outdated version. If you are reading this Web page, you can get the latest version just as quickly over the Internet. It will help both of us if you get onto my mailing list. If you are on it, then I will give your name to other nearby users when they ask for the names of nearby users, and they are urged to contact you and update your copy. (I benefit by getting a better feel for how many distributions there have been, and having a better mailing list to use to give other users local people to contact). Use the registration form which can be accessed through our web site's registration page.

"Can I make copies of PHYLIP available to the students in my class?"

Generally, yes. Read the Copyright notice near the front of this main documentation page. If you charge money for PHYLIP, other than a minimal charge to cover cost of distribution, or you use it in a service for which you charge money, you will need to negotiate a royalty. But you can make it freely available and you do not need to get any special permission from us to do so.

Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...

... allow the options to be set on the command line?"

We could in Unix and Linux, or somewhat differently in Windows. But there are so many options that this would be difficult, especially when the options require additional information to be supplied such as rates of evolution for many categories of sites. You may be asking this question because you want to automate the operation of PHYLIP programs using batch files (command files) to run in background. If that is the issue, see the section of this main documentation page on "Running the programs in background or under control of a command file". It explains how to set the options using input redirection and a file that has the menu responses as keystrokes.

... include in the package a program to do the Distance Wagner method, (or successive approximations character weighting)?"

In most cases where I have not included other methods, it is because I decided that they had no substantial advantages over methods that were included (such as the programs Fitch, Kitsch, Neighbor, the T option of Mix and Dollop, and the "?" ancestral states option of the discrete characters parsimony programs).

... include in the package ordination methods and more clustering algorithms?"

Because this is not a clustering package, it's a package for phylogeny estimation. Those are different tasks with different objectives and mostly different methods. Mary Kuhner and Jon Yamato have, however, included in Neighbor an option for UPGMA clustering, which will be very similar to Kitsch in results.

... include in the package a program to do nucleotide sequence alignment?"

Well, yes, I should have, and this is scheduled to be in future releases. But multiple sequence alignment programs, in the era after Sankoff, Morel, and Cedergren's 1973 classic paper, need to use substantial computer horsepower to estimate the alignment and the tree together (but see Karl Nicholas's program GeneDoc or Ward Wheeler and David Gladstein's MALIGN, as well as more approximate methods of tree-based alignment used in ClustalW, TreeAlign, or POY).

New Features in This Version

Version 3.6 has many new features:

  • Faster (well, less slow) likelihood programs.

  • The DNA and protein likelihood and distance programs allow for rate variation between sites using a gamma distribution of rates among sites, or using a gamma distribution plus a given fraction of sites which are assumed invariant.

  • A new multistate discrete characters parsimony program, Pars, that handles unordered multistate characters.

  • The Dnapars and Pars parsimony programs can infer multifurcating trees, which sensibly reduces the number of tied trees they find.

  • A new protein sequence likelihood program, Proml, and also a version, Promlk, which assumes a molecular clock.

  • A new restriction sites and restriction fragments distance program, Restdist, that can also be used to compute distances for RAPD and AFLP data. It also allows for gamma-distributed rate variation among DNA sites.

  • In the DNA likelihood programs, you can now specify different categories of rates of change (such as rates for first, second, and third positions of a coding sequence) and assign them to specific sites. This is in addition to the ability of the program to use the Hidden Markov Model mechanism to allow rates of change to vary across sites in a way that does not ask you to assign which rate goes with which site.

  • The input files for many of the programs are now simpler, in that they do not contain options information such as specification of weights and categories. That information is now provided in separate files with default names such as weights and categories.

  • The DNA likelihood programs can now evaluate multifurcating user trees (option U).

  • All programs that read in user-defined trees now do so from a separate file, whose default name is intree, rather than requiring them to be in the input file as before.

  • The DNA likelihood programs can infer the sequence at ancestral nodes in the interior of the tree.

  • Dnapars can now do transversion parsimony.

  • The bootstrapping program Seqboot can now, instead of producing a large file containing multiple data sets, be asked to produce a weights file with multiple sets of weights. Many programs in this release can analyze those multiple weights together with the original data set, which saves disk space.

  • The bootstrapping program Seqboot can pass weights and categories information through to a multiple weights file or a multiple categories file.

  • Seqboot can also convert sequence files from Interleaved to Sequential form, or back.

  • Seqboot can convert a PHYLIP molecular sequences or discrete characters morphology data file into the NEXUS format, which is used by a number of other phylogeny programs such as MacClade, MrBayes and PAUP*.

  • Seqboot can also carry out a number of different methods of permuting the order of characters in a data set. This could be used to carry out the Incongruence Length Difference (or Partition Homogeneity) method of testing homogeneity of data sets.

  • Seqboot can also write a sequence data file into one version of an XML format for sequence alignments, for use by programs that need XML input (none of the current PHYLIP programs can yet use this format, but it may be useful in the future).

  • Retree can now write a tree out into a preliminary version of a new XML tree file format which is in the process of being defined.

  • The Kishino-Hasegawa-Templeton (KHT) test which compares user-defined trees (option U) is now joined by the Shimodaira-Hasegawa (SH) test (Shimodaira and Hasegawa, 1999) which corrects for comparisons among multiple tests. This avoids a statistical problem with multiple user trees.

  • Contrast can now carry out an analysis that takes into account within-species variation, according to a model similar (but not identical) to that introduced by Michael Lynch (1990). This enables analysis of individuals sampled from the species, in a way that properly takes sampling error into account.

  • A new program, Treedist, computes the Robinson-Foulds symmetric difference distance among trees. This measures the number of branches in the trees that are present in one but not the other. It also can compute the Branch Score distance defined by Kuhner and Felsenstein (1994) which takes branch lengths into account.

  • Fitch and Kitsch now have an option to make trees by the minimum evolution distance matrix method.

  • The protein parsimony program Protpars now allows you to choose among a number of different genetic codes such as mitochondrial codes.

  • The consensus tree program Consense can compute the Ml family of consensus tree methods, which generalize the Majority Rule consensus tree method. It can also compute our extended Majority Rule consensus (which is Majority Rule with some additional groups added to resolve the tree more completely), and it can also compute the original Majority Rule consensus tree method which does not add these extra groups. It can also compute the Strict consensus.

  • The tree-drawing programs Drawgram and Drawtree have a number of new options of kinds of file they can produce, including Windows Bitmap files, files for the Idraw and FIG X windows drawing programs, the POV ray-tracer, and even VRML Virtual Reality Markup Language files that will enable you to wander around the tree using a VRML plugin for your browser, such as Cosmo Player or Cortona.

  • Drawtree now uses my new Equal Daylight Algorithm to draw unrooted trees. This gives a much better-looking tree. Of course, competing programs such as TREEVIEW and PAUP draw trees that look just as good - because they too have started to use my method (with my encouragement). Drawtree also can use another algorithm, the n-body method.

  • The tree-drawing programs can now produce trees across multiple pages, which is handy for looking at trees with very large numbers of tips, and for producing giant diagrams by pasting together multiple sheets of paper.

There are many more, lesser features added as well.
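As an illustration of the symmetric difference distance mentioned for Treedist in the list above, here is a minimal sketch. It is not Treedist's code. It assumes a hypothetical representation in which each internal branch of a tree is stored as a bit mask of the species on one side of it, normalized so that the side not containing species 0 is the one stored.

```c
#include <stddef.h>

/* Return 1 if split occurs among splits[0..n-1], otherwise 0. */
static int has_split(const unsigned long *splits, size_t n, unsigned long split)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (splits[i] == split)
      return 1;
  return 0;
}

/* Count the branches (splits) present in one tree but not in the other. */
size_t symmetric_difference(const unsigned long *a, size_t na,
                            const unsigned long *b, size_t nb)
{
  size_t i, distance = 0;

  for (i = 0; i < na; i++)
    if (!has_split(b, nb, a[i]))
      distance++;
  for (i = 0; i < nb; i++)
    if (!has_split(a, na, b[i]))
      distance++;
  return distance;
}
```

With five species numbered 0 to 4, a tree with internal branches {1,2} and {3,4} is stored as the masks 0x06 and 0x18, and a tree with internal branches {1,2} and {1,2,3} as 0x06 and 0x0E. Each tree has one branch the other lacks, so the distance between them is 2.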

Coming Attractions, Future Plans

There are some obvious deficiencies in this version. Some of these holes will be filled in the next few releases (leading to version 4.0). They include:

  1. Obviously we need to start thinking about a more visual mouse/windows interface, but only if that can be used on X windows, Macintoshes, and Windows.

  2. Program Penny and its relatives will be improved so as to run faster and find all most parsimonious trees more quickly.

  3. An "evolutionary clock" version of Contml will be done, and the same may also be done for Restml.

  4. We are gradually generalizing the tree structures in the programs to infer multifurcating trees as well as bifurcating ones. We should be able to have any program read any tree and know what to do with it, without the user having to fret about whether an unrooted tree was fed to a program that needs a rooted tree.

  5. In general, we need more support for protein sequences, including a codon model of change, allowing for different rates for synonymous and nonsynonymous changes.

  6. We also need more support for combining runs from multiple loci, allowing for different rates of evolution at the different loci.

  7. We will be expanding our use and production of XML data set files and XML tree files.

  8. A program to align molecular sequences on a predefined User Tree may ultimately be included. This will allow alignment and phylogeny reconstruction to proceed iteratively by successive runs of two programs, one aligning on a tree and the other finding a better tree based on that alignment. In the shorter run a simple two-sequence alignment program may be included.

  9. An interactive "likelihood explorer" for DNA sequences will be written. This will allow, either with or without the assumption of a molecular clock, trees to be varied interactively so that the user can get a much better feel for the shape of the likelihood surface. Likelihood will be able to be plotted against branch lengths for any branch.

  10. If possible we will allow use of Hidden Markov Models for correcting for purine/pyrimidine richness variations among species, within the framework of the maximum likelihood programs. That the maximum likelihood programs do not allow for base composition variation is their major limitation at the moment.

  11. The Hidden Markov Model (regional rates) option of Dnaml and Dnamlk will be generalized to allow for rates at sites to gradually change as one moves along the tree, in an attempt to implement Fitch and Markowitz's (1970) notion of "covarions".

  12. A more sophisticated compatibility program should be included, if I can find one.

  13. We are economizing on the size of the source code, and enforcing some standardization of it, by putting frequently used routines in separate files which can be linked into various programs. This will enforce a rather complete standardization of our code.

  14. We will move our code to an object-oriented language, most likely C++. One could describe the language that version 3.4 was written in as "Pascal", version 3.5 as "Pascal written in C", version 3.6 as "C written in C", and maybe version 4.0 as "C++ written in C" and then 4.1 as "C++ written in C++". At least that scenario is one possibility.

There will also be many future developments in the programs that treat continuously-measured data (quantitative characters) and morphological or behavioral data with discrete states, as I have new ideas for analyzing these data in ways that connect to within-species quantitative genetic analyses. This will compete with parsimony analysis.

Other Phylogeny Programs Available Elsewhere

A comprehensive list of phylogeny programs is maintained at the PHYLIP web site on the Phylogeny Programs pages:

http://evolution.gs.washington.edu/phylip/software.html

Here we will simply mention some of the major general-purpose programs. For many more and much more, see those web pages.

PAUP* A comprehensive program with parsimony, likelihood, and distance matrix methods. It competes with PHYLIP to be responsible for the most trees published. Written by David Swofford and distributed by Sinauer Associates of Sunderland, Massachusetts. It is described in a web page at http://www.sinauer.com/detail.php?id=8060. Current prices are $100 for the Macintosh version, $85 for the Windows version, and $150 for Unix versions for many kinds of workstations.

MrBayes The leading program for Bayesian inference of phylogenies. It uses Markov Chain Monte Carlo inference to assess support for clades and to infer posterior distributions of parameters. Produced by John Huelsenbeck and Fredrik Ronquist, it is available at its web site at http://mrbayes.net as a Mac OS X or Windows executable, or in source code in C.

MEGA A program by Sudhir Kumar of Arizona State University (written together with Koichiro Tamura, Joel Dudley and Masatoshi Nei). It can carry out parsimony and distance matrix methods for DNA sequence data. Version 4 for Windows, Macintosh, and Linux can be downloaded from the MEGA web site at http://www.megasoftware.net.

MacClade An interactive Macintosh program to rearrange trees and watch the changes in the fit of the trees to data as judged by parsimony. MacClade has a great many features including a spreadsheet data editor and many different descriptive statistics for different kinds of data. It is particularly designed to export and import data to and from PAUP*. MacClade is available for $125 from Sinauer Associates, of Sunderland, Massachusetts. It is described in a web page at http://www.sinauer.com/detail.php?id=4707 . MacClade is also described on its Web page, at http://phylogeny.arizona.edu/macclade/macclade.html.

PAML Ziheng Yang of the Department of Genetics and Biometry at University College, London has written this package of programs to carry out likelihood analysis of DNA and protein sequence data. It is one of the only packages able to use the codon model for protein sequence data which takes the genetic code reasonably fully into account. PAML is particularly strong in the options for coping with variability of rates of evolution from site to site, though it is less able than some other packages to search effectively for the best tree. It is available as C source code and as Macintosh and Windows executables from its web site at http://abacus.gene.ucl.ac.uk/software/paml.html .

TREE-PUZZLE This package by Korbinian Strimmer, Heiko Schmidt and Arndt von Haeseler was begun when von Haeseler and Strimmer were at the Universität München in Germany. TREE-PUZZLE can carry out likelihood methods for DNA and protein data, searching by the strategy of "quartet puzzling" which they invented. It can also compute distances. It superimposes trees estimated from many quartets of species. TREE-PUZZLE is available for Unix, Macintoshes, or Windows from their web site at http://www.tree-puzzle.de/ .

DAMBE A package written by Xuhua Xia of the Department of Biology of the University of Ottawa. Its initials stand for Data Analysis in Molecular Biology and Evolution. DAMBE is a general-purpose package for DNA and protein sequence phylogenies. It can read and convert a number of file formats, and has many features for descriptive statistics, and can compute a number of commonly-used distance matrix measures and infer phylogenies by parsimony, distance, or likelihood methods, including bootstrapping and jackknifing. There are a number of kinds of statistical tests of trees available and it can also display phylogenies. DAMBE includes a copy of ClustalW as well; DAMBE consists of Windows executables. It is available from its web site at http://dambe.bio.uottawa.ca/dambe.asp .

NONA Pablo Goloboff, of the Instituto Miguel Lillo in Tucumán, Argentina has written this very fast parsimony program, capable of some relevant forms of weighted parsimony, which can handle either DNA sequence data or discrete characters. It is available with some companion programs from http://www.cladistics.com/aboutNona.htm .

TNT This program, by Pablo Goloboff, J. S. Farris, and Kevin Nixon, is for searching large data sets for most parsimonious trees. The authors are respectively at the Instituto Miguel Lillo in Tucumán, Argentina, the Naturhistoriska Riksmuseet in Stockholm, Sweden, and the Hortorium, Cornell University, Ithaca, New York. TNT is described as faster than other methods, though not faster than NONA for small to medium data sets. It is distributed as Windows, Linux, and Mac OS X executables (the latter two require the PVM Parallel Virtual Machine library to be installed). The program and some support files including documentation are available from its download area at http://www.zmuc.dk/public/phylogeny/tnt (see the ReadMe! web page there). It is free, provided you agree to a license with some reasonable limitations.

These are only a few of the over 383 different phylogeny packages that are now available (as of July, 2008 - the number keeps increasing). The others are described (and web links and ftp addresses provided) at my Phylogeny Programs web pages at the address given above.

How You Can Help Me

Simply let me know of any problems you have had adapting the programs to your computer. I can often make "transparent" changes that, by making the code avoid the wilder, woolier, and less standard parts of C, not only help others who have your machine but even improve the chance of the programs functioning on new machines. I would like fairly detailed information on what gave trouble, on what operating system, machine, and (if relevant) compiler, and what had to be done to make the programs work. I am sometimes able to do some over-the-telephone trouble-shooting, particularly if I don't have to pay for the call, but electronic mail is the best way for me to be asked about problems, as you can include your input and output files so I can see what is going on (please do not send them as Attachments, but as part of the body of a message). I'd really like these programs to be able to run with only routine changes on absolutely everything, down to and possibly including the Amana Touchmatic Radarange Microwave Oven which was an Intel 8080 system (in fact, early versions of this package did run successfully on Intel 8080 systems running the CP/M operating system). A PalmPilot version was contemplated too.

I would also like to know timings of programs from the package, when run on the three test input files provided above, for various computer and compiler combinations, so that I can provide this information in the section on speeds of this document.

For the phylogeny plotting programs Drawgram and Drawtree, I am particularly interested in knowing what has to be done to adapt them for other graphic file formats.

You can also be helpful to PHYLIP users in your part of the world by helping them get the latest version of PHYLIP from our web site and by helping them with any problems they may have in getting PHYLIP working on their data.

Your help is appreciated. I am always happy to hear suggestions for features and programs that ought to be incorporated in the package, but please do not be upset if I turn out to have already considered the particular possibility you suggest and decided against it.

In Case of Trouble

Read The (documentation) Files Meticulously ("RTFM"). If that doesn't solve the problem, please check the Frequently Asked Questions web page at the PHYLIP web site:

http://evolution.gs.washington.edu/phylip/faq.html

and the PHYLIP Bugs web page at that site:

http://evolution.gs.washington.edu/phylip/bugs.html

If none of these answers your question, get in touch with me. My email address is given below. If you do ask about a problem, please specify the program name, version of the package, computer operating system, and send me your data file so I can test the problem. Also it will help if you have the relevant output and documentation files so that you can refer to them in any correspondence. I can also be reached by telephone by calling me in my office: +1-(206)-543-0150, or at home: +1-(206)-526-9057 (how's that for user support!). If I cannot be reached at either place, a message can be left at the office of the Department of Genome Sciences, +1-(206)-221-7377 but I prefer strongly that I not call you, as in any phone consultation the least you can do is pay the phone bill. Better yet, use email.

Particularly if you are in a part of the world distant from me, you may also want to try to get in touch with other users of PHYLIP nearby. I can also, if requested, provide a list of nearby users.

Joe Felsenstein
Department of Genome Sciences
University of Washington
Box 355065
Seattle, Washington 98195-5065, U.S.A.

Electronic mail addresses: joe (at) gs.washington.edu

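A Brief Description of the Programs

PHYLIP, the Phylogeny Inference Package, is a package of programs for inferring phylogenies (evolutionary trees). It has been distributed since 1980 and has over 20,000 registered users, making it the most widely distributed package of phylogeny programs. It is available free from its web site:

http://evolution.gs.washington.edu/phylip.html

PHYLIP is distributed as C source code, and also as executables for some common systems. It can infer phylogenies by parsimony, compatibility, distance matrix, and likelihood methods. It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices. It can handle data that are nucleotide sequences, protein sequences, gene frequencies, restriction sites, restriction fragments, distances, discrete characters, and continuous characters.

The Documentation Files and How to Read Them

PHYLIP comes with extensive documentation. This includes the main documentation file (the one you are reading), which you should read fairly completely. In addition there are documentation files for groups of programs: the molecular sequence programs, the distance matrix programs, the gene frequency and continuous characters programs, the discrete characters programs, and the tree drawing programs. Finally, each program has its own documentation file. References for all the documentation files are in this main documentation file. A good way to read them is:

  1. Read this main documentation file.

  2. Tentatively decide which programs are of interest to you.

  3. Read the documentation files for the groups of programs that contain them.

  4. Read the documentation files for those individual programs.

There is also an excellent guide to using PHYLIP 3.6, written by Jarno Tuimala of the Center for Scientific Computing in Espoo, Finland. It is a PDF document that can be downloaded here.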

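What the Programs Do

Here is a short description of each of the programs. For more detailed discussion you should definitely read the documentation file for the individual program and the documentation file for the group of programs it is in. In this list the name of each program is a link which will take you to the documentation file for that program. Note that there is no program in the PHYLIP package called PHYLIP.

Protpars

Estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) by the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes occur more easily.

Dnapars

Estimates phylogenies by the parsimony method from nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state. It can also do transversion parsimony. It can handle multifurcations, reconstruct ancestral states, use 0/1 character weights, and infer branch lengths.

Dnamove

Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and display of the reconstructed ancestral bases. This can be used to find parsimony or compatibility estimates by hand.

Dnapenny

Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search. This may not be practical (depending on the data) for more than 10-11 species.

Dnacomp

Estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites that could have all states (nucleotides) uniquely evolved on the same tree. Compatibility is particularly appropriate when sites vary greatly in their rates of evolution, but we do not know in advance which are the less reliable ones.

Dnainvar

For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns. Lake's invariants are the method which he calls "evolutionary parsimony".

Dnaml

Estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites. It also uses a Hidden Markov model of rates, with the program inferring which sites have which rates. This also allows gamma-distribution and gamma-plus-invariant sites distributions of rates across sites.

Dnamlk

Same as Dnaml but assumes a molecular clock. Using the two programs together permits a likelihood ratio test of the molecular clock hypothesis.

Proml

Estimates phylogenies from protein amino acid sequences by maximum likelihood. The PAM, JTT, or PMB models can be employed, and also a Hidden Markov model of rates, with the program inferring which sites have which rates. This also allows gamma-distribution and gamma-plus-invariant sites distributions of rates across sites. It also allows different rates of change at known sites.

Promlk

Same as Proml but assumes a molecular clock. Using the two programs together permits a likelihood ratio test of the molecular clock hypothesis.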

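Dnadist

Computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs. The distances are the Jukes-Cantor formula, one based on Kimura's 2-parameter method, the F84 model used in Dnaml, and the LogDet distance. The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model. The program can also make a table of similarity between sequences.

Protdist

Computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, the JTT matrix model, the PMB model, Kimura's 1983 approximation, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model. The program can also make a table of similarity between sequences. The distances can be used in the distance matrix programs.

Restdist

Computes distances from restriction sites data or restriction fragments data. The restriction sites option is also the one to use to compute distances for RAPDs or AFLPs.

Restml

Estimates phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites). It employs the Jukes-Cantor symmetrical model of nucleotide change, which does not allow for differences of rate between transitions and transversions. This program is slow.

Seqboot

Reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program Consense to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters. It can also rewrite a data set, converting it from the PHYLIP Interleaved and Sequential forms into a preliminary version of a new XML sequence alignment format which is under development and which is described in the Seqboot documentation web page.

Fitch

Estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. Uses the Fitch-Margoliash criterion and some related least squares criteria, or the Minimum Evolution distance matrix method. Does not assume an evolutionary clock. This program will be useful with distances computed from molecular sequences, restriction sites or fragments distances, DNA hybridization measurements, and genetic distances computed from gene frequencies.

Kitsch

Estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. The Fitch-Margoliash criterion and other least squares criteria, or the Minimum Evolution criterion, are possible. This program will be useful with distances computed from molecular sequences, restriction sites or fragments distances, DNA hybridization measurements, and genetic distances computed from gene frequencies.

Neighbor

An implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method" and of the UPGMA (Average Linkage clustering) method. Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion, but the methods are very fast and thus can handle much larger data sets.

Contml

Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing such data is to compute Nei's genetic distance and use one of the distance matrix programs. This program can also do maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, but it assumes that the characters evolve at equal rates and in an uncorrelated fashion, so it does not take into account the usual correlations of characters.

Gendist

Computes one of three different genetic distance formulas from gene frequency data. The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The first is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.

Contrast

Reads a tree from a tree file, and a data set with continuous characters data, and produces the independent contrasts for those characters, for use in any multivariate statistics package. Will also produce covariances, regressions and correlations between characters for those contrasts. Can also correct for within-species sampling variation when individual phenotypes are available within a population.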

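Pars

Multistate discrete-characters parsimony method. Up to 8 states (as well as "?") are allowed. Cannot do Camin-Sokal or Dollo parsimony. Can handle multifurcations, reconstruct ancestral states, use character weights, and infer branch lengths.

Mix

Estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Allows use of the Wagner parsimony method, the Camin-Sokal parsimony method, and arbitrary mixtures of these. Also reconstructs ancestral states and allows weighting of characters (does not infer branch lengths).

Move

Interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

Penny

Finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria, using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.

Dollop

Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters. Dollo parsimony is particularly appropriate for restriction sites data; with ancestral states specified as unknown it may be appropriate for restriction fragments data.

Dolmove

Interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

Dolpenny

Finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria, using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.

Clique

Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states. The largest clique (or all cliques within a given size range of the largest one) is found by a very fast branch-and-bound search method. The method does not allow for missing data. For such cases the T (Threshold) option of Pars or Mix may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but it is not known in advance which ones are which.

Factor

Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham. This program was formerly used to accommodate multistate characters in Mix, but this is no longer necessary now that Pars is available.

Drawgram

Plots rooted phylogenies, cladograms, circular trees and phenograms in a wide variety of user-controllable formats. The program is interactive and allows previewing of the tree on PC, Macintosh, or X Windows screens, or on Tektronix or Digital graphics terminals. Final output can be to a file formatted for one of the drawing programs, for a ray-tracing or VRML browser, or one that can be sent to a laser printer (such as Postscript or PCL-compatible printers), to graphics screens or terminals, to pen plotters, or to dot matrix printers capable of graphics.

Drawtree

Similar to Drawgram but plots unrooted phylogenies.

Treedist

Computes the Branch Score distance between trees, which allows for differences in tree topology and which also makes use of branch lengths. Also computes the Robinson-Foulds symmetric difference distance between trees, which allows for differences in tree topology but does not use branch lengths.

Consense

Computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree. Is not able to compute the Adams consensus tree. Trees are given in a tree file in standard nested-parenthesis notation, which is produced by many of the tree estimation programs in the package. This program can be used as the final step in doing bootstrap analyses for many of the methods in the package.

Retree

Reads in a tree (with branch lengths if necessary) and allows you to reroot the tree, to flip branches, to change species names and branch lengths, and then write the result out. Can be used to convert between rooted and unrooted trees, and to write the tree into a preliminary version of a new XML tree file format which is under development and which is described in the Retree documentation web page.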

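Running the Programs

This section assumes that you have obtained PHYLIP as compiled executables (for Windows, Mac OS X, or Linux), or have obtained the source code and compiled it yourself (for Linux, Unix, Mac OS X, Windows, or OpenVMS). For machines for which compiled executables are available, there will usually be no need for you to have a compiler or to compile the programs yourself. This section describes how to run the programs. Later in this document we discuss how to download and install PHYLIP (in case you are somehow reading this without yet having done that). Normally you will only read this document after downloading and installing PHYLIP.

A word about input files

For all of these types of machines, it is important to have prepared in advance the input files to be given to the programs (typically data files). They can be prepared in any editor, but take care to save them in Text Only ("flat ASCII") format, not the format that word processors such as Microsoft Word want to write. (In Microsoft Word, make sure the encoding is "US ASCII", since any of the Unicode encodings may cause trouble.) You must read the PHYLIP documentation that describes the file formats used by the programs. There is a partial description in the next section of this document. Input files can also be obtained by running a program that produces output in PHYLIP format (some of the programs in this package do so, as do programs by others such as the sequence alignment program ClustalW and the sequence format conversion program Readseq). There is no program in PHYLIP that provides an editor for input files (so do not expect to start one of the programs and click the mouse to create a data file).

When they start running, the programs look first for input files with particular names (such as infile, treefile, intree, or fontfile). Exactly which file names they look for varies from program to program, and you should read the documentation file for the particular program to find out. If you have prepared files with those names, the programs will use them and not ask you for the file name. If they do not find files with those names, the programs will say that they cannot find a file of that name, and ask you to type in the file name. For example, if Dnaml looks for the file infile and does not find it, it prints the message:

dnaml: can't find input file "infile"
Please enter a new file name>

This does not mean that an error has occurred. All you need to do is type in the name of the file.

The programs look for the input files in the same folder as the program (a folder is the same thing as a "directory"). In Windows, Mac OS X, Linux, or Unix, when a program asks you for the file name, you can type in the path to that file as part of the file name (for example, if the file is in the folder above the current folder, you can type ../myfile.dna as the file name). If you do not know what a "folder" is, or what "above" means, then you belong to the new generation that just clicks the mouse and expects a list of file names to appear magically. (Typically such people have no idea where the files are on their system, and keep their file systems in great disorder.) In that case you should find someone to explain folders to you.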

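Running the programs on a Unix or Linux system

Type the name of the program in lower-case letters (such as dnaml). To stop a program while it is running, type Control-C (hold down the Ctrl key and press C).

On some systems you may need to type ./ before the program name, so that the example above becomes ./dnaml. This is mostly because the user's PATH sometimes does not include the current directory, usually for security reasons.

Running the programs in background or under control of a script file

When running the programs you may want to put them in the background so that you can do other things. On systems with a windowing environment they can be put in their own window, and commands like the Unix and Linux nice command can be used to give them lower priority so that they do not interfere with interactive programs in other windows. This part of the discussion assumes either a Windows system or a Unix or Linux system. I will note when a command works on one of these systems but not the other. Mac OS X is in fact Unix, so you can follow the Unix instructions, using a Terminal window where necessary.

If there is no windowing environment, on a Unix or Linux system you can put the job in the background by following the command that runs the program with an ampersand (&). You will have to put all the responses to the interactive menu into a file and tell the background job to take its input from that file.

On Windows systems there is no & or nice command, but input and output redirection and script (batch) files do work well in a Command Prompt window. A script file can be invoked by clicking on its icon or by typing its name in a Command Prompt window. A script file must have the extension .bat, as in foofile.bat. You can run the batch file by typing its name (such as foofile) in a Command Prompt window, without the .bat.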

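Here is an example for Windows, Linux, or a Terminal window on Mac OS X. Suppose you want to run Dnapars in the background, taking its input data from a file called sequences.dat, putting its interactive output into a file called screenout, and using a file called input to hold the interactive input. The file input need only contain two lines:

sequences.dat
Y

These are what you would have typed when running the program interactively: the first line responds to the program's request for a file name when it does not find the file infile, and the second responds to the menu.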

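To run the program in the background on a Unix or Linux system, you would simply give the command:

dnapars < input > screenout &

This starts the program with the contents of the file input supplying the responses and with the interactive output written to the file screenout. The run will also create the usual output file and tree file. (Remember that if you start any other PHYLIP program in the same directory while this one is running in the background, the output of one program may overwrite the output of the other.)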
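For readers who drive PHYLIP from a script, the same redirection can be sketched in Python. This is only an illustration under stated assumptions: the function name run_with_responses is hypothetical and not part of PHYLIP, and the call waits for the program to finish instead of putting it in the background.

```python
import subprocess


def run_with_responses(command, response_path, screen_path):
    """Run a program the way `command < response_path > screen_path` would.

    The response file supplies what would have been typed at the keyboard;
    everything the program prints (stdout and stderr) goes to screen_path.
    Returns the exit status of the program.
    """
    with open(response_path, "rb") as responses, open(screen_path, "wb") as screen:
        completed = subprocess.run(
            command,
            stdin=responses,
            stdout=screen,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return completed.returncode


# Hypothetical usage, assuming a dnapars executable is on the PATH:
#     run_with_responses(["dnapars"], "input", "screenout")
```

To leave the job running in the background, subprocess.Popen could be used in place of subprocess.run, with the files kept open until the process ends.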

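If you want to give the program lower priority so that it does not interfere with other work, and you have Berkeley Unix type job control facilities on your Unix or Linux system (they usually are there), you can use the nice command:

nice +10 dnapars < input > screenout &

which lowers the priority of the run. (The +10 form is csh syntax; in sh-compatible shells the equivalent is nice -n 10.) To also time the run and put the timing at the end of screenout, you can do this:

nice +10 ( time dnapars < input ) >& screenout &

which I will not attempt to explain.

On Unix or Linux systems you may also want to try sending the interactive output to the "black hole" file /dev/null so that you are not bothered with it (but then you cannot look at it to find out what went wrong). If you find that you cannot create files that large, you can try turning off some of the options of the programs you run.

If you are doing several runs in one, as for example when you do a bootstrap analysis using Seqboot, Dnapars (say), and Consense, you can use an editor to create a script with these commands: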

seqboot < input1 > screenout
mv outfile infile
dnapars < input2 >> screenout
mv outtree intree
consense < input3 >> screenout

This is the Unix or Linux version; in the Windows version, renaming the files and appending output to the file screenout are done differently.
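The order of operations in that script can also be sketched in Python. This is a sketch, not the author's method: bootstrap_pipeline is a hypothetical name, and the run argument is assumed to be a caller-supplied function that executes one PHYLIP program in the working directory with a given response file.

```python
import os


def bootstrap_pipeline(run, workdir="."):
    """Chain Seqboot, Dnapars and Consense as the shell script does.

    `run(program, response_file)` is supplied by the caller and must run
    one program inside `workdir`, feeding it the named response file.
    """
    def rename(old, new):
        # Same effect as `mv old new`: an existing target is replaced.
        os.replace(os.path.join(workdir, old), os.path.join(workdir, new))

    run("seqboot", "input1")      # writes the resampled data sets to outfile
    rename("outfile", "infile")   # they become the input of Dnapars
    run("dnapars", "input2")      # writes its trees to outtree
    rename("outtree", "intree")   # they become the input of Consense
    run("consense", "input3")
```

Passing the runner in as an argument keeps the sketch independent of how the programs are actually started on a given system.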

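On Unix or Linux systems the script file can be given a name such as foofile; on Windows systems it would have a name like foofile.bat.

On Unix or Linux systems the script must be given execute permission with the command chmod +x foofile, followed by the command rehash. The job that foofile describes can then be run in the background on a Unix or Linux system with the command

foofile &

On Windows systems you can run the script by clicking on its icon, which will have a small gear symbol on it.

Note that you must also have prepared, in the separate files input1, input2, and input3, the interactive input commands for Seqboot (including the random number seed), Dnapars, and Consense. Note also that when PHYLIP programs try to open a new output file (such as outfile, outtree, or plotfile) and find that a file of that name already exists, they ask whether you want to overwrite it, write to a different file, append the output to the end of that file, or quit without writing anything. This means that when writing a script file you need to know whether this prompt will appear, so you must know in advance whether the file will exist. You may want to put into the script some commands that test whether an output file already exists and remove it if it does, for example on Unix, Linux, or Mac OS X systems: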

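if test -e fubarfile
then
rm fubarfile
fi

You could even add a command that creates a file of that name, so that you can be sure it does exist! Either way, you will then know whether your response file needs to answer the question about overwriting an existing output file.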
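The same precaution can be written in Python. The sketch below is illustrative only: clear_outputs is a hypothetical helper, and the default list of names simply repeats the output file names mentioned in the text.

```python
from pathlib import Path

# Default output file names mentioned in the text.
OUTPUT_NAMES = ("outfile", "outtree", "plotfile")


def clear_outputs(directory=".", names=OUTPUT_NAMES):
    """Delete leftover output files so that no overwrite prompt appears.

    Returns the names that were actually removed.
    """
    removed = []
    for name in names:
        target = Path(directory) / name
        if target.is_file():
            target.unlink()
            removed.append(name)
    return removed
```

After calling it, the response file for the next run does not need an answer to the overwrite question for those names.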

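Preparing Input Files

The input files used by the PHYLIP programs must be prepared separately; there is no data editor within PHYLIP. You can prepare them yourself with a word processor (or text editor), or you can use a program that produces output in PHYLIP format.

Sequence alignment programs such as ClustalW commonly have an option to write files in PHYLIP format, and some other phylogeny programs, such as MacClade and TreeView, can also write PHYLIP-format files.

Be sure that the input files are in "Text Only" or "ASCII" format. This means that they contain only printable ASCII/ISO characters and no unprintable characters. Many word processors, such as Microsoft Word, save their files in a format that contains unprintable characters unless you tell them not to. In Microsoft Word and similar word processors, the first time you save a file the Save command in the File menu actually performs a Save As, asking you which format to save the file in.

  • If you are using Microsoft Word, choose Plain Text. A dialog box will appear (or, in the Mac OS X version of Word, an Options button) in which you can choose the US-ASCII option. The options that begin with Western European should also work. Other encodings may not.

  • If you are using WordPad, choose Text Document (*.txt). Do not choose Unicode Text Document.

  • If you are using Notepad, choose Text Document and the ANSI encoding, not the Unicode or UTF8 encodings.

The next time you edit the file and use the Save command, the program should use the existing settings without asking you. If the programs in this package cannot read the input file you have made, check whether you got these settings right: use the Save As command in the File menu and make the correct settings.

Text editors such as vi and emacs on Unix and Linux, SimpleText on Mac OS, or the pico editor that comes with the pine mail program save their files as plain text, so they will not cause problems.

The format of the input files is discussed below. You should also read the PHYLIP documentation for the data types and programs you will be using, as more detail is given there.

Input and output files

For most of the PHYLIP programs, information comes from a set of input files and goes to a set of output files (the diagram below was drawn by the original author in plain text and needs a fixed-width font):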

                  -------------------
                  |                 |
infile ---------> |                 |
                  |                 |
intree ---------> |                 | -----------> outfile
                  |                 |
weights --------> |     program     | -----------> outtree
                  |                 |
categories -----> |                 | -----------> plotfile
                  |                 |
fontfile -------> |                 |
                  |                 |
                  -------------------

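The programs interact with the user by presenting a menu. Aside from the user's choices in the menu, they read all their other input from files. These files have default names. The program will try to find a file of that name; if it does not, it will ask the user to supply the name of the file. Input data such as DNA sequences comes from a file whose default name is infile. If the user supplies a tree, it is in a file whose default name is intree. Weights for the characters are in the file weights, and the tree plotting programs need digitized fonts, which are supplied in the file fontfile (all of these are default names).

For example, if Dnaml looks for the file infile and does not find it, it prints the message:

dnaml: can't find input file "infile"
Please enter a new file name>

This simply means that it wants you to type in the name of the input file.

Data file format

I have tried to adhere to a rather stereotyped input and output format. For the parsimony, compatibility and maximum likelihood programs, excluding the distance matrix programs, the simplest version of the input data file looks something like this: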

    6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

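The first line of the input file contains the number of species and the number of characters (in this case sites). These two fields are in free format, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include blanks and punctuation marks), and continuing with the characters for that species. The name must be on the same line as the first character of the data for that species. (I will use the term "species" for the tips of the trees, recognizing that in some cases these will actually be populations or individual gene sequences.)

The name should be ten characters long, filled out to the full ten characters with blanks if it is shorter. Any printable ASCII/ISO character is allowed in the name, except for parentheses ("(" and ")"), square brackets ("[" and "]"), colon (":"), semicolon (";") and comma (","). If you forget to extend the names to ten characters with blanks, the program will get out of step with the contents of the data file and will end up reporting an error.

Note that in the species names a tab character counts as only one character, so tabs can cause trouble: the name may look to you as if it has ten characters when the program sees fewer. If you use a word processor such as Word to make your data file, it is strongly recommended that you check that the file contains no tabs. You can check by moving the cursor through the name with the arrow keys and seeing whether it suddenly jumps two or more character positions. It is best to pad the names with blanks rather than tabs.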
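The naming rules above can be checked mechanically before a data file is written. The following Python sketch is one way to do it; phylip_name is a hypothetical helper, and restricting names to printable ASCII (rather than the wider ISO set the text allows) is a simplifying assumption of this example.

```python
# Characters the text says may not appear in a species name.
FORBIDDEN = set("()[]:;,")


def phylip_name(name, width=10):
    """Return `name` padded with blanks to exactly `width` characters.

    Raises ValueError for tabs, non-printable characters, forbidden
    punctuation, or names that are too long to fit.
    """
    if "\t" in name:
        raise ValueError("tab character in species name")
    if not all(32 <= ord(ch) < 127 for ch in name):
        raise ValueError("non-printable or non-ASCII character in species name")
    bad = sorted(FORBIDDEN.intersection(name))
    if bad:
        raise ValueError("forbidden character(s) in species name: " + " ".join(bad))
    if len(name) > width:
        raise ValueError("species name longer than %d characters" % width)
    return name.ljust(width)
```

For instance, phylip_name("Archaeopt") returns the nine letters followed by one blank, which is the form the data file needs.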

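In the discrete-character programs, the DNA sequence programs and the protein sequence programs, the characters are each a single letter or digit, sometimes separated by blanks. In the continuous-characters programs they are real numbers with decimal points, separated by blanks:

Latimeria 2.03 3.457 100.2 0.0 -3.7

The conventions for continuing the data beyond one line per species differ between the molecular sequence programs and the others. The molecular sequence programs can take the data in "aligned" or "interleaved" format, in which we first have some lines giving the first part of each of the sequences, then lines giving the next part of each, and so on. Thus the sequences might look like this: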

    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

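Note that the sequences have a blank every ten sites to make them easier to read; any number of blanks is allowed. The blank line which separates the two groups of lines (the ones containing sites 1-20 and the ones containing sites 21-39) may or may not be present. It is important that the number of sites in each group be the same for all species (that is, the programs will not run correctly if the line for the first species contains 20 bases but the line for the second species contains 21).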
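To make the interleaved layout concrete, here is a minimal Python reader for it. It is a sketch rather than a description of how the PHYLIP programs parse their input: read_interleaved is a hypothetical name, blank lines are simply ignored, and only the checks mentioned above are made.

```python
def read_interleaved(text, name_width=10):
    """Parse interleaved PHYLIP sequence data into {name: sequence}.

    Every group of lines must give the same number of sites for all
    species, and the totals must match the header.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    n_species, n_sites = (int(field) for field in lines[0].split()[:2])
    body = lines[1:]
    if not body or len(body) % n_species != 0:
        raise ValueError("number of data lines is not a multiple of the species count")
    # Names appear only in the first group, in the first ten columns.
    names = [line[:name_width].strip() for line in body[:n_species]]
    sequences = [""] * n_species
    for start in range(0, len(body), n_species):
        block = body[start:start + n_species]
        if start == 0:
            parts = ["".join(line[name_width:].split()) for line in block]
        else:
            parts = ["".join(line.split()) for line in block]
        if len(set(len(part) for part in parts)) != 1:
            raise ValueError("unequal numbers of sites within one group of lines")
        for index, part in enumerate(parts):
            sequences[index] += part
    if any(len(seq) != n_sites for seq in sequences):
        raise ValueError("sequence length does not match the header")
    return dict(zip(names, sequences))
```

Applied to the example above, it would return six sequences of 39 sites each, keyed by the species names with their padding blanks stripped.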

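Alternatively, an option can be selected in the menu to take the data in "sequential" format, with all of the data for the first species, then all of the characters for the second species, and so on. This is also the way that the discrete characters programs and the gene frequencies and quantitative characters programs read the data. They do not accept the interleaved format.

In the sequential format, the character data can run on to a new line at any point (except in the middle of a species name or, in the continuous character and distance matrix programs, in the middle of a real number). Thus it is legal to have:

Archaeopt 001100
1101

or even:

Archaeopt 
0011001101

though note that the species name must still be a full ten characters long: in the example above, there must be a blank after the "t". In all cases it is possible to put blanks between any of the character values, so that

Archaeopt 0011001101 0111011100

is allowed.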
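The sequential rules can be illustrated with a small Python reader that keeps collecting characters for one species, across line breaks and blanks, until the number promised in the header has been read. It is a sketch for single-letter or single-digit characters only; read_sequential is a hypothetical name, and skipping blank lines between species is an assumption of this example.

```python
def read_sequential(text, name_width=10):
    """Parse sequential PHYLIP character data into {name: characters}."""
    lines = text.splitlines()
    n_species, n_chars = (int(field) for field in lines[0].split()[:2])
    data = {}
    pos = 1
    for _ in range(n_species):
        # Skip blank lines before the next species (an assumption).
        while pos < len(lines) and not lines[pos].strip():
            pos += 1
        if pos >= len(lines):
            raise ValueError("fewer species than the header announces")
        name = lines[pos][:name_width].strip()
        chars = "".join(lines[pos][name_width:].split())
        pos += 1
        # The data may continue on following lines; blanks are ignored.
        while len(chars) < n_chars:
            if pos >= len(lines):
                raise ValueError("file ended in the middle of species " + name)
            chars += "".join(lines[pos].split())
            pos += 1
        if len(chars) != n_chars:
            raise ValueError("too many characters for species " + name)
        data[name] = chars
    return data
```

Both wrapped forms shown above parse to the same ten characters for Archaeopt.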

Note that you can convert molecular sequence data between the interleaved and the sequential data formats by using the Rewrite option of the J menu item in Seqboot.

If you make an error in the format of the input file, the programs can sometimes detect that they have been fed an illegal character or illegal numerical value and issue an error message such as BAD CHARACTER STATE:, often printing out the bad value, and sometimes the number of the species and character in which it occurred. The program will then stop shortly after. One of the things which can lead to a bad value is the omission of something earlier in the file, or the insertion of something superfluous, which cause the reading of the file to get out of synchronization. The program then starts reading things it didn't expect, and concludes that they are in error. So if you see this error message, you may also want to look for the earlier problem that may have led to the program becoming confused about what it is reading.

Some options are described below, but you should also read the documentation for the groups of the programs and for the individual programs.

The Menu

The menu is straightforward. It typically looks like this (this one is for Dnapars):

DNA parsimony algorithm, version 3.6

Setting for this run:
  U                 Search for best tree?  Yes
  S                        Search option?  More thorough search
  V              Number of trees to save?  10000
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species 1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  N           Use Transversion parsimony?  No, count all steps
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change

If you want to accept the default settings (they are shown in the above case) you can simply type Y followed by pressing on the Enter key. If you want to change any of the options, you should type the letter shown to the left of its entry in the menu. For example, to set a threshold type T. Lower-case letters will also work. For many of the options the program will ask for supplementary information, such as the value of the threshold.

Note the Terminal type entry, which you will find on all menus. It allows you to specify which type of terminal your screen is. The options are an IBM PC screen, an ANSI standard terminal, or none. Choosing zero (0) toggles among these three options in cyclical order, changing each time the 0 option is chosen. If one of them is right for your terminal the screen will be cleared before the menu is displayed. If none works, the none option should probably be chosen. The programs should start with a terminal option appropriate for your computer, but if they do not, you can change the terminal type manually. This is particularly important in program Retree where a tree is displayed on the screen - if the terminal type is set to the wrong value, the tree can look very strange.

The other numbered options control which information the program will display on your screen or on the output files. The option to Print indications of progress of run will show information such as the names of the species as they are successively added to the tree, and the progress of rearrangements. You will usually want to see these as reassurance that the program is running and to help you estimate how long it will take. But if you are running the program "in background" as can be done on multitasking and multiuser systems, and do not have the program running in its own window, you may want to turn this option off so that it does not disturb your use of the computer while the program is running. Note also menu option 3, "Print out tree". This can be useful when you are running many data sets, and will be using the resulting trees from the output tree file. It may be helpful to turn off the printing out of the trees in that case, particularly if those files would be too big.

The Output File

Most of the programs write their output onto a file called (usually) outfile, and a representation of the trees found onto a file called outtree.

The exact contents of the output file vary from program to program and also depend on which menu options you have selected. For many programs, if you select all possible output information, the output will consist of (1) the name of the program and its version number, (2) some of the input information printed out, and (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree. A typical rooted tree looks like this:

                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine

The interpretation of the tree is fairly straightforward: it "grows" from left to right. The numbers at the forks are arbitrary and are used (if present) merely to identify the forks. For many of the programs the tree produced is unrooted. Rooted and unrooted trees are printed in nearly the same form, but the unrooted ones are accompanied by the warning message:

remember: this is an unrooted tree!

to indicate that this is an unrooted tree and to warn against taking the position of its root too seriously. (Mathematicians still call an unrooted tree a tree, though some systematists unfortunately use the term "network" for an unrooted tree. This conflicts with standard mathematical usage, which reserves the name "network" for a completely different kind of graph). The root of this tree could be anywhere, say on the line leading immediately to Mouse. As an exercise, see if you can tell whether the following tree is or is not a different one from the above:

             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine

remember: this is an unrooted tree!

(it is not different). It is important also to realize that the lengths of the segments of the printed tree may not be significant: some may actually represent branches of zero length, in the sense that there is no evidence that those branches are nonzero in length. Some of the diagrams of trees attempt to print branches approximately proportional to estimated branch lengths, while in others the lengths are purely conventional and are presented just to make the topology visible. You will have to look closely at the documentation that accompanies each program to see what it presents and what is known about the lengths of the branches on the tree. The above tree attempts to represent branch lengths approximately in the diagram. But even in those cases, some of the smaller branches are likely to be artificially lengthened to make the tree topology clearer. Here is what a tree from Dnapars looks like, when no attempt is made to make the lengths of branches in the diagram proportional to estimated branch lengths:

                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine

remember: this is an unrooted tree!

When a tree has branch lengths, it will be accompanied by a table showing for each branch the numbers (or names) of the nodes at each end of the branch, and the length of that branch. For the first tree shown above, the corresponding table is:

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------
    1          Bovine          0.90216     (  0.50346,     1.30086) **
    1          Mouse           0.79240     (  0.42191,     1.16297) **
    1             2            0.48553     (  0.16602,     0.80496) **
    2             3            0.12113     (     zero,     0.24676) *
    3             4            0.04895     (     zero,     0.12668)
    4             5            0.07459     (  0.00735,     0.14180) **
    5          Human           0.10563     (  0.04234,     0.16889) **
    5          Chimp           0.17158     (  0.09765,     0.24553) **
    4          Gorilla         0.15266     (  0.07468,     0.23069) **
    3          Orang           0.30368     (  0.18735,     0.41999) **
    2          Gibbon          0.33636     (  0.19264,     0.48009) **

      *  = significantly positive, P < 0.05
      ** = significantly positive, P < 0.01

Ignoring the asterisks and the approximate confidence limits, which will be described in the documentation file for Dnaml, we can see that the table gives a more precise idea of what the lengths of all the branches are. Similar tables exist in distance matrix and likelihood programs, as well as in the parsimony programs Dnapars and Pars.

Some of the parsimony programs in the package can print out a table of the number of steps that different characters (or sites) require on the tree. This table may not be obvious at first. A typical example looks like this:

 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1

The numbers across the top and down the side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 1 step in this case.
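The row-and-column lookup is simply a split of the site number into its tens and units. As a small illustration in Python (the names STEPS and steps_at are hypothetical and not part of PHYLIP; the values are copied from the example table, on the assumption that the empty first cell of row 0 stands for the nonexistent site 0):

```python
# Rows of the example "steps in each site" table, keyed by row label.
# Sites are numbered from 1, so the first cell of row 0 is empty (None).
STEPS = {
    0:  [None, 2, 2, 2, 2, 1, 1, 2, 2, 1],
    10: [1, 2, 3, 1, 1, 1, 1, 1, 1, 2],
    20: [1, 2, 2, 1, 2, 2, 1, 1, 1, 2],
    30: [1, 2, 1, 1, 1, 2, 1, 3, 1, 1],
    40: [1],
}


def steps_at(site):
    """Return the number of steps recorded for a given site number."""
    tens, column = divmod(site, 10)   # site 23 -> row 20, column 3
    return STEPS[tens * 10][column]


print(steps_at(23))   # site 23: column "3" of row "20"
```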

There are many other kinds of information that can appear in the output file. They vary from program to program, and we leave their description to the documentation files for the specific programs.

The Tree File

In output from most programs, a representation of the tree is also written into the tree file outtree. The tree is specified by nested pairs of parentheses, enclosing names and separated by commas. We will describe how this works below. If there are any blanks in the names, these must be replaced by the underscore character "_". Trailing blanks in the name may be omitted. The pattern of the parentheses indicates the pattern of the tree by having each pair of parentheses enclose all the members of a monophyletic group. The tree file could look like this:

((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));

In this tree the first fork separates the lineage leading to Mouse and Bovine from the lineage leading to the rest. Within the latter group there is a fork separating Gibbon from the rest, and so on. The entire tree is enclosed in an outermost pair of parentheses. The tree ends with a semicolon. In some programs such as Dnaml, Fitch, and Contml, the tree will be unrooted. An unrooted tree should have its bottommost fork have a three-way split, with three groups separated by two commas:

(A,(B,(C,D)),(E,F));

Here the three groups at the bottom node are A, (B,C,D), and (E,F). The single three-way split corresponds to one of the interior nodes of the unrooted tree (it can be any interior node of the tree). The remaining forks are encountered as you move out from that first node. Some of the newer programs are able to tolerate these other forks being multifurcations (multi-way splits). You should check the documentation files for the particular programs you are using to see which of these forms they expect the user tree to be in. Note that many of the programs that actually estimate an unrooted tree (such as Dnapars) produce trees in the treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward having all programs be able to read all trees, whether rooted or unrooted, multifurcating or bifurcating, and having them do the right thing with them. But this is a long-term goal and it is not yet achieved.

For programs that infer branch lengths, these are given in the trees in the tree file as real numbers following a colon, and placed immediately after the group descended from that branch. Here is a typical tree with branch lengths:

((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);

Note that the tree may continue to a new line at any time except in the middle of a name or the middle of a branch length, although in trees written to the tree file this will only be done after a comma.
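To make the nested-parenthesis notation concrete, here is a minimal Python sketch of a reader for the subset of the format described in this section: parentheses, commas, names with underscores standing for blanks, optional branch lengths after a colon, and a closing semicolon. It is an illustration only, not the parser used by the PHYLIP programs; the function names parse_newick and tips are hypothetical, and quoted labels and comments are deliberately not handled.

```python
def parse_newick(text):
    """Parse one tree into nested (name, length, children) tuples."""
    s = "".join(text.split())      # line breaks and blanks are dropped
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:;":
            pos += 1
        return s[start:pos]

    def read_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                           # skip "("
            children.append(read_node())
            while peek() == ",":
                pos += 1
                children.append(read_node())
            if peek() != ")":
                raise ValueError("expected ')' at character %d" % pos)
            pos += 1                           # skip ")"
        name = read_label().replace("_", " ")  # underscores stand for blanks
        length = None
        if peek() == ":":
            pos += 1
            length = float(read_label())
        return (name, length, children)

    root = read_node()
    if peek() != ";":
        raise ValueError("tree must end with a semicolon")
    return root


def tips(node):
    """List the tip names below a node, left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [t for child in children for t in tips(child)]


example = """((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);"""
tree = parse_newick(example)
print(tips(tree))
```

Each pair of parentheses becomes one tuple whose third element lists the members of that group, so the monophyletic groups of the tree can be read off by walking the nested tuples.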

These representations of trees are a subset of the standard adopted on 24 June 1986 at the annual meetings of the Society for the Study of Evolution by an informal committee (its final session in Newick's lobster restaurant - hence its name, the Newick standard) consisting of Wayne Maddison (author of MacClade), David Swofford (PAUP), F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and the original PHYLIP tree drawing programs), James Archie, William H.E. Day, and me. This standard is a generalization of PHYLIP's format, itself based on a well-known representation of trees in terms of parenthesis patterns which is due to the famous mathematician Arthur Cayley, and which has been around for over a century. The standard is now employed by most phylogeny computer programs but unfortunately has yet to be described in a formal published description. Other descriptions by me and by Gary Olsen can be accessed using the Web at:

http://evolution.gs.washington.edu/phylip/newicktree.html

The Options and How To Invoke Them

Most of the programs allow various options that alter the amount of information the program is provided or what is done with the information. Options are selected in the menu.

Common options in the menu

A number of the options from the menu, the U (User tree), G (Global), J (Jumble), O (Outgroup), W (Weights), T (Threshold), M (Multiple data sets), and the tree output options, are used so widely that it is best to discuss them in this document.

The U (User tree) option. This option toggles between the default setting, which allows the program to search for the best tree, and the User tree setting, which reads a tree or trees ("user trees") from the input tree file and evaluates them. The input tree file's default name is intree. The file lists the trees one after another; in many cases the programs will also tolerate having the trees be preceded by a line giving the number of trees. A file with three user trees might look like this:

((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));

((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));

((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));

An initial line with the number of trees was formerly required, but this now can be omitted. Some programs require rooted trees, some unrooted trees, and some can handle multifurcating trees. You should read the documentation for the particular program to find out which it requires. Program Retree can be used to convert trees among these forms (on saving a tree from Retree, you are asked whether you want it to be rooted or unrooted).

In using the user tree option, check the pattern of parentheses carefully. The programs do not always detect whether the tree makes sense, and if it does not there will probably be a crash (hopefully, but not inevitably, with an error message indicating the nature of the problem). Trees written out by programs are typically in the proper form.
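Since the programs may not catch a malformed tree, it can be worth checking the parentheses before a run. The following Python sketch (a hypothetical helper, not part of the package) reports only the two most common slips, an unbalanced parenthesis and a missing semicolon; it does not check that the species names match the data.

```python
def check_tree_string(tree):
    """Return a list of problems found in one nested-parenthesis tree."""
    problems = []
    text = "".join(tree.split())          # ignore blanks and line breaks
    if not text.endswith(";"):
        problems.append("tree does not end with a semicolon")
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:                 # a ")" with no matching "("
                problems.append("unmatched ')' at character %d" % i)
                break
    if depth > 0:
        problems.append("%d unclosed '('" % depth)
    return problems


good = "((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));"
bad = "((Alligator,Bear),((Cow,(Dog,Elephant),Ferret));"
print(check_tree_string(good))
print(check_tree_string(bad))
```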

The G (Global) option. In the programs which construct trees (except for Neighbor, the "...penny" programs and Clique, and of course the "...move" programs where you construct the trees yourself), after all species have been added to the tree a rearrangements phase ensues. In most of these programs the rearrangements are automatically global, which in this case means that subtrees will be removed from the tree and put back on in all possible ways so as to have a better chance of finding a better tree. Since this can be time consuming (it roughly triples the time taken for a run) it is left as an option in some of the programs, specifically Contml, Fitch, and Dnaml. In these programs the G menu option toggles between the default of local rearrangement and global rearrangement. The rearrangements are explained more below.

The J (Jumble) option. In most of the tree construction programs (except for the "...penny" programs and Clique), the exact details of the search of different trees depend on the order of input of species. In these programs the J option enables you to tell the program to use a random number generator to choose the input order of species. This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. The seed should be an integer between 1 and 2^32-3 (which is 4,294,967,293), and should be of form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number (for example, in the upper limit given above, the last two digits are 93, which is of form 4n+1). Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the programs one can look for other, and better trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run).

Some people have asked what are good values of the random number seed. The random number seed is used to start a process of choosing "random" (actually pseudorandom) numbers, which behave as if they were unpredictably randomly chosen between 0 and 2^32-1 (which is 4,294,967,295). You could put in the number 133 and find that the next random number was 221,381,825. As they are effectively unpredictable, there is no such thing as a choice that is better than any other, provided that the numbers are of the form 4n+1. However if you re-use a random number seed, the sequence of random numbers that result will be the same as before, resulting in exactly the same series of choices, which may not be what you want.
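The recommendation for seeds (in range, and of the form 4n+1) is easy to check mechanically. The following Python sketch uses hypothetical helper names and is not how the programs themselves test the seed; as noted above, they only insist that it be odd.

```python
MAX_SEED = 2**32 - 3      # upper limit quoted in the text: 4,294,967,293


def is_recommended_seed(seed):
    """True if the seed is in range and leaves remainder 1 when divided by 4."""
    return 1 <= seed <= MAX_SEED and seed % 4 == 1


def next_recommended_seed(seed):
    """Smallest value >= seed that is of the form 4n+1."""
    return seed + (1 - seed) % 4


for candidate in (133, 134, 4294967293):
    print(candidate, is_recommended_seed(candidate), next_recommended_seed(candidate))
```

Because 100 is divisible by 4, the remainder of the whole number equals the remainder of its last two digits, which is why looking at those two digits is enough.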

The O (Outgroup) option. This specifies which species is to have the root of the tree be on the line leading to it. For example, if the outgroup is a species "Mouse" then the root of the tree will be placed in the middle of the branch which is connected to this species, with Mouse branching off on one side of the root and the lineage leading to the rest of the tree on the other. This option is toggled on and off by choosing O in the menu (the alphabetic character O, not the digit 0). When it is on, the program will then prompt for the number of the outgroup (the species being taken in the numerical order that they occur in the input file). Responding by typing 6 and then an Enter character indicates that the sixth species in the data (the 6th in the first set of data if there are multiple data sets) is taken as the outgroup. Outgroup-rooting will not be attempted if the data have already established a root for the tree from some other consideration, and may not be if it is a user-defined tree, despite your invoking the option. Thus programs such as Dollop that produce only rooted trees do not allow the Outgroup option. It is also not available in Kitsch, Dnamlk, or Clique. When it is used, the tree as printed out is still listed as being an unrooted tree, though the outgroup is connected to the bottommost node so that it is easy to visually convert the tree into rooted form.

The T (Threshold) option. This sets a threshold for the parsimony programs such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed (in which case the steps will simply be counted). The T menu option toggles on and off asking the user to supply a threshold. The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described in my 1981b paper. When the T option is in force, the program will prompt for the numerical threshold value. This will be a positive real number greater than 1. In programs Mix, Move, Penny, Protpars, Dnapars, Dnamove, and Dnapenny, do not use threshold values less than or equal to 1.0, as they have no meaning and lead to a tree which depends only on considerations such as the input order of species and not at all on the character state data! In programs Dollop, Dolmove, and Dolpenny the threshold should never be 0.0 or less, for the same reason. The T option is an important and underutilized one: it is, for example, the only way in this package (except for program Dnacomp) to do a compatibility analysis when there are missing data. It is a method of de-weighting characters that evolve rapidly. I wish more people were aware of its properties.
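The counting rule itself is short: each character contributes its number of steps or the threshold, whichever is smaller. A Python sketch of the rule as described (the function name and the step counts are made up for illustration; this is not code from the programs):

```python
def thresholded_total(steps_per_character, threshold):
    """Sum the steps over characters, capping each character at the threshold."""
    return sum(min(steps, threshold) for steps in steps_per_character)


steps = [1, 2, 5, 1, 8]                     # hypothetical step counts
print(thresholded_total(steps, 3.0))        # characters with 5 and 8 steps count as 3
print(thresholded_total(steps, 1000.0))     # a very high threshold: plain parsimony
```

Rapidly evolving characters (here the ones needing 5 and 8 steps) cannot contribute more than the threshold, which is the sense in which the option de-weights them.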

The M (Multiple data sets) option. In menu programs there is an M menu option which allows one to toggle on the multiple data sets option. The program will ask you how many data sets it should expect. The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:

    5    6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
    5    6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT

The main use of this option will be to allow all of the methods in these programs to be bootstrapped. Using the program Seqboot one can take any DNA, protein, restriction sites, gene frequency or binary character data set and make multiple data sets by bootstrapping. Trees can be produced for all of these using the M option. They will be written on the tree output file if that option is left in force. Then the program Consense can be used with that tree file as its input file. The result is a majority rule consensus tree which can be used to make confidence intervals. The present version of the package allows, with the use of Seqboot and Consense and the M option, bootstrapping of many of the methods in the package.

Programs Dnaml, Dnapars and Pars can also take multiple weights instead of multiple data sets. They can then do bootstrapping by reading in one data set, together with a file of weights that show how the characters (or sites) are reweighted in each bootstrap sample. Thus a site that is omitted in a bootstrap sample has effectively been given weight 0, while a site that has been duplicated has effectively been given weight 2. Seqboot has a menu selection to produce the file of weights information automatically, instead of producing a file of multiple data sets. It can be renamed and used as the input weights file.
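The equivalence between a bootstrap sample and a set of weights can be seen in a few lines of Python. This is only a sketch of the idea, with a hypothetical function name; it does not reproduce Seqboot's random number generator or its weights file layout.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement; return how often each site was drawn."""
    draws = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [draws[site] for site in range(n_sites)]


rng = random.Random(133)            # fixed seed so the run can be repeated
weights = bootstrap_weights(10, rng)
print(weights)                      # 0 = site omitted, 2 = site drawn twice, ...
```

The weights always add up to the number of sites, since every draw lands on exactly one site.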

The W (Weights) option. This signals the program that, in addition to the data set, you want to read in a series of weights that tell how many times each character is to be counted. If the weight for a character is zero (0) then that character is in effect to be omitted when the tree is evaluated. If it is (1) the character is to be counted once. Some programs allow weights greater than 1 as well. These have the effect that the character is counted as if it were present that many times, so that a weight of 4 means that the character is counted 4 times. The values 0-9 give weights 0 through 9, and the values A-Z give weights 10 through 35. By use of the weights we can give overwhelming weight to some characters, and drop others from the analysis. In the molecular sequence programs only two values of the weights, 0 or 1 are allowed.

The weights are used to analyze subsets of the characters, and also can be used for resampling of the data as in bootstrap and jackknife resampling. For those programs that allow weights to be greater than 1, they can also be used to emphasize information from some characters more strongly than others. Of course, you must have some rationale for doing this.

The weights are provided as a sequence of digits. Thus they might be

10011111100010100011110001100

The weights are to be provided in an input file whose default name is weights. The weights in it are a simple string of digits. Blanks in the weightfile are skipped over and ignored, and the weights can continue to a new line. In programs such as Seqboot that can also output a file of weights, the input weights have a default file name of inweights, and the output file name has a default file name of outweights.
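Reading such a weights string is straightforward. The Python sketch below follows the description above (digits for 0 through 9, letters A through Z for 10 through 35, blanks and line breaks ignored). The function name is hypothetical, and rejecting every other character is an assumption of the sketch, not a statement about what the programs do.

```python
import string

SYMBOLS = string.digits + string.ascii_uppercase    # "0".."9" then "A".."Z"


def decode_weights(text):
    """Turn a weights string into a list of integer weights."""
    weights = []
    for ch in text:
        if ch.isspace():
            continue                                 # blanks are skipped
        value = SYMBOLS.find(ch)
        if value < 0:
            raise ValueError("unexpected weight symbol %r" % ch)
        weights.append(value)
    return weights


print(decode_weights("10011111100010100011110001100"))
print(decode_weights("0 9A\nZ"))
```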

Weights can be used to analyze different subsets of characters (by weighting the rest as zero). Alternatively, in the discrete characters programs they can be used to force a certain group to appear on the phylogeny (in effect confining consideration to only phylogenies containing that group). This is done by adding an imaginary character that has 1's for the members of the group, and 0's for all the other species. That imaginary character is then given the highest weight possible: the result will be that any phylogeny that does not contain that group will be penalized by such a heavy amount that it will not (except in the most unusual circumstances) be considered. Of course, the new character brings extra steps to the tree, but the number of these can be calculated in advance and subtracted out of the total when reporting the results. This use of weights is an important one, and one sadly ignored by many users who could profit from it. In the case of molecular sequences we cannot use weights this way, so that to force a given group to appear we have to add a large extra segment of sites to the molecule, with (say) A's for that group and C's for every other species.

The option to write out the trees into a tree file. This specifies that you want the program to write out the tree not only on its usual output, but also onto a file in nested-parenthesis notation (as described above). This option is sufficiently useful that it is turned on by default in all programs that allow it. You can optionally turn it off if you wish, by typing the appropriate number from the menu (it varies from program to program). This option is useful for creating tree files that can be directly read into the programs, including the consensus tree and tree distance programs, and the tree plotting programs.

The output tree file has a default name of outtree.

The 0 (terminal type) option. (This is the digit 0, not the alphabetic character O). The program will default to one particular assumption about your terminal (ANSI in the case of Linux, Unix, or Mac OS X, and IBM PC in the case of Windows). You can alternatively select it to be either an IBM PC, or nothing. This affects the ability of the programs to clear the screen when they display their menus, and the graphics characters used to display trees in the programs Dnamove, Move, Dolmove, and Retree. In the case of Windows, the screen will clear properly with either the IBM PC or the ANSI settings, but the graphics characters needed by Move, Dnamove, Dolmove, or Retree will display correctly only with the IBM PC setting.

The Algorithm for Constructing Trees

All of the programs except Factor, Dnadist, Gendist, Dnainvar, Seqboot, Contrast, Retree, and the plotting and consensus tree programs act to construct an estimate of a phylogeny. Move, Dolmove, and Dnamove let you construct it yourself by hand. All of the rest but Neighbor, the "...penny" programs and Clique make use of a common approach involving additions and rearrangements. They are trying to minimize or maximize some quantity over the space of all possible evolutionary trees. Each program contains a part that, given the topology of the tree, evaluates the quantity that is being minimized or maximized. The straightforward approach would be to evaluate all possible tree topologies one after another and pick the one which, according to the criterion being used, is best. This would not be possible for more than a small number of species, since the number of possible tree topologies is enormous. A review of the literature on the counting of evolutionary trees will be found in one of my papers (Felsenstein, 1978a) and in my book (Felsenstein, 2004, chapter 3).

Since we cannot search all topologies, these programs are not guaranteed to always find the best tree, although they seem to do quite well in practice. The strategy they employ is as follows: the species are taken in the order in which they appear in the input file. The first two (in some programs the first three) are taken and a tree constructed containing only those. There is only one possible topology for this tree. Then the next species is taken, and we consider where it might be added to the tree. If the initial tree is (say) a rooted tree with two species and we want the resulting three-species tree to be a bifurcating tree, there are only three places where we could add the third species. Each of these is tried, and each time the resulting tree is evaluated according to the criterion. The best one is chosen to be the basis for further operations. Now we consider adding the fourth species, again at each of the five possible places that would result in a bifurcating tree. Again, the best of these is accepted. This is usually known as the Sequential Addition strategy.
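The addition strategy can be sketched in Python with rooted trees written as nested pairs. This is a toy illustration under stated assumptions, not the code of any PHYLIP program: the function names are hypothetical, the criterion is a bare count of parsimony steps by Fitch's set method, the data are the first small data set shown under the M option, and the rearrangement phase that the real programs carry out after each addition is left out.

```python
def insertions(tree, new_tip):
    """Yield every rooted tree made by attaching new_tip to one branch of tree."""
    yield (tree, new_tip)        # on the branch leading to this node (or at the root)
    if isinstance(tree, tuple):
        left, right = tree
        for grown in insertions(left, new_tip):
            yield (grown, right)
        for grown in insertions(right, new_tip):
            yield (left, grown)


def fitch_steps(tree, data):
    """Count parsimony steps over all sites with Fitch's set method."""
    total = 0
    for site in range(len(next(iter(data.values())))):
        def walk(node):
            nonlocal total
            if not isinstance(node, tuple):
                return {data[node][site]}
            a, b = walk(node[0]), walk(node[1])
            if a & b:
                return a & b
            total += 1           # the two subtrees share no state: one change
            return a | b
        walk(tree)
    return total


def sequential_addition(order, score):
    """Add species in the given order, keeping the best placement each time."""
    tree = (order[0], order[1])
    for tip in order[2:]:
        tree = min(insertions(tree, tip), key=score)   # first of any ties is kept
    return tree


data = {"Alpha": "CCACCA", "Beta": "CCAAAA", "Gamma": "CAACCA",
        "Delta": "AACAAC", "Epsilon": "AACCCA"}
order = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"]
best = sequential_addition(order, lambda t: fitch_steps(t, data))
print(best, fitch_steps(best, data))
```

The generator yields three trees when the third species is added and five when the fourth is added, matching the counts given above. Changing the list order plays the role of changing the input order of species.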

Local rearrangements

The process continues in this manner, with one important exception. After each species is added, and before the next is added, a number of rearrangements of the tree are tried, in an effort to improve it. The algorithms move through the tree, making all possible local rearrangements of the tree. A local rearrangement involves an internal segment of the tree in the following manner. Each internal segment of the tree is of this form (where T1, T2, and T3 are subtrees - parts of the tree that can contain further forks and tips):

T1      T2       T3
 \      /        /
  \    /        /
   \  /        /
    \/        /
     *       /
      *     /
       *   /
        * /
         *
         !
         !

the segment we are discussing being indicated by the asterisks. A local rearrangement consists of switching the subtrees T1 and T3 or T2 and T3, so as to obtain one of the following:

T3      T2       T1        T1      T3       T2
 \      /        /          \      /        /
  \    /        /            \    /        /
   \  /        /              \  /        /
    \/        /                \/        /
     \       /                  \       /
      \     /                    \     /
       \   /                      \   /
        \ /                        \ /
         !                          !
         !                          !
         !                          !

Each time a local rearrangement is successful in finding a better tree, the new arrangement is accepted. The phase of local rearrangements does not end until the program can traverse the entire tree, attempting local rearrangements, without finding any that improve the tree.

This strategy of adding species and making local rearrangements will look at about (n-1)x(2n-3) different topologies, though if rearrangements are frequently successful the number may be larger. I have been describing the strategy when rooted trees are being considered. For unrooted trees there is a precisely similar strategy, though the first tree constructed may be a three-species tree and the rearrangements may not start until after the addition of the fifth species.
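To see how small this is compared with an exhaustive search, one can set the estimate (n-1)x(2n-3) beside the total number of rooted bifurcating topologies, which by the standard counting result is 3 x 5 x 7 x ... x (2n-3). A short Python comparison (function names are hypothetical):

```python
def rooted_topologies(n):
    """Number of rooted bifurcating topologies for n labelled species."""
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 3       # the k-th species can go in 2k-3 places
    return count


def heuristic_estimate(n):
    """Approximate number of topologies examined, as quoted in the text."""
    return (n - 1) * (2 * n - 3)


for n in (5, 10, 20):
    print(n, heuristic_estimate(n), rooted_topologies(n))
```

For ten species the estimate is 153 topologies, against 34,459,425 possible rooted trees.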

These local rearrangements have come to be called Nearest Neighbor Interchanges (NNIs) in the phylogeny literature.
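With a rooted tree written as nested pairs, the two rearrangements around each internal segment can be generated mechanically. The following Python sketch uses a hypothetical function name and is not code from the package; it lists every tree that is one local rearrangement away from a given rooted bifurcating tree.

```python
def nni_neighbours(tree):
    """Yield every tree one local rearrangement away from a nested-pair tree."""
    if not isinstance(tree, tuple):
        return
    left, right = tree
    if isinstance(left, tuple):          # segment ((T1,T2),T3) with T3 = right
        t1, t2 = left
        yield ((right, t2), t1)          # switch T1 and T3
        yield ((t1, right), t2)          # switch T2 and T3
    if isinstance(right, tuple):         # mirror image: (T3,(T1,T2)) with T3 = left
        t1, t2 = right
        yield (t1, (left, t2))           # switch T1 and T3
        yield (t2, (t1, left))           # switch T2 and T3
    for changed in nni_neighbours(left):     # segments deeper in the tree
        yield (changed, right)
    for changed in nni_neighbours(right):
        yield (left, changed)


for neighbour in nni_neighbours((("A", "B"), ("C", "D"))):
    print(neighbour)
```

A rooted tree of n species has n-2 internal segments, so this yields 2(n-2) neighbours; a search phase would evaluate each and accept any that improves the criterion.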

Though we are not guaranteed to have found the best tree topology, we are guaranteed that no nearby topology (i. e. none accessible by a single local rearrangement) is better. In this sense we have reached a local optimum of our criterion. Note that the whole process is dependent on the order in which the species are present in the input file. We can try to find a different and better solution by reordering the species in the input file and running the program again (or, more easily, by using the J option). If none of these attempts finds a better solution, then we have some indication that we may have found the best topology, though we can never be certain of this.

Note also that a new topology is never accepted unless it is better than the previous one, so that the rearrangement process can never fall into an endless loop. This is also the way ties in our criterion are resolved, namely by sticking with the tree found first. However, the tree construction programs other than Clique, Contml, Fitch, and Dnaml do keep a record of all trees found that are tied with the best one found. This gives you some immediate idea of which parts of the tree can be altered without affecting the quality of the result.

Global rearrangements

A feature of most of the programs, such as Protpars, Dnapars, Dnacomp, Dnaml, Dnamlk, Restml, Kitsch, Fitch, Contml, Mix, and Dollop, is "global" optimization of the tree. In four of these (Contml, Fitch, Dnaml and Dnamlk) this is an option, G. In the others it automatically applies. When it is present there is an additional stage to the search for the best tree. Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more species gets "stuck" in a suboptimal region of the space of all possible trees. The use of global optimization results in approximately a tripling (3 x) of the run-time, which is why I have left it as an option in some of the slower programs.

What PHYLIP calls "global" rearrangements are more properly called SPR (subtree pruning and regrafting) by Swofford et al. (1996) as distinct from the NNI (nearest neighbor interchange) rearrangements that PHYLIP also uses, and the TBR (tree bisection and reconnection) rearrangements that it does not use. My book (Felsenstein, 2004, chapter 4) contains a review of work on these and other rearrangements and search methods.

The programs doing global optimization print out a dot "." after each group is removed and re-added to the tree, to give the user some sign that the rearrangements are proceeding. A new line of dots is started whenever a new round of global rearrangements is started following an improvement in the tree. On the line before the dots are printed there is printed a bar of the form "!---------------!" to show how many dots to expect. The dots will not be printed out at a uniform rate, but the later dots, which represent removal of larger groups from the tree and trying them consequently in fewer places, will print out more quickly. With some compilers each row of dots may not be printed out until it is complete.

It should be noted that Penny, Dolpenny, Dnapenny and Clique use a more sophisticated strategy of "depth-first search" with a "branch and bound" search method that guarantees that all of the best trees will be found. In the case of Penny, Dolpenny and Dnapenny there can be a considerable sacrifice of computer time if the number of species is greater than about ten: it is a matter for you to consider whether it is worth it for you to guarantee finding all the most parsimonious trees, and that depends on how much free computer time you have! Clique finds all largest cliques, and does so without undue burning of computer time. Although all of these problems that have been investigated fall into the category of "NP-hard" problems that in effect do not have a rapid solution, the cases that cause this trouble for the largest-cliques algorithm in Clique apparently are not biologically realistic and do not occur in actual data.

Multiple jumbles

As just mentioned, for most of these programs the search depends on the order in which the species are entered into the tree. Using the J (Jumble) option you can supply a random number seed which will allow the program to add the species in a random order. Jumbling can be done multiple times. For example, if you tell the program to do it 10 times, it will go through the tree-building process 10 times, each with a different random order of adding species. It will keep a record of the trees tied for best over the whole process. In other words, it does not just record the best trees from each of the 10 runs, but records the best ones overall. Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees. In the terminology of Maddison (1991) it can find different "islands" of trees. The present algorithms do not guarantee us to find all trees in a given "island" from a single run, so multiple runs also help explore those "islands" that are found.

Saving multiple tied trees

For the parsimony and compatibility programs, one can have a perfect tie between two or more trees. In these programs these trees are all saved. For the newer parsimony programs such as Dnapars and Pars, global rearrangement is carried out on all of these tied trees. This can be turned off in the menu.

For trees with criteria which are real numbers, such as the distance matrix programs Fitch and Kitsch, and the likelihood programs Dnaml, Dnamlk, Contml, and Restml, it is difficult to get an exact tie between trees. Consequently these programs save only the single best tree (even though the others may be only a tiny bit worse).

Strategy for finding the best tree

In practice, it is advisable to use the Jumble option and to specify that it be done many times, so that many different orderings of the input species are evaluated. (This is usually not necessary when bootstrapping, though the programs will then default to doing it once to avoid artifacts caused by the order in which species are added to the tree.)

People who want a magic "black box" program whose results they do not have to question (or think about) often are upset that these programs give results that are dependent on the order in which the species are entered in the data. To me this property is an advantage, for it permits you to try different searches for better trees, simply by varying the input order of species. If you do not use the multiple Jumble option, but do multiple individual runs instead, you can easily decide which to pay most attention to - the one or ones that are best according to the criterion employed (for example, with parsimony, the one out of the runs that results in the tree with the fewest changes).

In practice, in a single run, it usually seems best to put species that are likely to be sources of confusion in the topology last, as by the time they are added the arrangement of the earlier species will have stabilized into a good configuration, and then the last few species will be fitted into that topology. There will be less chance this way of a poor initial topology that would affect all subsequent parts of the search. However, a variety of arrangements of the input order of species should be tried, as can be done if the J option is used, and no species should be kept in a fixed place in the order of input. Note that the results of the "...penny" programs and Clique are not sensitive to the input order of species, and Neighbor is only slightly sensitive to it, so that multiple Jumbling is not possible with those programs. Note also that with global search, which is standard in many programs and in others is an option, each group (including each individual species) will be removed and re-added in all possible positions, so that a species causing confusion will have more chance of moving to a new location than it would without global rearrangement.

A Warning on Interpreting Results

Probably the most important thing to keep in mind while running any of the parsimony or compatibility programs is not to overinterpret the result. Some users treat the set of most parsimonious trees as if it were a confidence interval. If a group appears in all of the most parsimonious trees then they treat it as well established. Unfortunately the confidence interval on phylogenies appears to be much larger than the set of all most parsimonious trees (Felsenstein, 1985b). Likewise, variation of result among different methods will not be a good indicator of the size of the confidence interval. Consider a simple data set in which, out of 100 binary characters, 51 recommend the unrooted tree ((A,B),(C,D)) and 49 the tree ((A,D),(B,C)). Many different methods will all give the same result on such a data set: they will estimate the tree as ((A,B),(C,D)). Nevertheless it is clear that the 51:49 margin by which this tree is favored is not statistically significantly different from 50:50. So consistency among different methods is a poor guide to statistical significance.
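The 51:49 example can be checked with a short calculation. The sketch below is an illustration that treats the 100 characters as independent coin tosses (a simple sign test, which is an assumption made here and not a method from any PHYLIP program) and computes the exact two-sided probability of a split at least as uneven as 51:49.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    # With p = 0.5 the distribution is symmetric, so outcomes "at least as
    # extreme" are those at least as far from n/2 as the observed k.
    dist = abs(k - n / 2)
    total = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dist)
    return total / 2 ** n

p = two_sided_binomial_p(51, 100)
print(round(p, 3))
```

The value comes out at roughly 0.92: under a 50:50 null hypothesis almost every data set is at least this uneven, so the 51:49 margin carries essentially no evidence, whichever method reports the tree.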

General Comments on Adapting the Package to Different Computer Systems

In the sections following you will find instructions on how to adapt the programs to different computers and compilers. The programs should compile without alteration on most versions of C. They use the "malloc" library or "calloc" function to allocate memory, so that the upper limits on how many species, sites, or characters they can handle are set by the system memory available to that memory-allocation function.

In the document file for each program, I have supplied a small input example, and the output it produces, to help you check whether the programs are running properly.

Compiling the programs

If you have not been able to get executables for PHYLIP, you should be able to make your own. This can be easy under Linux and Unix, but more difficult if you have a Macintosh or a Windows system. If you have the latter, we strongly recommend you download and use the Macintosh and Windows executables that we distribute. If you do that, you will not need to have any compiler or to do any compiling. I get a certain number of inquiries each year from confused users who are not sure what a compiler is but think they need one. After downloading the executables they contact me and complain that they did not find a compiler included in the package, and would I please e-mail them the compiler. What they really need to do is use the executables and forget about compiling them.

Some users may also need to compile the programs in order to modify them. The instructions below will help with this.

I will discuss how to compile PHYLIP using one of a number of widely-used compilers. After these I will comment on compiling PHYLIP on other, less widely-used systems.

Unix and Linux

For Unix and Linux (which is Unix in all important functional respects, if not in all legal respects) you must compile PHYLIP yourself. This is usually easy to do. Unix (and Linux) systems generally have a C compiler and have the make utility. We distribute with the PHYLIP source code a Unix-compatible Makefile. We use GNU's make utility, which might be installed on your system as "make" or as "gmake".

However, note that some popular Linux distributions do not include a C compiler in their default configuration. For example, in RedHat Linux version 8, the "Personal Workstation" installation that is the default does not include the C compiler or the X Windows libraries needed to compile PHYLIP. These are available, and can be loaded from the CDROMs in the distribution. The following instructions assume that you have the C compiler and X libraries. If you cannot easily configure your system to include them, you should look into using the RedHat RPM binary distribution, mentioned on the PHYLIP 3.6 web page.

As is mentioned below (under Macintoshes) the Mac OS X operating system is a Unix, and if the X windows windowing system is installed, these Unix instructions will work for it.

After you have finished unpacking the Documentation and Source Code archive, you will find that you have created a folder phylip-3.68 in which there are three folders, called exe, src, and doc. There is also an HTML web page, phylip.html. The exe folder will be empty. The src folder contains the source code files, including the Makefile. The doc folder contains the documentation files.

Enter the src folder. Before you compile, you will want to look at the Makefile and see whether you want to alter the compilation command. We have the default C compiler flags set with no flags. If you have modified the programs, you might want to use the debugging flags "-g". On the other hand, if you are trying to make a fast executable using the GCC compiler, you may want to use the one which is "An optimized one for gcc". In either case, remove the "#" before that CFLAGS command, and place it before the CFLAGS command that was previously in use. There are careful instructions on this in the Makefile. Once you have set up the CFLAGS and DFLAGS statements to be the way you want, to compile all the programs just type:

make install

You will then see the compiling commands as they happen, with occasional warning messages. If these are warnings, rather than errors, they are not too serious. A typical warning would be like this:

dnaml.c:1204: warning: static declaration for re_move follows non-static

After a time the compiler will finish compiling. If you have done a make install the system will then move the executables into the exe folder and also save space by erasing all the relocatable object files that were produced in the process. You should be left with useable executables in the exe folder, and the src folder should be as before. To run the executables, go into the exe folder and type the program name (say dnaml, which you may or may not have to precede by a dot and a slash, ./). The names of the executables will be the same as the names of the C programs, but without the .c suffix. Thus dnaml.c compiles to make an executable called dnaml.

Our two tree-drawing programs, Drawgram and Drawtree, require an X Windows installation including the Athena Widgets. These are provided with most X Windows installations.

If you see messages that the compilation could not find "Xlib.h" and other, similar header files, this means that some part of the X Windows development environment is not installed on your system, or is not installed in the default location. Similarly, if you get error messages saying that some files with "Xaw" in the name cannot be found, this means that the Athena Widgets are not installed on your system, or are not installed in the default location.

In either case, you will need to make sure that they are installed properly. If they are there but not found during the compile, change the DFLAGS and DLIBS variables in the Makefile to point to the locations of the header files and libraries, respectively.

Another possible problem is that the usual Linux C compiler is the Gnu GCC compiler. In some Linux systems it is not invoked by the command cc but by gcc. You would then need to edit the Makefile to reflect this (see below for comments on that process).

A typical Unix or Linux installation would put the directory phylip-3.68 in /usr/local. The name of the executables directory EXEDIR could be changed to be /usr/local/bin, so that the make install command puts the executables there. If the users have /usr/local/bin in their paths, the programs would be found when their names are typed. The font files font1 through font6 could also be placed there. A batch script containing the lines

ln -s /usr/local/bin/font1 font1

ln -s /usr/local/bin/font2 font2

ln -s /usr/local/bin/font3 font3

ln -s /usr/local/bin/font4 font4

ln -s /usr/local/bin/font5 font5

ln -s /usr/local/bin/font6 font6

could be used to establish links in the user's working directory so that Drawtree and Drawgram would find these font files when users type a name such as font1 when the program asks them for a font file name. The documentation web pages are in subdirectory doc of the main PHYLIP directory, except for one, phylip.html which is in the main PHYLIP directory. It has a table of all of the documentation pages, including this one. If users create a bookmark to that page it can be used to access all of the other documentation pages.
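For users who would rather not type the six ln -s lines, the same links can be made with a few lines of Python. This is a sketch under the assumption made in the text above, namely that the font files were placed in /usr/local/bin; the function name link_fonts is made up for this illustration.

```python
import os

def link_fonts(font_dir, work_dir, names=None):
    """Create symbolic links in work_dir pointing at the font files in font_dir."""
    if names is None:
        names = [f"font{i}" for i in range(1, 7)]   # font1 through font6
    made = []
    for name in names:
        target = os.path.join(font_dir, name)
        link = os.path.join(work_dir, name)
        if not os.path.lexists(link):               # leave existing files or links alone
            os.symlink(target, link)                # same effect as: ln -s target link
            made.append(link)
    return made

# Example call, assuming the fonts were installed in /usr/local/bin:
# link_fonts("/usr/local/bin", os.getcwd())
```

Because existing names are skipped, the function can be run again in the same working directory without raising an error.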

To compile just one program, such as Dnaml, type:

make dnaml

After this compilation, dnaml will be in the src subdirectory. So will some relocatable object code files that were used to create the executable. These have names ending in .o - they can safely be deleted.

If you have problems with the compilation command, you can edit the Makefile. It has careful explanations at its front of how you might want to do so. For example, you might want to change the C compiler name cc to the name of the Gnu C compiler, gcc. This can be done by removing the comment character # from the front of one line, and placing it at the front of a nearby line. How to do so should be clear from the material at the beginning of the Makefile. We have included sample lines for using the gcc compiler and for using the Cygwin Gnu C++ environment on Windows, as well as the default of cc.

We have encountered some problems with the Gnu C Compiler (gcc) on 64-bit Itanium processors, in our code for generating random numbers, when it is compiled with the -O3 optimization level.

Some older C compilers (notably the Berkeley C compiler which is included free with some Sun systems) do not adhere to the ANSI C standard (because they were written before it was set down). They have trouble with the function prototypes which are in our programs. We have included an #ifndef preprocessor command to eliminate the problem, if you use the switch -DOLDC when compiling. Thus with these compilers you need only use this in your C flags (in the Makefile) and compilers such as Berkeley C will cause no trouble.

Parallel computers

As parallel computers become more common, the issue of how to compile PHYLIP for them has become more pressing. People have been compiling PHYLIP for vector machines and parallel machines for many years. We have not made a version for parallel machines because there is still no standard parallel programming environment on such machines (or rather, there are many standards, so that one cannot find one that makes a parallel execution version of PHYLIP widely distributable). However symmetric multiprocessing using the MPI Message Passing Interface is spreading rapidly, and we will probably support it in future versions of PHYLIP.

Although the underlying algorithms of most programs, which treat sites independently, should be amenable to vector and parallel processors, there are details of the code which might best be changed. In certain of the programs (Dnaml, Dnamlk, Proml, Promlk) I have put a special comment statement next to the loops in the program where the program will spend most of its time, and which are the places most likely to benefit from parallelization. This comment statement is:

/* parallelize here */

In particular, within these innermost loops of the programs there are often scalar quantities that are used for temporary bookkeeping. These quantities (such as sum1, sum2, zz, z1, yy, y1, aa, bb, cc, sum, and denom in procedure makenewv of Dnaml, and similar quantities in procedure nuview) are there to minimize the number of array references. For vectorizing and parallelizing compilers it will be better to replace them by arrays so that processing can occur simultaneously.

If you succeed in making a parallel version of PHYLIP we would like to know how you did it. In particular, if you can prepare a web page which describes how to do it for your computer system, we would like to use material from it in our PHYLIP web pages. Please e-mail it to me. We hope to have a set of pages that give detailed instructions on how to make parallel versions of PHYLIP on various kinds of machines. Alternatively, if we were given your modified version of the program we might be able to figure out how to make modifications to our source code to allow users to compile the program in a way which makes those modifications.

Other computer systems

As you can see from the variety of different systems on which these programs have been successfully run, there are no serious incompatibility problems with most computer systems. PHYLIP in various past Pascal versions has also been compiled on 8080 and Z80 CP/M Systems, Apple II systems running UCSD Pascal, a variety of minicomputer systems such as DEC PDP-11's and HP 1000's, on 1970's era mainframes such as CDC Cyber systems, and so on. In a later era it was also compiled on IBM 370 mainframes, and of course on DOS and Windows systems and on Macintosh systems. We have gradually accumulated experience on a wider variety of C compilers. If you succeed in compiling the C version of PHYLIP on a different machine or a different compiler, I would like to hear the details so that I can consider including the instructions in a future version of this manual.

Frequently Asked Questions

This set of Frequently Asked Questions, and their answers, is from the PHYLIP web site. A more up-to-date version can be found there, at:

http://evolution.gs.washington.edu/phylip/faq.html

Problems that are encountered

"The program reads my data file and then says it has a memory allocation error!"

This is what tends to happen if there is a problem with the format of the data file, so that the programs get confused and think they need to set aside memory for 1,000,000 species or so. The result is a "memory allocation error" (the error message may say that "the function asked for an inappropriate amount of memory"). Check the data file format against the documentation: make sure that the data files have not been saved in the format of your word processor (such as Microsoft Word) but in a "flat ASCII" or "text only" mode. Note that adding memory to your computer is not the way to solve this problem -- you probably have plenty of memory to run the program once the data file is in the correct format.
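A quick check of the kind described above can be automated. The sketch below is an illustration only: it assumes that the first line of the data file holds nothing but whole numbers (the number of species, usually followed by the number of characters or sites), and the limit max_species is an arbitrary threshold chosen for this example, not a PHYLIP constant.

```python
def check_plain_text(path, max_species=10000):
    """Return a list of problems that suggest the file is not a flat ASCII data file."""
    problems = []
    with open(path, "rb") as handle:
        data = handle.read()
    if b"\x00" in data:
        problems.append("contains NUL bytes (binary word-processor format?)")
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        problems.append("contains non-ASCII bytes")
        text = data.decode("ascii", errors="replace")
    lines = text.splitlines()
    first = lines[0].split() if lines else []
    if not first or not all(tok.isdigit() for tok in first):
        problems.append("first line is not made of whole numbers")
    elif int(first[0]) > max_species:   # arbitrary plausibility limit (assumption)
        problems.append("implausibly large number of species on first line")
    return problems
```

An empty list means only that none of these particular symptoms was found; the file should still be compared with the format described in the documentation.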

"One program makes an output file and then the next program crashes while reading it!"

Did you rename the file? If a program makes a file called outfile, and then the next program is told to use outfile as its input file, terrible things will happen. The second program first opens outfile as an output file, thus erasing it. When it then tries to read from this empty outfile a psychological crisis ensues. The solution is simply to rename outfile before trying to use it as an input file.
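When several programs are chained from a script, the renaming step can be done there too. The following is a small sketch, assuming that the next program should find its input under the name infile; the function name pass_output_along is made up for this illustration.

```python
import os

def pass_output_along(work_dir, produced="outfile", wanted="infile"):
    """Rename one program's output so the next program can read it as input."""
    src = os.path.join(work_dir, produced)
    dst = os.path.join(work_dir, wanted)
    if not os.path.exists(src):
        raise FileNotFoundError(src)
    if os.path.getsize(src) == 0:
        raise ValueError(f"{src} is empty; was it already opened for output?")
    os.replace(src, dst)     # note: silently overwrites an existing file called wanted
    return dst
```

After the call there is no longer a file called outfile in the folder, so the second program is free to create its own without destroying anything.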

"Consense gives weird branch lengths! How do I get more reasonable ones?"

Consense gives branch lengths which are simply the numbers of replicates that support the branch. This is not a good reflection of how long those branches are estimated to be. The best way to put better branch lengths on a consensus tree is to use it as a User Tree in a program that will estimate branch lengths for it, such as Dnaml. You may need to convert it to being an unrooted tree, using Retree, first. If the original program you were using was a program that does not estimate branch lengths, you may instead have to use one that does. You can use a likelihood program, or make some distances between your species (using, for example, Dnadist) and use Fitch to put branch lengths on the user tree. Here is the sequence of steps you should go through:

  1. Take the tree and use Retree to make sure it is Unrooted (just read it into Retree and then save it, specifying Unrooted).

  2. Use the unrooted tree as a User Tree (option U) in one of our programs (such as Dnaml or Fitch). If you use Fitch, you also first need to use one of the distance programs such as Dnadist to compute a set of distances to serve as its input.

  3. Specify that the branch lengths of the tree are not to be used but should be re-estimated. This is actually the default.

"I looked at the tree printed in the output file outfile and it looked weird. Do I always need to look at it in Drawgram?"

It's possible you are using the wrong font for looking at the tree in the output file. The tree is drawn with dashes and exclamation points. If a proportional font such as Times Roman or Helvetica is used, the tree lines may not connect. Try selecting the whole tree and setting the font to a fixed-width one such as Courier. You may be astounded how much clearer the tree has become.

"DrawTree (or DrawGram) doesn't work: it can't find the font file!"

Six font files, called font1 through font6, are distributed with the executables (and with the source code too). The program looks for a copy of one of them called fontfile. If you haven't made such a copy called fontfile it then asks you for the name of the font file. If they are in the current folder, just type one of font1 through font6. The reason for having the program look for fontfile is so that you can copy your favorite font file, call the copy fontfile, and then it will be found automatically without you having to type the name of the font file each time.

"Can Drawgram draw a scale beside the tree? Print the branch lengths as numbers?"

It can't do either of these. Doing so would make the program more complex, and it is not obvious how to fit the branch length numbers into a tree that has many very short internal branches. If you want these scales or numbers, choose an output plot file format (such as Postscript, PICT or PCX) that can be read by a drawing program such as Adobe Illustrator, Freehand, Canvas, CorelDraw, or MacDraw. Then you can add the scales and branch length numbers yourself by hand. Note the menu option in Drawtree and Drawgram that specifies the tree size to be a given number of centimeters per unit branch length.

"How can I get Drawgram or Drawtree to print the bootstrap values next to the branches?"

When you do bootstrapping and use Consense, it prints the bootstrap values in its output file (both in a table of sets, and on the diagram of the tree which it makes). These are also in the output tree file of Consense. There they are in place of branch lengths. So to get them to be on the output of Drawgram or Drawtree, you must write the tree in the format of a drawing program and use it to put the values in by hand, as mentioned in the answer to the previous question.

"I have an HP laser printer and can't get DrawGram to print on it"

Drawgram and Drawtree produce a plot file (called plotfile): they do not send it to the printer. It is up to you to get the plot file to the printer. If you are running Windows this can probably be done with the Command tool and the command COPY/B PLOTFILE PRN:, unless your printer is a networked printer. The /B is important. If it is omitted the copy command will strip off the highest bit of each byte, which can cause the printing to fail or produce garbage.

"Dnaml won't read the treefile that is produced by Dnapars!"

That's because the Dnapars tree file is a rooted tree, and Dnaml wants an unrooted tree. Try using Retree to change the file to be an unrooted tree file. Our most recent versions of the programs usually automatically convert a rooted tree into an unrooted one as needed. But the programs such as Dnamlk or Dollop that need a rooted tree won't be able to use an unrooted tree.

"In bootstrapping, Seqboot makes too large a file"

If there are 1000 bootstrap replicates, it will make a file 1000 times as long as your original data set. But for many methods there is another way that uses much less file space. You can use Seqboot to make a file of multiple sets of weights, and use those together with the original data set to do bootstrapping.
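The idea behind the weights can be sketched as follows. This is an illustration of the principle only and does not reproduce Seqboot's output format: each replicate draws as many sites as the data set has, with replacement, and records how many times each site was drawn.

```python
import random
from collections import Counter

def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement; return how often each site was drawn."""
    draws = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [draws[i] for i in range(n_sites)]   # weight 0: the site was never drawn

rng = random.Random(1)                          # arbitrary seed for the example
replicates = [bootstrap_weights(20, rng) for _ in range(3)]
for weights in replicates:
    print(weights)
```

Each replicate then costs one small number per site, however many species there are, instead of a full copy of every sequence.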

"In bootstrapping, the output file gets too big."

When running a program such as Neighbor or Dnapars with multiple data sets (or multiple weights) for purposes of bootstrapping, the output file is usually not needed, as it is the output tree file that is used next. You can use the menu of the program to turn off the writing of trees into the output file. The trees will still be written into the output tree file.

"Why don't your programs correctly read the sequence alignment files produced by ClustalW?"

They do read them correctly if you make the right kind. Files from ClustalV or ClustalW whose names end in ".aln" are not in PHYLIP format, but in Clustal's own format which will not work in PHYLIP. You need to find the option to output PHYLIP format files, which ClustalW and ClustalV usually assign the extension .phy.

"Why doesn't Neighbor read my DNA sequences correctly?"

Because it wants to have as input a distance matrix, not sequences. You have to use Dnadist to make the distance matrix first.

How to make it do various things

"How do I bootstrap?"

The general method of bootstrapping involves running Seqboot to make multiple bootstrapped data sets out of your one data set, then running one of the tree-making programs with the Multiple data sets option to analyze them all, then running Consense to make a majority rule consensus tree from the resulting tree file. Read the documentation of Seqboot to get further information. Before, only parsimony methods could be bootstrapped. With this new system almost any of the tree-making methods in the package can be bootstrapped. It is somewhat more tedious but you will find it much more rewarding.
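The last step, the majority rule consensus, amounts to counting how many of the replicate trees contain each group of species. The sketch below illustrates only that counting step and is not Consense itself (which, among other things, reads trees from a tree file). Here each tree is simply assumed to be given as a set of groups, and each group as a frozenset of species names.

```python
from collections import Counter

def majority_rule_groups(trees):
    """Count each group over the replicate trees; keep those in more than half of them."""
    counts = Counter(group for tree in trees for group in tree)
    return {group: n for group, n in counts.items() if n > len(trees) / 2}

AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
replicates = [{AB, CD}, {AB, CD}, {AC}, {AB}]
kept = majority_rule_groups(replicates)
print(kept)
```

In this toy example the group of A and B occurs in three of the four trees and is kept, while the group of C and D occurs in exactly half of them and therefore does not reach a majority.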

"How do I specify a multi-species outgroup with your parsimony programs?"

It's not a feature but is not too hard to do in many of the programs. In parsimony programs like Mix, for which the W (Weights) and A (Ancestral states) options are available, and weights can be larger than 1, all you need to do is:

(a) In Mix, make up an extra character with states 0 for all the outgroups and 1 for all the ingroups. If using Dnapars the ingroup can have (say) G and the outgroup A.

(b) Assign this character an enormous weight (such as Z for 35) using the W option, all other characters getting weight 1, or whatever weight they had before.

(c) If it is available, use the A (Ancestral states) option to designate that for that new character the state found in the outgroup is the ancestral state.

(d) In Mix do not use the O (Outgroup) option.

(e) After the tree is found, the designated ingroup should have been held together by the fake character. The tree will be rooted somewhere in the outgroup (the program may or may not have a preference for one place in the outgroup over another). Make sure that you subtract from the total number of steps on the tree all steps in the new character.

In programs like Dnapars, you cannot use this method as weights of sites cannot be greater than 1. But you do an analogous trick, by adding a largish number of extra sites to the data, with one nucleotide state ("A") for the ingroup and another ("G") for the outgroup. You will then have to use Retree to manually reroot the tree in the desired place.
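Adding the extra sites by hand is tedious, so here is a sketch of how it might be scripted. It assumes the alignment is already held in a dictionary of species names and aligned sequences (reading and writing the data file is left out), and the default of 20 extra sites is an arbitrary guess at "largish", not a recommended value.

```python
def add_outgroup_sites(alignment, outgroup, n_extra=20,
                       ingroup_state="A", outgroup_state="G"):
    """Return a copy of the alignment with n_extra constant marker sites appended."""
    padded = {}
    for name, seq in alignment.items():
        state = outgroup_state if name in outgroup else ingroup_state
        padded[name] = seq + state * n_extra    # same marker state at every extra site
    return padded

example = {"Human": "ACGT", "Chimp": "ACGA", "Frog": "TCGA", "Fish": "TCGC"}
padded = add_outgroup_sites(example, outgroup={"Frog", "Fish"}, n_extra=5)
for name, seq in padded.items():
    print(name, seq)
```

Remember that the number of sites given at the top of the data file must be increased by the same amount, and that the changes in the extra sites should be subtracted from the total number of steps reported for the tree.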

"How do I force certain groups to remain monophyletic in your parsimony programs?"

By the same method as in the previous question, using multiple fake characters, any number of groups of species can be forced to be monophyletic. In Move, Dolmove, and Dnamove you can specify whatever outgroups you want without going to this trouble.

"How can I reroot one of the trees written out by PHYLIP?"

Use the program Retree. But keep in mind whether the tree inferred by the original program was already rooted, or whether you are free to reroot it without changing its meaning.

"What do I do about deletions and insertions in my sequences?"

The molecular sequence programs will accept sequences that have gaps (the "-" character). They do various things with them, mostly not optimal. Programs such as Dnaml and Dnadist count gaps as equivalent to unknown nucleotides (or unknown amino acids) on the grounds that we don't know what would be there if something were there. This completely leaves out the information from the presence or absence of the gap itself, but does not bias the gapped sequence to be close to or far from other gapped or ungapped sequences. Sequences that share a gap at a site do not tend to cluster together on the tree. So it is not necessary to remove gapped regions from your sequences, unless the presence of gaps indicates that the region is badly aligned. An exception to this is Dnapars, which counts "gap" as if it were a fifth nucleotide state (in addition to A, C, G, and T). Each site counts one change when a gap arises or disappears. The disadvantage of this treatment is that a long gap will be overweighted, with one event per gapped site. So a gap of 10 nucleotides will count as being as much evidence as 10 single site nucleotide substitutions. If there are not overlapping gaps, one way to correct this is to recode the first site in the gap as "-" but make all the others be "?" so the gap only counts as one event.
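The recoding suggested at the end of the previous paragraph is easy to do with a regular expression. This is a minimal sketch that works on one sequence at a time, so it is appropriate only in the situation described above, where the gaps do not overlap.

```python
import re

def recode_gap_runs(seq):
    """Keep the first '-' of each run of gaps and turn the rest of the run into '?'."""
    return re.sub(r"-+", lambda m: "-" + "?" * (len(m.group()) - 1), seq)

print(recode_gap_runs("AC----GT-A"))   # prints AC-???GT-A
```

The sequence keeps its length, so the alignment is undisturbed; the four-site gap now contributes one change instead of four.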

"How can I produce distances for my data set which has 0's and 1's?"

You can't do it in a simple and general way, for a straightforward reason. Distance methods must correct the distances for superimposed changes. Unless we know specifically how to do this for your particular characters, we cannot accomplish the correction. There are many formulas we could use, but we can't choose among them without much more information. There are issues of superimposed changes, as well as heterogeneity of rates of change in different characters. Thus we have not provided a distance program for 0/1 data. It is up to you to figure out what is an appropriate stochastic model for your data and to find the right distance formulas.

"I have RFLP fragment data: which programs should I use?"

This is a more difficult question than you may imagine. Here is a quick tour of the issues:

  • You can code fragments as 0 and 1 and use a parsimony program. It is not obvious in advance whether 0 or 1 is ancestral, though it is likely that change in one direction is more likely than change in the other for each fragment. One can use either Wagner parsimony (programs Pars, Mix, Penny or Move) or use Dollo parsimony (Dollop, Dolpenny or Dolmove) with the ancestral states all set as unknown ("?").

  • You can use a distance matrix method using the RFLP distance of Nei and Li (1979). Their restriction fragment distance is available in our program Restdist.

  • You should be very hesitant to bootstrap RFLP's. The individual fragments do not evolve independently: a single nucleotide substitution can eliminate one fragment and create two (or vice versa).

For restriction sites (rather than fragments) life is a bit easier: they evolve nearly independently so bootstrapping is possible and Restml can be used, as well as restriction sites distances computed in Restdist. Also directionality of change is less ambiguous when parsimony is used. A more complete tour of the issues for restriction sites and restriction fragments is given in chapter 15 of my book (Felsenstein, 2004).

"Why don't your parsimony programs print out branch lengths?"

Well, Dnapars and Pars can. The others have not yet been upgraded to the same level. The longer answer is that it is because there are problems defining the branch lengths. If you look closely at the reconstructions of the states of the hypothetical ancestral nodes for almost any data set and almost any parsimony method you will find some ambiguous states on those nodes. There is then usually an ambiguity as to which branch the change is actually on. Other parsimony programs resolve this in one or another arbitrary fashion, sometimes with the user specifying how (for example, methods that push the changes up the tree as far as possible or down it as far as possible). Our older programs leave it to the user to do this. In Dnapars and Pars we use an algorithm discovered by Hochbaum and Pathria (1997) (and independently by Wayne Maddison) to compute branch lengths that average over all possible placements of the changes. But these branch lengths, as nice as they are, do not correct for multiple superimposed changes. Few programs available from others currently correct the branch lengths for multiple changes of state that may have overlain each other. One possible way to get branch lengths with nucleotide sequence data is to take the tree topology that you got, use Retree to convert it to be unrooted, prepare a distance matrix from your data using Dnadist, and then use Fitch with that tree as User Tree and see what branch lengths it estimates.

"Why can't your programs handle unordered multistate characters?"

In this 3.6 release there is a program Pars which does parsimony for unordered multistate characters with up to 8 states, plus ?. The other discrete characters parsimony programs can only handle two states, 0 and 1. This is mostly because I have not yet had time to modify them to do so - the modifications would have to be extensive. Ultimately I hope to get these done. If you have four or fewer states and need a feature that is not in Pars, you could recode your states to look like nucleotides and use the parsimony programs in the molecular sequence section of PHYLIP, or you could use one of the excellent parsimony programs produced by others.

Background information needed:

"What file format do I use for the sequences?"
"How do I use the programs? I can't find any documentation!"

These are discussed in the documentation files. Do you have them? If you have a copy of this page you probably do. They may be in a separate archive from the executables (in which case they are in the Documentation and Sources archives, which you should definitely fetch). Input file formats are discussed in main.html, in sequence.html, distance.html, contchar.html, discrete.html, and the documentation files for the individual programs.

Questions about distribution and citation:

"If I copied PHYLIP from a friend without you knowing, should I try to keep you from finding out?"

No. It is to your advantage and mine for you to let me know. If you did not get PHYLIP "officially" from me or from someone authorized by me, but copied a friend's version, you are not in my database of users. You may also have an old version which has since been substantially improved. I don't mind you "bootlegging" PHYLIP (it's free anyway), but you should realize that you may have copied an outdated version. If you are reading this Web page, you can get the latest version just as quickly over the Internet. It will help both of us if you get onto my mailing list. If you are on it, then I will give your name to other nearby users when they ask for the names of nearby users, and they are urged to contact you and update your copy. (I benefit by getting a better feel for how many distributions there have been, and having a better mailing list to use to give other users local people to contact). Use the registration form which can be accessed through our web site's registration page.

"Can I make copies of PHYLIP available to the students in my class?"

Generally, yes. Read the Copyright notice near the front of this main documentation page. If you charge money for PHYLIP, other than a minimal charge to cover cost of distribution, or you use it in a service for which you charge money, you will need to negotiate a royalty. But you can make it freely available and you do not need to get any special permission from us to do so.

Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...

... allow the options to be set on the command line?"

We could in Unix and Linux, or somewhat differently in Windows. But there are so many options that this would be difficult, especially when the options require additional information to be supplied such as rates of evolution for many categories of sites. You may be asking this question because you want to automate the operation of PHYLIP programs using batch files (command files) to run in background. If that is the issue, see the section of this main documentation page on "Running the programs in background or under control of a command file". It explains how to set the options using input redirection and a file that has the menu responses as keystrokes.
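
A file of menu responses of the kind described above could be produced with a short script. The following is a minimal sketch, not something shipped with PHYLIP: the function name `menu_responses` is hypothetical, the option replies shown are placeholders, and the final "Y" assumes that the program accepts its menu settings with that keystroke.

```python
def menu_responses(option_keys, confirm="Y"):
    """Return response-file text: one menu reply per line, followed by
    the reply that accepts the settings (assumed here to be "Y")."""
    replies = [str(key) for key in option_keys] + [confirm]
    return "\n".join(replies) + "\n"

# Placeholder replies: "T" stands for some menu option and "2.0" for the
# value that option then asks for. Check each program's menu for real ones.
text = menu_responses(["T", "2.0"])
print(text, end="")
```

Saved to a file such as `responses.txt` (a name chosen for this example), the text could then be fed to a program by input redirection, for instance `dnaml < responses.txt > screenout`, assuming the executable is named `dnaml` on your system.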

... include in the package a program to do the Distance Wagner method, (or successive approximations character weighting)?"

In most cases where I have not included other methods, it is because I decided that they had no substantial advantages over methods that were included (such as the programs Fitch, Kitsch, Neighbor, the T option of Mix and Dollop, and the "?" ancestral states option of the discrete characters parsimony programs).

... include in the package ordination methods and more clustering algorithms?"

Because this is not a clustering package, it's a package for phylogeny estimation. Those are different tasks with different objectives and mostly different methods. Mary Kuhner and Jon Yamato have, however, included in Neighbor an option for UPGMA clustering, which will be very similar to Kitsch in results.

... include in the package a program to do nucleotide sequence alignment?"

Well, yes, I should have, and this is scheduled to be in future releases. But multiple sequence alignment programs, in the era after Sankoff, Morel, and Cedergren's 1973 classic paper, need to use substantial computer horsepower to estimate the alignment and the tree together (but see Karl Nicholas's program GeneDoc or Ward Wheeler and David Gladstein's MALIGN, as well as more approximate methods of tree-based alignment used in ClustalW, TreeAlign, or POY).

New Features in This Version

Version 3.6 has many new features:

  • Faster (well, less slow) likelihood programs.

  • The DNA and protein likelihood and distance programs allow for rate variation between sites using a gamma distribution of rates among sites, or using a gamma distribution plus a given fraction of sites which are assumed invariant.

  • A new multistate discrete characters parsimony program, Pars, that handles unordered multistate characters.

  • The Dnapars and Pars parsimony programs can infer multifurcating trees, which sensibly reduces the number of tied trees they find.

  • A new protein sequence likelihood program, Proml, and also a version, Promlk, which assumes a molecular clock.

  • A new restriction sites and restriction fragments distance program, Restdist, that can also be used to compute distances for RAPD and AFLP data. It also allows for gamma-distributed rate variation among DNA sites.

  • In the DNA likelihood programs, you can now specify different categories of rates of change (such as rates for first, second, and third positions of a coding sequence) and assign them to specific sites. This is in addition to the ability of the program to use the Hidden Markov Model mechanism to allow rates of change to vary across sites in a way that does not ask you to assign which rate goes with which site.

  • The input files for many of the programs are now simpler, in that they do not contain options information such as specification of weights and categories. That information is now provided in separate files with default names such as weights and categories.

  • The DNA likelihood programs can now evaluate multifurcating user trees (option U).

  • All programs that read in user-defined trees now do so from a separate file, whose default name is intree, rather than requiring them to be in the input file as before.

  • The DNA likelihood programs can infer the sequence at ancestral nodes in the interior of the tree.

  • Dnapars can now do transversion parsimony.

  • The bootstrapping program Seqboot can now, instead of producing a large file containing multiple data sets, be asked to produce a weights file with multiple sets of weights. Many programs in this release can analyze those multiple weights together with the original data set, which saves disk space.

  • The bootstrapping program Seqboot can pass weights and categories information through to a multiple weights file or a multiple categories file.

  • Seqboot can also convert sequence files from Interleaved to Sequential form, or back.

  • Seqboot can convert a PHYLIP molecular sequences or discrete characters morphology data file into the NEXUS format, which is used by a number of other phylogeny programs such as MacClade, MrBayes and PAUP*.

  • Seqboot can also carry out a number of different methods of permuting the order of characters in a data set. This could be used to carry out the Incongruence Length Difference (or Partition Homogeneity) method of testing homogeneity of data sets.

  • Seqboot can also write a sequence data file into one version of an XML format for sequence alignments, for use by programs that need XML input (none of the current PHYLIP programs can yet use this format, but it may be useful in the future).

  • Retree can now write trees out into a preliminary version of a new XML tree file format which is in the process of being defined.

  • The Kishino-Hasegawa-Templeton (KHT) test which compares user-defined trees (option U) is now joined by the Shimodaira-Hasegawa (SH) test (Shimodaira and Hasegawa, 1999) which corrects for comparisons among multiple tests. This avoids a statistical problem with multiple user trees.

  • Contrast can now carry out an analysis that takes into account within-species variation, according to a model similar (but not identical) to that introduced by Michael Lynch (1990). This enables analysis of individuals sampled from the species, in a way that properly takes sampling error into account.

  • A new program, Treedist, computes the Robinson-Foulds symmetric difference distance among trees. This measures the number of branches in the trees that are present in one but not the other. It also can compute the Branch Score distance defined by Kuhner and Felsenstein (1994) which takes branch lengths into account.

  • Fitch and Kitsch now have an option to make trees by the minimum evolution distance matrix method.

  • The protein parsimony program Protpars now allows you to choose among a number of different genetic codes such as mitochondrial codes.

  • The consensus tree program Consense can compute the Ml family of consensus tree methods, which generalize the Majority Rule consensus tree method. It can also compute our extended Majority Rule consensus (which is Majority Rule with some additional groups added to resolve the tree more completely), and it can also compute the original Majority Rule consensus tree method which does not add these extra groups. It can also compute the Strict consensus.

  • The tree-drawing programs Drawgram and Drawtree have a number of new options of kinds of file they can produce, including Windows Bitmap files, files for the Idraw and FIG X windows drawing programs, the POV ray-tracer, and even VRML Virtual Reality Markup Language files that will enable you to wander around the tree using a VRML plugin for your browser, such as Cosmo Player or Cortona.

  • Drawtree now uses my new Equal Daylight Algorithm to draw unrooted trees. This gives a much better-looking tree. Of course, competing programs such as TREEVIEW and PAUP draw trees that look just as good - because they too have started to use my method (with my encouragement). Drawtree also can use another algorithm, the n-body method.

  • The tree-drawing programs can now produce trees across multiple pages, which is handy for looking at trees with very large numbers of tips, and for producing giant diagrams by pasting together multiple sheets of paper.
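
The idea behind the Seqboot weights option in the list above can be illustrated with a short sketch. Drawing sites with replacement is equivalent to giving each original site an integer weight equal to the number of times it was drawn, so a replicate can be stored as one row of weights instead of a whole resampled data set. The code below is an assumption-laden illustration in Python, not Seqboot's own code or file format; the function name `bootstrap_weights` and its parameters are hypothetical.

```python
import random

def bootstrap_weights(n_sites, n_replicates, seed=1):
    """Return one list of integer site weights per bootstrap replicate."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    replicates = []
    for _ in range(n_replicates):
        weights = [0] * n_sites
        # Draw n_sites sites with replacement and count each draw.
        for _ in range(n_sites):
            weights[rng.randrange(n_sites)] += 1
        replicates.append(weights)
    return replicates

for row in bootstrap_weights(n_sites=10, n_replicates=3):
    print(row, "sum =", sum(row))
```

Every row sums to the number of sites, and a weight of 0 means that the site was left out of that replicate.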

There are many more, lesser features added as well.
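
The symmetric difference distance mentioned for Treedist in the list above can be made concrete with a small sketch. Each interior branch of a tree splits the tips into two groups; the distance counts the splits found in one tree but not in the other. The Python below is an illustration under stated assumptions, not Treedist's implementation: trees are given as nested tuples of tip names (a representation chosen only for this example), both trees must have the same tips, and the function names are hypothetical.

```python
def splits(tree):
    """Return the set of nontrivial tip partitions implied by the tree."""
    def tip_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(tip_set(child) for child in node))

    all_tips = tip_set(tree)
    reference = min(all_tips)  # each split is stored as the side without this tip
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_tips - clade if reference in clade else clade
        # Skip branches leading to a single tip, and the root itself.
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found

def symmetric_difference(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
print(symmetric_difference(tree_1, tree_2))  # prints 2
```

Both example trees share the split separating D and E from the rest, and each has one split that the other lacks, which gives a distance of 2.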

Coming Attractions, Future Plans

There are some obvious deficiencies in this version. Some of these holes will be filled in the next few releases (leading to version 4.0). They include:

  1. Obviously we need to start thinking about a more visual mouse/windows interface, but only if that can be used on X windows, Macintoshes, and Windows.

  2. Program Penny and its relatives will be improved so as to run faster and find all most parsimonious trees more quickly.

  3. An "evolutionary clock" version of Contml will be done, and the same may also be done for Restml.

  4. We are gradually generalizing the tree structures in the programs to infer multifurcating trees as well as bifurcating ones. We should be able to have any program read any tree and know what to do with it, without the user having to fret about whether an unrooted tree was fed to a program that needs a rooted tree.

  5. In general, we need more support for protein sequences, including a codon model of change, allowing for different rates for synonymous and nonsynonymous changes.

  6. We also need more support for combining runs from multiple loci, allowing for different rates of evolution at the different loci.

  7. We will be expanding our use and production of XML data set files and XML tree files.

  8. A program to align molecular sequences on a predefined User Tree may ultimately be included. This will allow alignment and phylogeny reconstruction to proceed iteratively by successive runs of two programs, one aligning on a tree and the other finding a better tree based on that alignment. In the shorter run a simple two-sequence alignment program may be included.

  9. An interactive "likelihood explorer" for DNA sequences will be written. This will allow, either with or without the assumption of a molecular clock, trees to be varied interactively so that the user can get a much better feel for the shape of the likelihood surface. Likelihood will be able to be plotted against branch lengths for any branch.

  10. If possible we will allow use of Hidden Markov Models for correcting for purine/pyrimidine richness variations among species, within the framework of the maximum likelihood programs. That the maximum likelihood programs do not allow for base composition variation is their major limitation at the moment.

  11. The Hidden Markov Model (regional rates) option of Dnaml and Dnamlk will be generalized to allow for rates at sites to gradually change as one moves along the tree, in an attempt to implement Fitch and Markowitz's (1970) notion of "covarions".

  12. A more sophisticated compatibility program should be included, if I can find one.

  13. We are economizing on the size of the source code, and enforcing some standardization of it, by putting frequently used routines in separate files which can be linked into various programs. This will enforce a rather complete standardization of our code.

  14. We will move our code to an object-oriented language, most likely C++. One could describe the language that version 3.4 was written in as "Pascal", version 3.5 as "Pascal written in C", version 3.6 as "C written in C", and maybe version 4.0 as "C++ written in C" and then 4.1 as "C++ written in C++". At least that scenario is one possibility.

There will also be many future developments in the programs that treat continuously-measured data (quantitative characters) and morphological or behavioral data with discrete states, as I have new ideas for analyzing these data in ways that connect to within-species quantitative genetic analyses. This will compete with parsimony analysis.

Other Phylogeny Programs Available Elsewhere

A comprehensive list of phylogeny programs is maintained at the PHYLIP web site on the Phylogeny Programs pages:

http://evolution.gs.washington.edu/phylip/software.html

Here we will simply mention some of the major general-purpose programs. For many more and much more, see those web pages.

PAUP* A comprehensive program with parsimony, likelihood, and distance matrix methods. It competes with PHYLIP to be responsible for the most trees published. Written by David Swofford and distributed by Sinauer Associates of Sunderland, Massachusetts. It is described in a web page at http://www.sinauer.com/detail.php?id=8060. Current prices are $100 for the Macintosh version, $85 for the Windows version, and $150 for Unix versions for many kinds of workstations.

MrBayes The leading program for Bayesian inference of phylogenies. It uses Markov Chain Monte Carlo inference to assess support for clades and to infer posterior distributions of parameters. Produced by John Huelsenbeck and Fredrik Ronquist, it is available at its web site at http://mrbayes.net as a Mac OS X or Windows executable, or in source code in C.

MEGA A program by Sudhir Kumar of Arizona State University (written together with Koichiro Tamura, Joel Dudley and Masatoshi Nei). It can carry out parsimony and distance matrix methods for DNA sequence data. Version 4 for Windows, Macintosh, and Linux can be downloaded from the MEGA web site at http://www.megasoftware.net.

MacClade An interactive Macintosh program to rearrange trees and watch the changes in the fit of the trees to data as judged by parsimony. MacClade has a great many features including a spreadsheet data editor and many different descriptive statistics for different kinds of data. It is particularly designed to export and import data to and from PAUP*. MacClade is available for $125 from Sinauer Associates, of Sunderland, Massachusetts. It is described in a web page at http://www.sinauer.com/detail.php?id=4707 . MacClade is also described on its Web page, at http://phylogeny.arizona.edu/macclade/macclade.html.

PAML Ziheng Yang of the Department of Genetics and Biometry at University College, London has written this package of programs to carry out likelihood analysis of DNA and protein sequence data. It is one of the only packages able to use the codon model for protein sequence data which takes the genetic code reasonably fully into account. PAML is particularly strong in the options for coping with variability of rates of evolution from site to site, though it is less able than some other packages to search effectively for the best tree. It is available as C source code and as Macintosh and Windows executables from its web site at http://abacus.gene.ucl.ac.uk/software/paml.html .

TREE-PUZZLE This package by Korbinian Strimmer, Heiko Schmidt and Arndt von Haeseler was begun when von Haeseler and Strimmer were at the Universität München in Germany. TREE-PUZZLE can carry out likelihood methods for DNA and protein data, searching by the strategy of "quartet puzzling" which they invented. It can also compute distances. It superimposes trees estimated from many quartets of species. TREE-PUZZLE is available for Unix, Macintoshes, or Windows from their web site at http://www.tree-puzzle.de/ .

DAMBE A package written by Xuhua Xia of the Department of Biology of the University of Ottawa. Its initials stand for Data Analysis in Molecular Biology and Evolution. DAMBE is a general-purpose package for DNA and protein sequence phylogenies. It can read and convert a number of file formats, and has many features for descriptive statistics, and can compute a number of commonly-used distance matrix measures and infer phylogenies by parsimony, distance, or likelihood methods, including bootstrapping and jackknifing. There are a number of kinds of statistical tests of trees available and it can also display phylogenies. DAMBE includes a copy of ClustalW as well; DAMBE consists of Windows executables. It is available from its web site at http://dambe.bio.uottawa.ca/dambe.asp .

NONA Pablo Goloboff, of the Instituto Miguel Lillo in Tucumán, Argentina has written this very fast parsimony program, capable of some relevant forms of weighted parsimony, which can handle either DNA sequence data or discrete characters. It is available with some companion programs from http://www.cladistics.com/aboutNona.htm .

TNT This program, by Pablo Goloboff, J. S. Farris, and Kevin Nixon, is for searching large data sets for most parsimonious trees. The authors are respectively at the Instituto Miguel Lillo in Tucumán, Argentina, the Naturhistoriska Riksmuseet in Stockholm, Sweden, and the Hortorium, Cornell University, Ithaca, New York. TNT is described as faster than other methods, though not faster than NONA for small to medium data sets. It is distributed as Windows, Linux, and Mac OS X executables (the latter two require the PVM Parallel Virtual Machine library to be installed). The program and some support files including documentation are available from its download area at http://www.zmuc.dk/public/phylogeny/tnt (see the ReadMe! web page there). It is free, provided you agree to a license with some reasonable limitations.

These are only a few of the over 383 different phylogeny packages that are now available (as of July, 2008 - the number keeps increasing). The others are described (and web links and ftp addresses provided) at my Phylogeny Programs web pages at the address given above.

How You Can Help Me

Simply let me know of any problems you have had adapting the programs to your computer. I can often make "transparent" changes that, by making the code avoid the wilder, woolier, and less standard parts of C, not only help others who have your machine but even improve the chance of the programs functioning on new machines. I would like fairly detailed information on what gave trouble, on what operating system, machine, and (if relevant) compiler, and what had to be done to make the programs work. I am sometimes able to do some over-the-telephone trouble-shooting, particularly if I don't have to pay for the call, but electronic mail is the best way for me to be asked about problems, as you can include your input and output files so I can see what is going on (please do not send them as Attachments, but as part of the body of a message). I'd really like these programs to be able to run with only routine changes on absolutely everything, down to and possibly including the Amana Touchmatic Radarange Microwave Oven which was an Intel 8080 system (in fact, early versions of this package did run successfully on Intel 8080 systems running the CP/M operating system). A PalmPilot version was contemplated too.

I would also like to know timings of programs from the package, when run on the three test input files provided above, for various computer and compiler combinations, so that I can provide this information in the section on speeds of this document.

For the phylogeny plotting programs Drawgram and Drawtree, I am particularly interested in knowing what has to be done to adapt them for other graphic file formats.

You can also be helpful to PHYLIP users in your part of the world by helping them get the latest version of PHYLIP from our web site and by helping them with any problems they may have in getting PHYLIP working on their data.

Your help is appreciated. I am always happy to hear suggestions for features and programs that ought to be incorporated in the package, but please do not be upset if I turn out to have already considered the particular possibility you suggest and decided against it.

In Case of Trouble

Read The (documentation) Files Meticulously ("RTFM"). If that doesn't solve the problem, please check the Frequently Asked Questions web page at the PHYLIP web site:

http://evolution.gs.washington.edu/phylip/faq.html

and the PHYLIP Bugs web page at that site:

http://evolution.gs.washington.edu/phylip/bugs.html

If none of these answers your question, get in touch with me. My email address is given below. If you do ask about a problem, please specify the program name, version of the package, computer operating system, and send me your data file so I can test the problem. Also it will help if you have the relevant output and documentation files so that you can refer to them in any correspondence. I can also be reached by telephone by calling me in my office: +1-(206)-543-0150, or at home: +1-(206)-526-9057 (how's that for user support!). If I cannot be reached at either place, a message can be left at the office of the Department of Genome Sciences, +1-(206)-221-7377 but I prefer strongly that I not call you, as in any phone consultation the least you can do is pay the phone bill. Better yet, use email.

Particularly if you are in a part of the world distant from me, you may also want to try to get in touch with other users of PHYLIP nearby. I can also, if requested, provide a list of nearby users.

Joe Felsenstein
Department of Genome Sciences
University of Washington
Box 355065
Seattle, Washington 98195-5065, U.S.A.

Electronic mail addresses: joe (at) gs.washington.edu
