Beagle can read data files in two formats, its own native format and FASTA format.
Independently of the format chosen, sequences can be parsed as binary, amino acid, or nucleic acid sequences.
The default behaviour of Beagle is to first try to parse the data file using its own format and, if that fails, to try to parse it as FASTA format. Either format can be explicitly selected by command line options. For sequence representation, the default behaviour is to assume a binary format, while amino acid and nucleic acid formats have to be explicitly selected by command line arguments.
The basic assumption for the native format is that each line in the input specifies a unique sequence of the data, where all white spaces (e.g. space and tab) are ignored. The sequences are required to have the same length. The two twists to this simple format are comment lines and the escape character.
Any line starting with the character # is interpreted
as a comment. In general, these lines are simply ignored when reading
the data file. Depending on the text following the #
character, the line can specify sequence labels or site labels:

If the text following the # starts with > (note the inspiration from the
FASTA format), the remainder of the line specifies the label of the
next sequence in the
data. E.g. #> Sequence 1 would
specify Sequence 1 as the label of the next
sequence. Note that a sequence can have only one label. If two or
more labels are specified before the next sequence, the last one is
taken to be the sequence label. It is legal to specify labels
for only some of the sequences in the data.

If the first non-white space text following the # is positions: (where case is ignored,
e.g. PoSiTiOnS: would also be interpreted as
indicating site labels), the remainder of the line is assumed to
contain a white space separated list of site labels. The site labels
can be specified over several comment lines tagged with
positions: — the site labels specified
throughout the file are concatenated into the full list of site
labels. E.g. if
# positions: 0.2 0.3 is the first
positions: tagged line, the list of site labels would
be initialised to 0.2, 0.3. If
#positions:0.7 was later encountered, 0.7
would be added to this list. If any site labels are specified in the
data file, the number of site labels is required to equal the
length of the sequences.

The \ character receives special treatment. It is not
considered one of the characters in the input, but changes the way the
following character is interpreted:

If the following character is n, t,
or b, it is interpreted as a newline, a
tab, or a backspace, respectively.

If the \ character occurs at the end of a line, the newline is
ignored. This provides a mechanism for `gluing' two or more lines
together. E.g.

    # positions: 0.2 0.3\
     0.7

would be read as the single line
# positions: 0.2 0.3 0.7. Note that
the escaped newline is not converted to a space. If there had been
no space at the beginning of the second line in this example, it
would specify only two site labels, 0.2 and
0.30.7.

In all other cases the \ character is superfluous,
but it provides a mechanism for specifying a \
(\\) or a white space character that would otherwise
have been ignored. E.g.
# positions: Pos\ 1 Pos\ 2
specifies only two site labels, Pos 1 and
Pos 2.

In the FASTA format each sequence is represented by a block, where
the first line starts with one or more > characters,
followed by the sequence label (initial and trailing white spaces in
this label are ignored). Then follows zero or more comment lines, each
starting with one or more ; characters. As for the native
format, if the first non-white space text following the ;
characters is positions: (with case ignored) the
remainder of the line is interpreted as a white space separated list
of site labels. If the ; characters are not followed by
positions, the comment line is simply ignored. The block
ends with one or more lines specifying the sequence. White spaces in
these lines are ignored, all other characters are interpreted
verbatim. As for the native format, sequences are required to be of
equal length, and if any site labels are specified the number of site
labels is required to equal the length of the sequences.
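
The FASTA rules above can be illustrated with a minimal Python sketch. It is not Beagle's own parser: the function name `parse_fasta` is hypothetical, and accepting `;` comment lines anywhere inside a block (not only directly after the label line) is an assumption made here for simplicity.

```python
def parse_fasta(text):
    """Return (sequence_labels, sequences, site_labels) for FASTA-style input.

    Hypothetical helper for illustration only, not part of Beagle.
    """
    labels, sequences, site_labels = [], [], []
    for line in text.splitlines():
        if line.startswith(">"):
            # One or more '>' characters, then the label with outer white space removed.
            labels.append(line.lstrip(">").strip())
            sequences.append("")
        elif line.startswith(";"):
            # One or more ';' characters; only 'positions:' comments carry data.
            comment = line.lstrip(";").lstrip()
            if comment.lower().startswith("positions:"):
                site_labels.extend(comment[len("positions:"):].split())
        elif sequences:
            # Sequence line: white space is dropped, everything else is kept verbatim.
            sequences[-1] += "".join(line.split())
        elif line.strip():
            raise ValueError("sequence data found before the first '>' line")

    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences are required to have the same length")
    if site_labels and sequences and len(site_labels) != len(sequences[0]):
        raise ValueError("number of site labels must equal the sequence length")
    return labels, sequences, site_labels
```

Calling `parse_fasta` on a block such as `>>> Sequence 1`, `;;; positions: Site1 Site2`, `00` would yield one label, one sequence and two site labels; a mismatch between the number of site labels and the sequence length raises `ValueError`.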
Beagle can interpret sequences based on three different representations: binary, amino acid, and nucleic acid. The most versatile representation is binary, allowing wild and mutant types to be clearly distinguished when the most recent common ancestor is known. The other two representations are included for the convenience of users who do not have a simple way to convert their data to binary format.
In the binary representation, the two different types in a site are represented by 0
and 1. Any character not a 0 or a
1 is interpreted as an unresolved site. If the most
recent common ancestor is known, 1 is treated as the
mutant type in each site.
In the amino acid representation, resolved sites should be specified by a character from the one
letter amino acid alphabet: A, C,
D, E, F, G,
H, I, K, L,
M, N, P, Q,
R, S, T, V,
W, and Y. Case is ignored,
i.e. a and A both encode the amino
acid alanine. The first resolved character in a site determines one
type of the site, any other amino acid alphabet character is
considered to be the alternative type.
Beagle is based on the infinite sites assumption, so a site should contain at most two types of amino acids. If more than two types of amino acids occur in a site, the first type is treated as distinct while all remaining types are lumped into one class.
The first sequence in the data set is assumed to be the most recent common ancestor, if this is known. If there are any unresolved sites in this sequence, then the most recent common ancestor type in these sites is inferred from the next sequence in the data set, and so on.
In the nucleic acid representation, resolved sites should be specified by one of the five characters
A, C, G, T, and
U. T (thymine) and U (uracil)
are treated as identical nucleic acids. Again case is ignored, only
the first nucleic acid type occurring in a site is treated as
distinct, and the first sequence in the data set is assumed to be the
most recent common ancestor, if this is known.
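
How the amino acid and nucleic acid representations reduce to two types per site can be sketched in Python as follows. This is an illustration under stated assumptions, not Beagle's implementation: the names `canonical` and `to_binary` are hypothetical, mapping the first resolved type in a site to 0 and every other resolved type to 1 is an interpretation of the rules above, and `?` is an arbitrary choice of marker for unresolved sites.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEIC_ACIDS = set("ACGT")


def canonical(ch, representation):
    """Return the canonical type of a character, or None if it is unresolved."""
    ch = ch.upper()                      # case is ignored
    if representation == "nucleic":
        if ch == "U":                    # uracil is treated as identical to thymine
            ch = "T"
        return ch if ch in NUCLEIC_ACIDS else None
    if representation == "amino":
        return ch if ch in AMINO_ACIDS else None
    raise ValueError("representation must be 'amino' or 'nucleic'")


def to_binary(sequences, representation):
    """Rewrite equal-length sequences as strings over '0', '1' and '?'.

    In each site, the first resolved type seen (reading the sequences in
    order) becomes '0'; all other resolved types are lumped together as '1'.
    """
    if not sequences:
        return []
    first_type = [None] * len(sequences[0])
    rows = []
    for seq in sequences:
        row = []
        for site, ch in enumerate(seq):
            kind = canonical(ch, representation)
            if kind is None:
                row.append("?")          # unresolved site (assumed marker)
            else:
                if first_type[site] is None:
                    first_type[site] = kind
                row.append("0" if kind == first_type[site] else "1")
        rows.append("".join(row))
    return rows
```

Because the sequences are scanned in order, a site that is unresolved in the first sequence takes its first type from the next sequence in which it is resolved, which mirrors the rule for inferring the most recent common ancestor type.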
With no further information, the four gamete data set is simply specified as

    00
    01
    10
    11

A more elaborate way of specifying the same data set, using comment lines and the escape character, would be

    # Four gamete data set
    00
    # Escape character used to glue consecutive lines together:
    0\
    1
    # Superfluous escape character
    1\0
    11

Here neither sites nor sequences are labelled. If we want to label the first sequence Sequence 1, the fourth sequence
Sequence 2, and the sites
Site 1 and Site 2, this can be
done as

    # Four gamete data set
    #> Sequence 1
    00
    01
    10
    #> Sequence 2
    11
    # Positions: Site\ 1 Site\ 2

Note that

    # Four gamete data set
    #> Sequence 1
    #> Sequence 2
    #> Sequence 3
    #> Sequence 4
    00
    01
    10
    11

would only assign a label to the first sequence, which would be labelled
Sequence 4.
Finally, assume that the second site in the third sequence is unresolved. This can be specified as

    # Three and a half gamete data set
    00
    01
    1x
    11

The character used to specify an unresolved site does not have to be
x — any non-special character apart from
0 and 1 could have been used.
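
The native format rules illustrated by these examples (escape handling, comment lines, sequence labels, and site labels) can be sketched in Python. This is a minimal sketch, not Beagle's parser: the function names are hypothetical, and it assumes that escapes are resolved before comment lines are recognised, that a comment must have # as its very first character, and that lines left empty after white space removal are skipped.

```python
ESCAPES = {"n": "\n", "t": "\t", "b": "\b"}


def logical_lines(text):
    """Split text into lines of (character, escaped) pairs, resolving '\\'."""
    lines, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 1
            if i < len(text) and text[i] != "\n":
                # Escaped character: n, t, b are translated, others kept as-is.
                current.append((ESCAPES.get(text[i], text[i]), True))
            # An escaped newline adds nothing, gluing the two lines together.
        elif ch == "\n":
            lines.append(current)
            current = []
        else:
            current.append((ch, False))
        i += 1
    lines.append(current)
    return lines


def split_labels(chars):
    """Split on unescaped white space only, so 'Pos\\ 1' stays one label."""
    labels, word = [], []
    for ch, escaped in chars:
        if ch.isspace() and not escaped:
            if word:
                labels.append("".join(word))
                word = []
        else:
            word.append(ch)
    if word:
        labels.append("".join(word))
    return labels


def parse_native(text):
    """Return (sequences, sequence_labels, site_labels); missing labels are None."""
    sequences, seq_labels, site_labels = [], [], []
    pending = None                       # label waiting for the next sequence
    for line in logical_lines(text):
        if line and line[0] == ("#", False):
            body = line[1:]
            if body and body[0] == (">", False):
                pending = "".join(c for c, _ in body[1:]).strip()
                continue
            while body and body[0][0].isspace() and not body[0][1]:
                body = body[1:]          # skip white space before the tag
            tag = "".join(c for c, _ in body[:10]).lower()
            if tag == "positions:":
                site_labels.extend(split_labels(body[10:]))
            continue
        seq = "".join(c for c, esc in line if esc or not c.isspace())
        if seq:
            sequences.append(seq)
            seq_labels.append(pending)   # the last label given wins
            pending = None
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("sequences are required to have the same length")
    if site_labels and sequences and len(site_labels) != len(sequences[0]):
        raise ValueError("number of site labels must equal the sequence length")
    return sequences, seq_labels, site_labels
```

Applied to the labelled four gamete example, the sketch returns the four sequences, the labels Sequence 1 and Sequence 2 for the first and fourth sequence (with `None` for the two unlabelled ones), and the site labels Site 1 and Site 2.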
In the amino acid representation, a four gamete data set could be specified as

    # Amino acid four gamete data set
    Ai
    AL
    pL
    pi

Note that alanine is considered the wild type in the first site and isoleucine is considered the wild type in the second site, if the most recent common ancestor is known.
Similarly, in the nucleic acid representation, a four gamete data set could be specified as

    # Nucleic acid four gamete data set
    CA
    CG
    TA
    UG

Again, cytosine is considered the wild type in the first site and adenine the wild type in the second site, if the most recent common ancestor is known.
In FASTA format, assume that we want the four gamete data set with the sequences labelled Sequence 1 through
Sequence 4, and the sites labelled
Site1 and Site2 (there is no mechanism for
specifying white spaces in site labels in FASTA format). This can be
done as

    >>> Sequence 1
    ;;; positions: Site1 Site2
    00
    > Sequence 2
    01
    > Sequence 3
    10
    > Sequence 4
    11

As for the native format, the sequences could also have been represented as amino acid or nucleic acid sequences. Unresolved sites are specified by a character not in the alphabet of the chosen representation.