Beagle Input Formats

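Beagle can read data files in two formats: its own native format and a slightly extended FASTA format.

Independently of the format chosen, sequences can be parsed as binary, amino acid, or nucleic acid sequences.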

The default behaviour of Beagle is to first try to parse the data file using its own native format and, if that fails, to try to parse it as FASTA format. Either format can be selected explicitly by command line options. For sequence representation, the default is to assume the binary representation; the amino acid and nucleic acid representations have to be selected explicitly by command line arguments.

Formats

Beagle can read data files in two different formats, its own native format and a slightly extended FASTA format. The extension of the FASTA format consists of assigning meaning to some comment lines, depending on their context.

Native format

The basic assumption for the native format is that each line in the input specifies a unique sequence of the data, where all white spaces (e.g. space and tab) are ignored. The sequences are required to have the same length. The two twists to this simple format are comment lines and the escape character.

Comment lines

Any line starting with the character # is interpreted as a comment. In general, these lines are simply ignored when reading the data file. Depending on the text following the # character, however, the line can specify sequence labels or site labels:

Sequence labels
If the first non-white space text after the initial # is > (note the inspiration from the FASTA format), the remainder of the line specifies the label of the next sequence in the data. E.g. #> Sequence 1 would specify Sequence 1 as the label of the next sequence. Note that a sequence can have only one label. If two or more labels are specified before the next sequence, the last one is taken to be the sequence label. It is legal to specify labels for only some of the sequences in the data.
Site labels
If the first non-white space text after the initial # is positions: (where case is ignored, e.g. PoSiTiOnS: would also be interpreted as indicating site labels), the remainder of the line is assumed to contain a white space separated list of site labels. The site labels can be specified over several comment lines tagged with positions:, in which case the site labels specified throughout the file are concatenated into the full list of site labels. E.g. if # positions: 0.2 0.3 is the first positions: tagged line, the list of site labels would be initialised to 0.2, 0.3. If #positions:0.7 was later encountered, 0.7 would be added to this list. If any site labels are specified in the data file, the number of site labels is required to equal the length of the sequences.
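The comment line rules above can be sketched in Python. This is an illustration only, not Beagle's own parser: the function name parse_native_labels is hypothetical, blank lines are assumed to be skipped, leading white space before # is assumed to be allowed, and escape character handling is left out.

```python
def parse_native_labels(lines):
    """Collect sequences, sequence labels and site labels from native-format lines.

    Hypothetical helper; escape characters are not handled in this sketch.
    """
    sequences = []        # list of (label or None, sequence)
    site_labels = []      # concatenated over all positions: lines
    pending_label = None  # label waiting for the next sequence
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith("#"):
            body = line[1:].lstrip()
            if body.startswith(">"):
                # A later label replaces an earlier one for the same sequence.
                pending_label = body[1:].strip()
            elif body.lower().startswith("positions:"):
                site_labels.extend(body[len("positions:"):].split())
            # Any other comment line is ignored.
            continue
        # Sequence line: all white space is ignored.
        sequences.append((pending_label, "".join(line.split())))
        pending_label = None
    lengths = {len(seq) for _, seq in sequences}
    if len(lengths) > 1:
        raise ValueError("sequences must have the same length")
    if site_labels and lengths and len(site_labels) != lengths.pop():
        raise ValueError("number of site labels must equal the sequence length")
    return sequences, site_labels
```

Calling it on a file with two consecutive #> lines before the first sequence returns that sequence labelled with the second label, and leaves unlabelled sequences with None.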

The escape character

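The \ character receives special treatment. It is not considered one of the characters in the input, but changes the way the following character is interpreted. As the native format examples in the Examples section show, an escaped line break glues the line to the next one, an escaped white space is kept as part of a site label instead of separating two labels, and an escape in front of an ordinary character such as 0 is superfluous.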

FASTA format

In the FASTA format each sequence is represented by a block, where the first line starts with one or more > characters, followed by the sequence label (initial and trailing white spaces in this label are ignored). Then follow zero or more comment lines, each starting with one or more ; characters. As for the native format, if the first non-white space text following the ; characters is positions: (with case ignored), the remainder of the line is interpreted as a white space separated list of site labels. If the ; characters are not followed by positions:, the comment line is simply ignored. The block ends with one or more lines specifying the sequence. White spaces in these lines are ignored; all other characters are interpreted verbatim. As for the native format, sequences are required to be of equal length, and if any site labels are specified, the number of site labels is required to equal the length of the sequences.
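A minimal Python sketch of the block structure just described may help. It is not Beagle's implementation: the name parse_extended_fasta is hypothetical, blank lines are assumed to be skipped, and for simplicity comment lines are accepted anywhere inside a block rather than only directly after the label line.

```python
def parse_extended_fasta(lines):
    """Parse the extended FASTA format into (label, sequence) pairs and site labels.

    Hypothetical helper written for illustration only.
    """
    records = []      # list of [label, sequence] in file order
    site_labels = []  # concatenated over all positions: comment lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith(">"):
            # One or more > characters, then the label with outer white space removed.
            records.append([line.lstrip(">").strip(), ""])
        elif line.startswith(";"):
            body = line.lstrip(";").lstrip()
            if body.lower().startswith("positions:"):
                site_labels.extend(body[len("positions:"):].split())
            # Any other comment line is ignored.
        else:
            if not records:
                raise ValueError("sequence data before the first > line")
            # White space is ignored; a sequence may span several lines.
            records[-1][1] += "".join(line.split())
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError("sequences must have the same length")
    if site_labels and lengths and len(site_labels) != lengths.pop():
        raise ValueError("number of site labels must equal the sequence length")
    return [(label, seq) for label, seq in records], site_labels
```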

Sequence Representation

Beagle can interpret sequences based on three different representations: binary, amino acid, and nucleic acid. The most versatile representation is binary, allowing wild and mutant types to be clearly distinguished when the most recent common ancestor is known. The other two representations are included for the convenience of users who do not have a simple way to convert their data to binary format.

Binary representation

The two different types in a site are represented by 0 and 1. Any character not a 0 or a 1 is interpreted as an unresolved site. If the most recent common ancestor is known, 1 is treated as the mutant type in each site.
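As a small illustration of this rule, the following Python sketch decodes one sequence in binary representation. The function name decode_binary and the use of None for an unresolved site are choices made for this example, not part of Beagle.

```python
def decode_binary(sequence):
    """Map each character to 0, 1, or None (unresolved).

    Hypothetical helper; expects white space to have been removed already.
    """
    return [int(char) if char in ("0", "1") else None for char in sequence]
```

For instance, decode_binary("1x") gives a mutant type (when the ancestor is known) in the first site and an unresolved second site.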

Amino acid

Resolved sites should be specified by a character from the one letter amino acid alphabet: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. Case is ignored, i.e. a and A both encode the amino acid alanine. The first resolved character in a site determines one type of the site; any other amino acid alphabet character is considered to be the alternative type.

Beagle is based on the infinite sites assumption, so a site should only contain at most two types of amino acids. If more than two types of amino acids occur in a site, the first type is treated as distinct while all remaining types are lumped into one class.

The first sequence in the data set is assumed to be the most recent common ancestor, if this is known. If there are any unresolved sites in this sequence, then the most recent common ancestor type in each of these sites is inferred from the next sequence in the data set, and so on.
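The reduction of an amino acid data set to two classes per site, as described above, can be sketched in Python as follows. The name to_two_classes and the encoding (0 for the first resolved type in a site, 1 for every other type, None for unresolved) are assumptions made for this example; Beagle's internal encoding may differ.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def to_two_classes(sequences, alphabet=AMINO_ACIDS):
    """Reduce equal-length sequences to 0/1/None per site.

    Hypothetical helper. In each site, the first resolved character found when
    scanning the sequences from top to bottom defines type 0; all other
    resolved characters are lumped into type 1.
    """
    if not sequences:
        return []
    n_sites = len(sequences[0])
    rows = [[None] * n_sites for _ in sequences]
    for site in range(n_sites):
        first_type = None
        for row, seq in zip(rows, sequences):
            char = seq[site].upper()  # case is ignored
            if char not in alphabet:
                continue  # unresolved: leave None and look further down
            if first_type is None:
                first_type = char
            row[site] = 0 if char == first_type else 1
    return rows
```

Because the scan continues past unresolved characters, a gap in the first sequence means the type for that site is taken from the next sequence that resolves it.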

Nucleic acid

Resolved sites should be specified by one of the five characters A, C, G, T, and U. T (thymine) and U (uracil) are treated as identical nucleic acids. As for amino acids, case is ignored, only the first nucleic acid type occurring in a site is treated as distinct, and the first sequence in the data set is assumed to be the most recent common ancestor, if this is known.
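The treatment of T and U as the same type can be illustrated with a short Python sketch. Both function names are hypothetical, and None is used here to mark an unresolved site.

```python
NUCLEOTIDES = {"A", "C", "G", "T"}

def normalise_nucleotide(char):
    """Return A, C, G or T for a resolved site, or None if unresolved.

    Hypothetical helper: case is ignored and U is folded into T.
    """
    base = char.upper()
    if base == "U":
        base = "T"
    return base if base in NUCLEOTIDES else None

def site_types(sequences, site):
    """List the distinct resolved types in one site, in order of first occurrence."""
    seen = []
    for seq in sequences:
        base = normalise_nucleotide(seq[site])
        if base is not None and base not in seen:
            seen.append(base)
    return seen
```

A result with more than two entries indicates a site where all types after the first would be lumped into one class.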

Examples

In this section we will use the four gamete data set as a running example, and show different ways to specify it.

Native format

Binary representation

With no further information, the four gamete data set is simply specified as

00
01
10
11
A more elaborate way of specifying the same data set, using comment lines and the escape character would be
# Four gamete data set
00
# Escape character used to glue consecutive lines together:
0\
1
# Superfluous escape character
1\0
11
Here neither sites nor sequences are labelled. If we want to label the first sequence Sequence 1, the fourth sequence Sequence 2, and the sites Site 1 and Site 2, this can be done as
# Four gamete data set
#> Sequence 1
00
01
10
#> Sequence 2
11
# Positions: Site\ 1 Site\ 2
Note that
# Four gamete data set
#> Sequence 1
#> Sequence 2
#> Sequence 3
#> Sequence 4
00
01
10
11
would only assign a label to the first sequence, which would be labelled Sequence 4.

Finally, assume that the second site in the third sequence is unresolved. This can be specified as

# Three and a half gamete data set
00
01
1x
11
The character used to specify an unresolved site does not have to be x; any non-special character apart from 0 and 1 could have been used.

Amino acid representation

Assume that the four gamete data is really from a protein data set, where the first segregating site contains alanine and proline, and the second segregating site contains isoleucine and leucine. This data set can be represented by
# Amino acid four gamete data set
Ai
AL
pL
pi
Note that alanine is considered the wild type in the first site and isoleucine is considered the wild type in the second site, if the most recent common ancestor is known.

Nucleic acid representation

If the four gamete data is from a mixed DNA and RNA data set, where the first segregating site contains cytosine and thymine/uracil and the second segregating site contains adenine and guanine, one possible such data set can be represented as
# Nucleic acid four gamete data set
CA
CG
TA
UG
Again, cytosine is considered the wild type in the first site and adenine the wild type in the second site, if the most recent common ancestor is known.

FASTA format

In the FASTA format, all sequences have to be labelled. So assume that we want the sequences labelled Sequence 1 through Sequence 4, and the sites labelled Site1 and Site2 (there is no mechanism for specifying white spaces in site labels in FASTA format). This can be done as
>>> Sequence 1
;;; positions: Site1 Site2
00
> Sequence 2
01
> Sequence 3
10
> Sequence 4
11
As for the native format, the sequences could also have been represented as amino acid or nucleic acid sequences. Unresolved sites are specified by a character not in the alphabet of the chosen representation.
Rune Lyngsø, rlyngsoe@stats.ox.ac.uk