This file describes the technical aspects of the use of the modules included in IStVaN (Invariant Set Variants for Normalisation). For issues relating to legal aspects of the use of these modules, the reader is referred to the GNU general public license that should be part of the distribution as gpl.txt. Otherwise it is quite easy to locate on the internet. Most of the modules in this package have been implemented using efficient algorithms and data structures. So whenever data needs to be sorted one can expect this to be done by O(n log n) mergesort, an item of a specific rank is located in linear time, and splay trees with amortised update and lookup times of O(log n) are used for maintaining data that is incrementally updated. The one glaring omission are a few of the functions in polynomial.c. For solving the equations required to determine the polynomial minimising sum of squared deviations, a more stable method than Gaussian elimination would be preferable. And multiplication of polynomials would have been better implemented using fast Fourier transforms instead of the quadratic time calculation of all cross products that is in fact used. This should not be a reason for concern as long as the polynomial module is only used for inferring normalisation functions by polynomials of low degree. But the module should not be put to use in settings where polynomials with degrees into the hundreds, or even more, are needed. A primary concern in the development of this library was speed. Hence, sanity checks of the passed parameters are at best sporadic, and should not be relied upon. It is left for the user to make sure that e.g. the methods are not called with a non-positive number of points, negative thresholds, or negative exponential bases. Normalisation Function Modules ------------------------------ polynomials.c/polynomials.h This module implements functions for finding polynomials of arbitrary degree minimising the sum of squared deviations to a given, possibly weighted, set of points. Domain and range of the polynomials are restricted to one dimension, i.e. the points are specified by an x value and a y value. The module uses simple Gaussian elimination for solving the set of linear equations resulting from the optimum requirement. This should really be replaced by a more robust technique, e.g. as described in chapter 31.4 of Introduction to Algorithms by Cormen, Leiserson, and Rivest, but for polynomials of low degree inferred from a large set of points there shouldn't be any immediate concerns. The module defines one type, Polynomial, for representing a polynomial. One macro, _POLYNOMIAL_THRESHOLD, determines the maximum pointwise deviation from the unity matrix that is allowed for the product of the equation system matrix and its inverse, as computed by Gaussian elimination, before a warning is sent to stderr. The public functions of the module are: polynomial_new(n, a): Given an array a of n, where n is an unsigned integer, doubles, a new polynomial of degree n - 1 with i'th order coefficient a[i] is returned. polynomial_wlsdpolynomial(n, m, x, y, w, p, a, b): x, y, and w are arrays (represented by pointers to their start address), each of m doubles, containing the x value, the y value and the weight of m points. m and n are integers specifying the number of points and the number of coefficients of the polynomial that is inferred from the points (i.e. the degree of the polynomial is n - 1), respectively. p should be a double between 0 and 1, specifying the weight given to a smoothness term, and a and b are doubles specifying the interval over which the smoothness term is computed. The function returns the polynomial q(z) that minimises (1 - p')sum_i=1^m w_i(y_i - q(x_i))^2 + p' int_a^b (q''(z))^2 dz as a pointer to a value of type Polynomial, where p' is a rescaling of p corresponding to a rescaling of the weights to have mean 1. If the optimum polynomial is not unique, NULL is returned. polynomial_lsdpolynomial(n, m, x, y, p, a, b): Similar to polynomial_wlsdpolynomial, but all points are given a weight of 1. polynomial_fit(n, x, y): x and y are arrays of n doubles, where n is an integer and x and y are represented by pointers to the first element of the array. The function returns the unique polynomial of degree n - 1 fitting the n points as pointer to a value of type Polynomial. polynomial_product(p1, p2): The function takes pointers to two values of type Polynomial, p1 and p2, and returns a pointer to a third value of type Polynomial, containing the product polynomial of p1 and p2. This function uses a simple quadratic time multiplication procedure and should really by reimplemented using fast Fourier transformation. polynomial_difference(p1, p2): Similar to polynomial_product, but returns the difference polynomial. polynomial_integral(p, a, b): Takes a pointer p to a value of type Polynomial and two doubles a and b and returns the value of the integral of the polynomial from a to b as a double. polynomial_evaluate(p, x): Takes a pointer p to a value of type Polynomial and a double x and returns p(x) as a double. polynomial_isd(p1, p2, a, b): Takes pointers p1 and p2 to two values of type Polynomial and two doubles a and b and returns the value of integral_a^b (p1(x) - p2(x))^2 dx as a double. polynomial_print(p, format, fp): Takes a pointer p to a value of type Polynomial, a string format specifying the format for printing the polynomial coefficients of type double in standard printf syntax, and a file pointer fp and outputs a textual representation of the polynomial to fp starting from the low order terms. If format is NULL the coefficients are output with three digits after the decimal point. If fp is NULL stdout is used for output. polynomial_free(p): Frees the memory used to represent a polynomial as a value of type Polynomial to which p is a pointer. Module dependencies: wrappers loess.c/loess.h This module implements functions for finding loess fits of arbitrary degree to a given, possibly weighted, set of points. Domain and range of the loess fits are restricted to one dimension, i.e. the points are specified by an x value and a y value. Loess fits are local polynomial fits for which the functions of polynomial are used. The reader is referred to A Package of C and Fortran Routines for Fitting Local Regression Models by Cleveland, Grosse, and Shyu for further reading on loess fits. The module defines one type, Loess, for representing a loess fit. Two macros, _LOESS_ROBUSTSTOP and _LOESS_ROBUSTMAXITERATIONS, controls when the iterative procedure of robustifying a loess fit is terminated; either all changes in weights should be smaller than _LOESS_ROBUSTSTOP or _LOESS_ROBUSTMAXITERATIONS iterations should have been reached. The public functions of the module are: loess(n, x, y, alpha, lambda): n is an integer and x and y are pointers to the first elements of arrays each of n doubles containing the x values and y values, respectively, of n points. alpha is a non-negative double specifying the neighbourhood parameter (for alpha less than one, only the fraction of alpha points closed to the evaluation point are used for inference) and lambda is an integer specifying the degree (polynomials of degree lambda are used for the local fits). The loess fit obtained from these parameters is returned as a pointer to a value of type Loess (by and large, the only thing done at this point is a sorting of the points according to x value and computing the number of points used for inference - the actual local polynomial inference cannot be carried out until the evaluation point is known). wloess(n, x, y, w, alpha, lambda): Similar to loess, but an extra pointer w to the first element of an array of weights for each of the n points is assumed. loess_evaluate(l, a): Evaluates the loess fit l, given as a pointer to a value of type Loess, at a, given as a double, and returns the value as a double. The local polynomial fit is not computed until at this point. loess_robustify(l): The loess fit l given as a pointer to a value of type Loess iteratively has the weights of the points updated according to the robustification procedure for loess fits. Note that this function does not create a new loess fit but changes the weights stored in l. The pointer l is returned if successful, otherwise NULL is returned. loess_nisd(l1, l2, a, b, n): Takes pointers l1 and l2 to two values of type Loess and two doubles a and b and returns an approximation of the value of int_a^b (l1(x) - l2(x))^2 dx as a double. The approximation is computed by dividing the range from a to b into n, where n is an integer, subintervals of equal size, for each interval finding polynomials of degree equal to the order of l1 and l2, respectively, and fitting the loess fits at the number of points necessary to uniquely determine the polynomial evenly spaced across the subinterval, and summing integral of squared differences between the polynomial representatives for each subinterval. loess_changealpha(alpha, l): Takes a double alpha and a pointer l to a value of type Loess and changes the neighbourhood parameter of the loess fit to alpha. loess_changelambda(lambda, l): Takes an integer lambda and a pointer to a value of type Loess and changes the degree of the loess fit to lambda. loess_free(l): Frees the memory used to represent a loess fit as a value of type Loess to which l is a pointer. Module dependencies: iploess, polynomial, mergesort, rankselect, wrappers iploess.c/iploess.h This module implements functions for finding approximations to loess fits of arbitrary degree by finding polynomials of the same degree that fits the loess fit in equidistant points in intervals. One problem with loess fits is that we need to make a new local polynomial inference at each point where it is evaluated. This requires time linear in the number of points used for inference, and as the robustification procedure requires evaluation of the fit at each point used for inference each iteration of the robustification procedure requires time quadratic in the number of points used for inference. This module approximates a loess fit by precomputing polynomials in a specified range divided into a specified number of intervals. The module defines one type, IPLoess, for representing an approximate (or interpolation, hence the IP) loess fit. Two macros, _IPLOESS_ROBUSTSTOP and _IPLOESS_ROBUSTMAXITERATIONS, controls when the iterative procedure of robustifying a loess fit is terminated; either all changes in weights should be smaller than _IPLOESS_ROBUSTSTOP or _IPLOESS_ROBUSTMAXITERATIONS iterations should have been reached. The public functions of the module are: iploess(l, n, a, b): l is a loess fit represented as a pointer to a value of type Loess that is approximated by dividing the range given by doubles a and b into n intervals of equal sizes, where n is an integer. In each interval l is evaluated at lambda + 1 evenly spaced points, starting and ending at the interval boundaries, and the unique polynomial of degree lambda fitting these points is determined as the approximation of the loess fit in this interval, where lambda is the degree of the loess fit. For values smaller than a and larger than b, two polymoials of degree lambda based on the fraction of alpha smallest, respectively largest, inference points of l are used for approximation, where alpha is the neighbourhood parameter of the loess fit. The loess fit approximation is returned as a pointer to a value of type IPLoess. iploess_robust(l, n, a, b): Similar to iploess, but also robustifies the fit according to the robust loess fitting procedure. Approximate loess fits are used for evaluating the fit at the inference points in the robustification procedure, hence this should be faster (but also a slight approxiamtion) than first robustifying l using loess_robustify from the loess module before converting it to an approximate loess fit using iploess. The robustified loess fit approximation is returned as a pointer to a value of type IPLoess. iploess_evaluate(ipl, x): Takes a pointer to a value of type IPLoess and a double x and returns the approximate loess fit evaluated at the point x as a double. iploess_isd(ipl1, ipl2, a, b): Takes pointers ipl1 and ipl2 to two values of type IPLoess and two doubles a and b and returns the value of int_a^b (ipl1(x) - ipl2(x))^2 dx as a double. iploess_free(ipl): Frees the memory used to represent a loess fit as a value of type Loess to which l is a pointer. Module dependencies: loess, polynomial, mergesort, rankselect, wrappers Invariant Set Modules --------------------- The invariant set methods in this collection have been implemented to work with other types of normalisation functions than just the ones included in this collection. Hence, it is necessary for anyone wanting to use these modules to be familiar with passing functions as parameters to other functions in C. This could probably all have been done with more elegance using an object oriented language, as the best way to pass extra parameters to these functions appears to me to be through global varaibles. Examples of how this is done can be found in the istvan.c source file. For the faint of heart, all modules but the iterated global normalisation module will return an array with the weight assigned to each point if the function pointer is NULL. absolute_rank_selection.c/absolute_rank_selection.h This module implements functions for finding an invariant set based on absolute rank (or rank invariant) selection. The rank of a gene in an experiment is the position of that gene if genes are listed in order of increasing measured expression level. Absolut rank selection is based on choosing genes where the difference in rank between the two experiments does not exceed a given threshold. The module defines one type, _ARS_newfunction, that is a function requiring three arguments, an integer and two arrays of doubles, and returning a void pointer. This is the type of function that the invariant set determined by absolute rank selection is passed to for actual normalisation function inference. The public functions of the module are: absolute_rank_selection(n, x, y, t, new_function): Given n (where n is an integer) points represented by two arrays, x and y, of doubles, and an integer t, it finds all points with a difference in rank when sorted according to the x values and y values, respectively, of at most t. These points are passed to new_function that should be a function of type _ARS_newfunction, and the void pointer returned from this function (which should represent the function inferred from the selected subset of points) is returned from absolute_rank_selection. If new_function is NULL, an array of n doubles is returned with the i'th entry being 1 if the i'th point is a member of the invariant set and otherwise 0. absolute_rank_selection_all(n, x, y, new_function): Similar to absolute_rank_selection, but instead of just choosing one subset based on a specific threshold it invokes new_function on subsets selected by all thresholds t <= n that yields a larger subset than the threshold t - 1. It returns an array of void pointers, where the t'th entry is the return value from new_function if this was invoked on a subset based on threshold t, and otherwise NULL. If new_function is NULL, the non-NULL elements of the array will be arrays of doubles specifying the invariant sets computed as described for absolute_rank_selection. Module dependencies: mergesort, wrappers iterated_ars.c/iterated_ars.h This module implements functions for finding an invariant set based on iterated absolute rank (or rank invariant) selection. Where absolute rank selection uses a fixed threshold, whether a point should be retained in iterated absolute rank selection may depend on its expression level in the two experiments, the minimum and maximum expression levels of the points not (yet) eliminated, the rank of the point among the points not (yet) eliminated in the two experiments, and the number of points not (yet) eliminated. The point elimination continues on the remaining set of points until no more points are eliminated in an itereation, or until the next iteration would bring the remaining number of points below a fixed fraction of the original number of points. The module defines two types, _IARS_newfunction that is equivalent to _ARS_newfunction (described under the absolute_rank_selection module), and _IARS_threshold that is the type of function used to determine whether a point should be retained in the current iteration. A function of type _IARS_threshold takes 9 arguments: the first two arguments are the two measurements of the point represented as doubles (i.e. x[i] and y[i] for the i'th point), the next two arguments are the minimum and maximum measurements of the remaining points in experiment 1 represented as doubles (i.e. the minimum and maximum values of x[i] taken over all points where the i'th point has not (yet) been eliminated), the next two arguments are the minimum and maximum measurements of the remaining points in experiment 2 represented as doubles, the next two arguments are the ranks of the point among the remaining points when sorted according to expression levels in the two experiments, represented as integers, and the last argument is the remaining number of points represented as an integer. The function should return a integer that is non-zero if the point should be retained, and the integer 0 if the point should be eliminated from the set of remaining points. One macro, _IARS_MINIMUM_FRACTION, defines the minimum fraction of the original points that should be retained, i.e. if a further iteration would bring the remaining set of points below this fraction the iterative elimination of points is stopped short of this iteration. The public function of the module is: iterated_ars(n, x, y, new_function, threshold): given n (where n is an integer) points represented by two arrays, x and y, of doubles, the iterative absolute rank selection procedure is applied, passing each remaining point to the threshold function, that should be of type _IARS_threshold, in each iteration to determine whether the point should be retained or eliminated. The set of points not eliminated in any of the iterations are passed to the new_function function, that should be of type _IARS_newfunction, and the void pointer returned from this function (which should represent the function inferred from the selected subset of points) is returned from iterated_ars. If new_function is NULL, an array of n doubles is returned where the i'th element is 1 if the i'th point was not eliminated in any of the iterations and 0 otherwise. Module dependencies: mergesort, wrappers relative_rank_selection.c/relative_rank_selection.h This module implements functions for finding an invariant set based on relative rank (or order invariant) selection. Where absolute rank selection discriminates based on the difference in absolute ranks of a point in the two experiments, relative rank selection requires that a point is a member of a large subset of points with conserved ordering, or relative ranking, in the two experiments. The module defines one type, _RRS_newfunction, that is identical to _ARS_newfunction. The public functions of the module are: relative_rank_selection(n, x, y, t, new_function): Given n (where n is an integer) points represented by two arrays, x and y, of doubles the function determines the maximum size of any subset of points with conserved ordering according to the two experiments (i.e. if we only consider the points in the subset and sort them according to each of the two experiments, the rank of any of the points will be the same in each of the experiments). It then proceeds to find all points that are members of a subset with conserved ordering of size at most the given threshold of t (where t is an integer) smaller than the maximum size. These points are passed to new_function, that should be of type _RRS_newfunction, and the void pointer returned from this function (which should represent the function inferred from the selected subset of points) is returned from relative_rank_selection. If new_function is NULL, an array of n doubles is returned where the i'th element is 1 if the i'th point is a member of subset with conserved ordering of size at most t smaller than a maximum subset with conserved ordering and 0 otherwise. relative_rank_selection_all(n, x, y, new_function): Similar to relative_rank_selection, but instead of just choosing one subset based on a specific threshold it invokes new_function on subsets selected by all thresholds t <= n that yields a larger subset than the threshold t - 1. It returns an array of void pointers, where the t'th entry is the return value from new_function if this was invoked on a subset based on threshold t, and otherwise NULL. If new_function is NULL, the non-NULL elements of the array will be arrays of doubles specifying the invariant sets computed as described for relative_rank_selection. Module dependencies: mergesort, wrappers reliability.c/reliability.h This module implements a method we call reliability, a method that can be viewed as a continuous value derivative of relative rank selection. If we assign an `energy' to each subset of points with conserved ordering that is proportional to the size of the subset, then we also have a probability of `observing' a particular subset with conserved realtive ordering defined by the Boltzmann distribution, i.e. the probability of `observing' a particular subset at a given `temperature' is an exponential with an appropriate base of the energy of the subset. The probability of `observing' a paricular point, which is the entity we refer to as the points reliability, is just the probability of `observing' a subset containing that point. The reliability of the points are used to weight their influence on the inferred normalising function. The module defines two types. _RELIABILITY_newfunction is a function requiring four arguments, an integer and three arrays of doubles, and returning a void pointer; this is the type of function that the points and the reliability weights are passed to for actual normalisation function inference. _RELIABILITY_transform is a function requiring two arguments, an integer and an array of mp_types (see gmp_interface under auxiliary modules), and with no return value; this function type is used to allow a transformation to be applied to the reliability weights before they are passed on to normalising function inference. The public function of the module is: reliability(n, x, y, p, new_function, transform): Given n (where n is an integer) points represented as two arrays, x and y, of doubles, the reliability of each point is computed using exponential base p, where p is a double that is rounded down to the nearest 1/100 (unless _RELIABILITY_NOROUNDING is defined at compilation time - the reason for rounding is that rational numbers are used for the computations and a unnecessarily precise conversion from double might lead to a large denominator and thus slower computations). If transform, that should be of type _RELIABILITY_transform, is not NULL, the array of computed reliabilities is passed to this function to allow transformation of the reliabilities, e.g. by passing them through a sigmoid function. Finally the points and their (possibly transformed) reliabilities are passed to new_function, that should be of type _RELIABILITY_newfunction, and the void pointer returned from this function (which should represent the function inferred from the weighted set of all the points) is returned from reliability. If new_function is NULL, the pointer returned is to an array of n doubles holding the weights (i.e. reliabilities) assigned to the points. Module dependencies: mergesort, splay, gmp_interface, wrappers iterated_global_normalisation.c/iterated_global_normalisation.h This module implements iterative reestimation of normalising functions by reweighting points according to their deviation from the current normalising function estimate. The user can supply a weight calculation function that is given a points deviation from the current normalising function estimate, as well as mean and median values together with standard deviation from mean as arguments. The default reweighting is done by a function identical to the normal distribution density function. This should not lead one to believe that the weights have any statistical meaning. The choice of the Gaussian as weight function simply comes down to it having an appropriate shape, giving roughly equal high weights to points close to the current normalising function estimate and trailing off rather rapidly. The module defines five types. _IGN_newfunction is identical to _RELIABILITY_newfunction. _IGN_evalfunction is a function that requires two arguments, a void pointer and a double, and returns a double; this is the type of the function used to evaluate the current normalising function at a given value, i.e. given a normalising function estimate represented by the void pointer and the value the normalising function should be evaluated at represented by the double, a double representing the normalising function of that value should be returned. _IGN_cmpfunction is a function that requires two void pointers as arguments and returns an integer; this is the type of the function used to determine whether we have reached convergence, so given the current and previous normalising function estimates a non-zero integer should be returned if the two normalising function estimates exhibit convergence, and 0 should be returned if they do not exhibit convergence. _IGN_freefunction is a function with no return value that takes a single void pointer as argument; this is the type of function that is used to free the memory used by a normalising function estimate at the end of each iteration, i.e. it is applied to the previous normalising function estimate when the current normalising function has been estimated. _IGN_weight is a function that requires four doubles as arguments and returns a double; this is the type of the function used to compute the new weight of a point based on its deviation from the current normalising function estimate, the average deviation of all points from the current normalising function estimate, the standard deviation of all the points to the current normalising function estimate, and the median deviation of all the points to the current normalising function estimate. The module further defines two macros. Even without convergence, after _ITERATED_GLOBAL_NORMALISATION_MAXITERATIONS the iterative reestimation will be terminated and the last normalising function estimate returned. _ITERATED_GLOBAL_NORMALISATION_SLOPE controls the scaling of the function used to assign weights to points; it is multiplied by the standard deviation estimator to give the distance a point has to be from the normalising function estimate to get a weight of 1/2. The public function of the module is: iterated_global_normalisation(n, x, y, new_function, eval_function, cmp_functions, free_function, weight): Given n (where n is an integer) points represented as two arrays, x and y, of doubles, all points are first assigned equal weights of 1 and new_function, that should be of type _IGN_newfunction, is used to find a first estimate of normalising function. In an iterative manner, the distance of each point to the current normalising function estimate is then used to assign new weights, using the function weight that should be of type _IGN_weight if this is not NULL and otherwise the default scheme, to all points, based on which a new normalising function is inferred. This is repeated until convergence is achieved, as reported by cmp_funtions that should be of type _IGN_cmpfunction. eval_function, that should be of type _IGN_evalfunction, is used for evaluating by normalising function estimates, and free_function, that should be of type _IGN_freefunction, is used to release the memory used for storing obsolete normalising function estimates. Contrary to the other invariant set modules in this collection, it is not possible to have a representation of the final weights returned instead of a normalising function estimate by passing a NULL pointer as new_function. The estimation of new normalising functions is an inherent part of this scheme, so new_function has to be non-NULL. To obtain the weights assigned to the points, new_function could store store the weights passed in an array as the ordering of the points when passed to new_function is identical to the initial ordering. Module dependencies: rankselect, wrappers Auxiliary Modules ----------------- Some of the functions and data structures needed to implement the normalisation functions and invariant set methods described above have been seperated out into independent auxiliary modules. In the following these auxiliary modules are described, starting with modules implementing general purpose data structures and modules implementing general purpose algorithms, followed by modules needed in our experiments comparing the different methods, and finishing with three `interface' modules facilitating easy substitution of some data types and functions used. llist.c/llist.h This module implements a doubly linked list data structure, together with a variety of methods for manipulating this data structure. The module defines three types; LList is the doubly linked list type; LListNode is the structure used to store (a reference to) an element stored in an LList, together with references to its predecessor and successor elements; LListCounter is an iterator object for running through the elements stored in an LList, or, more generally, to navigate back and forth through an LList. The module defines two macros, FIRST that can be used to specify the start of the LList as position for LListCounter functions and LAST that can be used to specify the end of the LList for LListCounter functions. The public functions of this module are: MakeLList(): Returns a new empty LList, i.e. an LList data structure containing no elements. MakeLLists(n): Given n, represented as an integer, this function returns an array of n new empty LList's (represented as a pointer to the first element of the array). PreallocateLListNodes(n): To avoid the inefficiency of system calls to allocate and deallocate memory used for the LListNode's making up a linked list of the LList type, whenever an LListNode is disused it is added to a (l)list of disused LListNode's. When a new LListNode is needed it is first checked whether an old LListNode can be recycled, and only if this is not the case is new memory allocated for the LListNode. This function allows preallocation of n, where n is an integer, LListNodes that are inserted into the list of disused LListNodes. The n LListNodes are individually allocated, so the only possible benefit of using this function is that the LListNodes are likely to be allocated contiguously in memory. FreeUnusedLListNodes(): Frees the memory taken up by the disused LListNode's in the list of LListNode's that can be recycled. After a call to this function the list of disused LListNode's will be empty. DestroyLList(llist, destroy_elm): Given llist, a doubly linked list represented as a pointer to an LList structure, all the memory used for storing the linked list, i.e. the LListNodes are deallocated and not inserted in the list of LListNode's that can be recycled. If not NULL, the function destroy_elm, that should take one void pointer as argument and have no return value, is applied to the element stored at each node of the linked list. DestroyLLists(llist, n, destroy_elm): Similar to DestroyLList, but assumes that llist is an array of n, where n is an integer, LList's (e.g. allocated using MakeLLists); llist should be a pointer to the start of the array of LList's. The use and type of destroy_elm is identical to that of destroy_elm for DestroyLList. DeleteLList(llist, delete_elm): Similar to DestroyLList but the LListNode's going into disuse are stored in the list of disused LListNode's, i.e. the memory allocated to these LListNode's is not deallocated. The use and type of the function delete_elm is identical to that of destroy_elm for DestroyLList. DeleteLLists(llist, n, delete_elm): Similar to DestroyLLists, but as in DeleteLList the LListNode's are stored for recycling. The use and type of delete_elm is identical to that of destroy_elm for DestroyLList. Insert(llist, lnode, elm): Given a doubly linked list llist, represented as a pointer to an LList structure, a node lnode llist, represented as a pointer to an LListNode, and an element elm, represented as a void pointer, elm is inserted immediately after lnode in llist. Remove(llist, lnode): Given a doubly linked list llist, represented as a pointer to an LList structure, and a node lnode, represented as a pointer to an LListNode, in llist, the node is removed from llist and the element stored in the node is returned as a void pointer. Push(llist, elm): Given a doubly linked list llist, represented as a pointer to an LList structure, and an element elm, represented as a void pointer, elm is inserted at the beginning of llist (i.e. is `pushed' on the list). Pop(llist): Given a doubly linked list llist, represented as a pointer to an LList structure, the first element of llist is removed from the list and returned as a void pointer (i.e. the first element is `popped' from the list). Used in combination with Push this allows the use of an LList as a stack. If llist is empty, NULL is returned. Top(llist): Given a doubly linked list llist, represented as a pointer to an LList structure, the first element of llist is returned as a void pointer (NULL if llist is empty), but contrary to Pop the element is not removed from llist. Enqueue(llist, elm): Given a doubly linked list llist, represented as a pointer to an LList structure, and an element elm, represented as a void pointer, elm is inserted at the end of llist. Used in combination with Pop this allows the use of an LList as a queue. Dequeue(llist): Given a doubly linked list llist, represented as a pointer to an LList structure, the last element of llist is removed from the list and returned as a void pointer (NULL if llist is empty). Used in combination with Push this allows the use of LList as a queue. Somewhat confusingly, if used in combination with Enqueue this will implement an LList based stack and not a queue. Bottom(llist): Given a doubly linked list llist, represented as a pointer to an LList structure, the last element of llist is returned as a void pointer (NULL if llist is empty), but contrary to Dequeue the element is not removed from llist. LListMap(llist, f, ...): Given a doubly linked list llist, represented as a pointer to an LList structure, a function f, that should take a void pointer and a list of arguments represented as a va_list (see documentation on stdarg of the standard C library for further information on variable argument lists) and a variable list of arguments, the function f is applied to each element of llist in turn. Apart from passing the elements stored in llist, represented as void pointers, to f, the extra arguments given to LListMap is also passed to f as a va_list. The initial call to va_start to set up the va_list of extra arguments is performed by LListMap before each call to f. Similarly, va_end is invoked when control returns to LListMap after a call to f. So if args denotes the va_list and two extra arguments, an integer that should be referred to as i within f and a char pointer that should be referred to as s within f, are supplied to LListMap, the declarations of f should contain the code int i = va_arg(args, int); char *s = va_arg(args, char *); to retrieve the extra arguments from the va_list. MakeCounter(llist, pos): Allocates memory for an iterator object that allows navigating through a doubly linked list llist, represented as a pointer to an LList structure, and initialises the iterator object to the element at position pos, where pos is an integer. Two macros defined in llist.h, FIRST and LAST, denotes the beginning and end of llist, i.e. using FIRST as position a subsequent call to Next (see below) will return the first element stored in llist and using LAST as position a subsequent call to Prev (see below) will return the last element stored in llist. The iterator object is returned as a pointer to an LListCounter structure. InitCounter(lcounter, llist, pos): Initialises an already allocated iterator object lcounter, represented as a pointer to an LListCounter structure, to refer to the object at position pos, where pos is an integer or FIRST or LAST. SetCounter(lcounter, pos): Updates an iterator object lcounter, represented as a pointer to an LListCounter structure, to refer to the element at position pos, where pos is an integer or FIRST or LAST. The element at this position is returned as a void pointer. Next(lcounter): Updates an iterator object lcounter, represented as a pointer to an LListCounter structure, to refer to the element succeeding the current element in the doubly linked list, and returns this element as a void pointer. When the end of the doubly linked list is reached, subsequent calls to Next will return NULL and the iterator object will remain at the end of the list. Prev(lcounter): Updates an iterator object lcounter, represented as a pointer to an LListCounter structure, to refer to the element preceding the current element in the doubly linked list, and returns this element as a void pointer. When the start of the doubly linked list is reached, subsequent calls to Prev will return NULL and the iterator object will remain at the start of the list. RemoveMoveLeft(llist, lcounter): Removes the element of the current position of the iterator object lcounter, represented as a pointer to an LListCounter structure, from the doubly linked list llist, represented as a pointer to an LList structure, and updates lcounter to refer to the element preceding the element just removed in llist. The removed element is returned as a void pointer. RemoveMoveRight(llist, lcounter): Identical to RemoveMoveLeft, except for lcounter being updated to refer to the element succeeding the element just removed in llist. Length(llist): Returns the length of the doubly linked list llist, represented as a pointer to an LList structure, as an integer. Module dependency: wrappers splay.c/splay.h This module implements a search tree data structure, together with a variety of methods for manipulating this data structure. The data structure is implemented using splay trees that have efficient amortised update times. The module defines the type splaytree that is the search tree type. The data structure supports augmenting values, which are assumed to be part of the items stored in the tree. E.g. assume that the tree is used to store integer numbers and should be augmented with the sum of the numbers stored. Then the type of items inserted in the tree should not be integers but structures containing two integers (or simply an array of two integers), the actual integer stored and an integer used for the augmentation. The advantage of augmentation is that in this situation e.g. the sum of all numbers stored in the tree that are smaller than a given number can be computed efficiently (in amortised logarithmic time). The public functions of this module are: splay_preallocatenodes(n): As for the doubly linked lists implemented in llist, the structures (splaynode's) used to store individual elements and references to father and sons in the tree are not deallocated when disused, but stored in a list of splaybnode's that can be recycled. This function allows preallocation of n, where n is an integer, splaynode's, and as for PreallocateLListNodes in llist the only possible advantage of invoking this function is that the splaynode's are likely to occupy a contiguous chunk of memory. splay_freeunusednodes(): Deallocates the memory used for the splaynode's in the list of disused splaynode's. After a call to this function the list of disused splaynode's will be empty. splay_create(less_than, update): Creates an empty search tree that is returned as a pointer to a splaytree structure. The two functions less_than (that should take two items, represented by void pointers, as arguments and return a non-zero integer if the first item is less than the second item, otherwise the integer 0) and update (that should take three items, represented by void pointers, as arguments), are stored as part of the tree. The less_than function defines the ordering of items stored in the tree, i.e. it is invoked to determine the ordering between items. The update function is used for maintaining augmenting values in the tree - the first item is the item at the splaynode where we need to update the augmenting value and the other two items are those at the roots of the left and right subtrees of this splaynode (NULL if that particular subtree is empty). Assume as in the above that the augmentation is by sum of integer numbers; then update should set the augmenting integer of the first argument passed to it to the sum of the item of the first argument and the augmenting integers of the remaining non-NULL arguments. If update is NULL no augmentation computations are performed. splay_deletetree(tree, delete_item): Given tree, a search tree represented as a pointer to a spalytree structure, all the splaynode's used for storing the search tree, are inserted in the list of splaynode's that can be recycled. If not NULL, the function delete_elm, that should take one void pointer as argument and have no return value, is applied to the element stored at each node of the search tree. splay_destroytree(tree, destroy_item): Identical to splay_deletetree, except that the memory used for the splaynode's used to store the items of the tree is deallocated instead of inserting these splaynode's in the list of splaynode's that can be recycled. The type and function of destroy_item is identical to that of delete_item in spaly_deletetree. splay_insert(tree, item): Insert item, represented by a void pointer, into the search tree tree that should be a pointer to a splaytree structure. If item is already present in the tree, i.e. there is an element in the tree that is not less than item (as reported by the less_than function of the tree) and that item is not less than, the item is not inserted in the tree. I.e. this implementation does not allow for storing duplicates. splay_delete(tree, item): If item, represented as a void pointer, is in the search tree tree, represented as a pointer to a splaytree structure, it is removed from the tree and the deleted item is returned as a void pointer. splay_deleterange(tree, item1, less_than1, item2, less_than2, delete_item): Deletes all items in tree that are not less than item1 using less_than1 for comparison, and that item2 is not less than using less_than2 for comparison. The search tree tree should be represented as a pointer to a splaytree structure, item1 and item2 should be represented as void pointers, and less_than1 and less_than2 should be function taking two items, represented as void pointers, as arguments and return a non-zero integer if the first item is less than the second item and otherwise the integer 0. All deleted items are processed by delete_item (if delete_item is not NULL), that should be a function taking a void pointer as argument and have no return value. Consistency of less_than1, less_than2, and the ordering of the tree is assumed, and if either is NULL the less_than function of the tree is used for comparison. The main reason for allowing the user to supply new comparison functions is to make it user definable whether an item equal to item1 and/or item2 is included in the range deleted. E.g. to not delete an item equal to item1 less_than1 should return a non-zero integer when the two items compared are in fact equal. The user specified comparison functions also make it possible to delete ranges based on augmenting values. splay_member(tree, item): Returns the integer 1 if item, represented as a void pointer, is in the search tree tree, represented as a pointer to a splaytree structure. splay_access(tree, item): If item, represented as a void pointer, is in the search tree tree, represented as a pointer to a splaytree structure, the copy stored in tree, represented as a void pointer, is returned. If item is not in tree NULL is returned. splay_accesssmaller(tree, item): The largest element in the search tree tree, represented as a pointer to a splaytree structure, that item, represented as a void pointer, is not less than is returned as a void pointer. If item is less than all elements in tree NULL is returned. splay_accesslarger(tree, item): Similar to splay_accesssmaller, but the smallest element of tree not less than item is returned, with NULL returned if all elements of tree are less than item. splay_evaluate(tree, item, combine): Evaluates the augmenting values of the search tree tree, represented as a pointer to a splaytree structure, relative to item, that should be a void pointer. The function combine, that should take three items, represented by void pointers, as argument and return a void pointer is used to combine the augmenting value from a subtree containing all elements of tree less than item (i.e. the item in the root of this subtree is passed as the first argument to combine; if tree contains no elements less than item NULL is passed as first argument), the item itself if it is present in the tree (otherwise NULL is passed as second argument), and the augmenting value from a subtree containing all element of the tree that item is less than (i.e. the item in the root of this subtree is passed as the third argument to combine; if tree contains no elements that item is less than NULL is passed as third argument). Again assuming that a tree storing integer numbers has been augmented with the sum of all numbers stored in each subtree, to obtain the sum of all numbers stored in the tree that are less than or equal to a given number splay_evaluate could be used with a combine function that returns a pointer to an integer containing the sum of the augmenting value of its first argument and item value of its second argument. splay_join(tree1, tree2): Given two trees tree1 and tree2, represented as pointers to splaytree structures, that have compatible orderings (and augmenting values if present) and where the largest element in tree1 is less than the smallest element in tree2, all elements of tree2 are added to tree1 and tree2 is deallocated. tree1 is returned as a pointer to a splaytree structure. splay_split(tree, item): All elements in the search tree tree, represented as a pointer to a splaytree structure, that item, represented as a void pointer, is less than are removed from tree and inserted into a new tree. The new tree inherits the less_than comparison function and the update augmenting function from tree and is returned as a pointer to a splaytree structure. splay_min(tree): Returns the smallest element stored in the search tree tree, represented as a pointer to a splaytree structure, as a void pointer. If tree is empty NULL is returned. splay_max(tree): Returns the largest element stored in the search tree tree, represented as a pointer to a splaytree structure, as a void pointer. If tree is empty NULL is returned. splay_deletemin(tree): Similar to splay_min, but the item returned is also deleted from tree. splay_deletemax(tree): Similar to splay_max, but the item returned is also deleted from tree. splay_empty(tree): Returns the integer 1 if the search tree tree, represented as a pointer to a splaytree structure, is empty and otherwise returns the integer 0. splay_printtree(tree, print_item): Given a search tree tree, represented as a pointer to a splaytree structure, and a function print_item that takes an item, represented as a void pointer, as argument and has no return value, tree is printed on stdout, using print_item to output each item in the tree and using parentheses to represent the tree structure. splay_listtreeelements(tree, print_item, separator): Given a search tree tree, represented as a pointer to a splaytree structure, a function print_item that takes an item, represented as a void pointer, as argument and has no return value, and a separator string, the elements in tree are printed to stdout with separator separating consecutive elements printed. The elements are listed in increasing order. check_consistency(t): This function is mainly intended for debugging. Given a search tree t, represented as a pointer to a splaytree structure, the tree is traversed to make sure that all sons refer back to their true fathers. It does not check for cyclic substructures in the tree (something that would result e.g. from joining a tree with itself) so it might loop infinitely. Module dependency: wrappers mergesort.c/mergesort.h This module implements merge sorting, an O(n log(n)) sorting algorithm. This algorithm works by repeatedly merging two sorted arrays into one sorted array containing all the elements of the two original arrays, starting with each element of the array of numbers to be sorted as an individual array with one element. The implementation should be quite efficient. The public function of this module is: mergesort(data, n, size, less_than): Given an array data, represented as a pointer to the first element of the array, of n, where n is of type size_t, elements, each taking up size bytes of memory in data, where size is of type size_t, and a comparison function less_than, that should take two elements, represented as void pointers, and return a non-zero integer if the first element is less than the second element and otherwise the integer 0, the elements in data are swapped to obey the ordering of less_than. I.e. after the call to mergesort no element of data will be to the right of another element that it is less than, as reported by less_than. The sorting maintains the ordering of equal elements, i.e. elements where neither is reported to be less than the other by less_than. Module dependency: wrappers rankselect.c/rankselect.h This module implements linear time selection of an element of a given rank in an unsorted array. This is done by splitting the elements of the array into groups of five, recursively finding the median of the medians of each group of five elements, and split all elements according to this median of medians, i.e. into a group of elements smaller than the median of medians , a group of elements equal to the median of medians and a group of elements larger than the median of medians. The two groups that do not contain the element of the desired rank are then discarded and the method is applied recursively on the remaining group (unless the remaining group is the one containing the elements equal to the median of medians in which case we can return any element of this group). As at least 30% of all the elements will be less than or equal to the median of medians and the median of medians will be less than or equal to at least 30% of all the elements, the two recursions will be on 20% of all the elements (to find the median of medians) and at most 70% of all the elements (to find the element of desired rank in the remaining group), i.e. on a total of at most 90% of all the elements which lead to the linear time complexity. The public function of the module is: rankselect(data, k, n, size, less_than): Given an array data, represented as a pointer to the first element of the array, of n, where n is an integer, elements, each taking up size bytes of memoty in data, where size is of type size_t, and a comparison function less_than, that should take two elements, represented as void pointers, and return a non-zero integer if the first element is less than the second element and otherwise the integer 0, the index of an element of rank k, where k is an integer, is returned as an integer. I.e. if the return value is i, data[i] should be an element in data of rank k. The data array is not modified by a call to rankselect. Module dependency: wrappers diff_exp.c/diff_exp.h This module is used to simulate differential expression from a self-to-self hybridisation data set. The simulated differential expression values are drawn from a multiplicative combination of the empirical distribution observed and a distribution that trails off exponentially with distance to the original value. The interval of acceptable new values is divided into a given number of intervals of equal size and one of these is selected according to the number of observed values in the interval and the interval's distance to the original value; a new value is drawn uniformly at random from the selected interval. Efficiency in the selection of an interval (O(log(n)) time where n is the number of intervals) should be guaranteed by initial prefix and postfix computations for the array of individual interval probabilities. The module defines one type, de_dist, for representing a distribution from which new expression values can be drawn. The public functions of the module are: de_init(n, min, max, a): Given an interval of permissible simulated expression values, defined by the doubles min and max giving the left respectively right end points of the interval, a number n, where n is an integer, of subintervals, and a base a, where a is a double, for the exponential trail off (the probability of sampling from a specific subinterval is proportional to the number of observed expression values in that interval multiplied by a raised to the distance to the original expression value), a structure from which differential expression can efficiently be simulated is initialised. This structure is returned as a pointer to a de_dist structure. de_free(d): Given a differential expression simulation structure, represented as a pointer to a de_dist structure, the memory used for storing this structure is deallocated. de_adddate(m, x, d): Given the m, where m is an integer, observed expression values in the array x, where x is represented by a pointer to its first element, and a differential expression simulation structure d, represented by a pointer to a de_dist structure, the observed expression values are added to the empirical distribution d, and d is updated to allow efficient simulation. de_draw(x, min, dir, d): With x being the original expression value, represented as a double, and d being a simulated differential expression structure, represented by a pointer to a de_dist structure, a new simulated expression value that is at least min, where min is a double, away from x in the direction specified by dir is drawn from d. The direction should be an integer. If dir is non-positive the new value is allowed to be smaller than x and if dir is non-negative the new value is allowed to be larger than x; i.e. if dir is zero it is chosen at random, according to how much of the overall probability distribution that falls on either side of x at a distance of at least min, whether the new value should be smaller than x or larger than x. de_2ddraw(x, y, minx, miny, dir, dx, dy): This is similar to de_draw, but the gene is assumed to have expression values from two experiments and one of these (but not both) is selected for differential selection. If the integer dir is non-negative either the expression value for the first experiment is down-regulated or the expression value for the second experiment is up-regulated; which it is, is determined at random with probabilities proportional to the exponetially weighted observed expression counts of dx and dy. Similarly, if dir is non-positive either the expression value for the first experiment is up-regulated or the expression value for the second experiment is down-regulated. If dir is zero the direction is chosen at random according to the exponentially weighted observed expression counts for the two possibilities. The original expression values of the two experiments are x and y that should be doubles. A minimum change in expression level of minx or miny, where both are doubles, is imposed depending on whether the change will be to the expression value of the first or the second experiment. Which of the two simulated differential expression strcutures dx and dy, which should be represented as pointers to de_dist structures, that is used for drawing the new value depends on whether the value or the first experiment or the value of the second experiment is the one selected to be changed. Module dependencies: random_interface, wrappers latexplot.c/latexplot.h This module has been used during our experiments to generate LaTeX code for plotting data and inferred normalisation functions to obtain graphical illustrations of the performances of the different methods we have implemented and tested. Some might find it useful; others can with some justification claim that LaTeX is not the tool best suited for the job. For one thing, with standard memory allocations LaTeX is not able to plot several thousand points, so the plot area is divided into a grid and for each grid cell a circle with radius proportional to the number of points within the grid cell is drawn. Still, with a reasonably fine mesh (e.g. 50×50) LaTeX runs out of memory, so each grid column is drawn as an independent picture (of zero size so that the grid columns are superimposed to make up the full grid picture). The module defines one type, _LATEXPLOT_evalfunction, that is the type of the function plotted by latex_plotcurve. This type of function takes one double as argument, and returns a double that should be the value of the function at its argument. The public functions of the module are: latex_initialise(min1, max1, min2, max2, fp): Given four doubles, min1, max1, min2, max2, defining the upper and lower boundaries of a two dimensional plot, and a file pointer fp, preparations for outputting a new plot to fp are made. This plot should be finalised by a call to latex_finalise, and no other plot should be initiated until this plot is finalised. If fp is NULL the plot is output to stdout. latex_plotpoints(n, x, y, m): Given n, where n is an integer points, points represented by x and y, where both are arrays of n doubles represented by pointers to their first element, and an integer m, the plot area is divided into an m×m grid, the number of points of each grid cell is tallied, and a circle with radius proportional to the grid cell tally is drawn centered at the grid cell center for each grid cell with a non-zero tally. latex_plotcurve(eval_function, colour, steps): Given a function, that should be of type _LATEXPLOT_evalfunction, a colour, that should be a string containing the name of a colour accepted as linecolor parameter to \psline of the pstricks LaTeX package, and an integer steps, LaTeX code for drawing the function using linear interpolation between the function evaluated at step + 1 equidistantly spaced points ranging from the lefthand boundary to the righthand boundary of the plot area is output. The function is truncated so as not to exceed the upper and lower boundaries of the plot area, and the colour of the line segments is set to colour. latex_finalise(): The plot is finalised by outputting an empty picture of size equal to the plot area. All pictures output by latex_plotpoints and latex_plotcurve have size zero to allow superimposing several point and curve plots, so the plot needs to be finalised by a call to this function to reserve space for it. The size of the plot area will be 500 times \unitlength on its long side. Module dependency: wrappers; the LaTeX code generated by latex_plotcurve assumes the pstricks package gmp_interface.c/gmp_interface.h The GNU Multiprecission library (GMP) can be used for computing reliabilities using rational numbers, eliminitating the risk of rounding errors throwing the computation off. Working with rational numbers, however, is not overly fast so all GMP calls are through functions in this module. Hence, unless USE_GMP is defined this module will be compiled to use doubles for the reliability computation, otherwise the GMP rational numbers type mpq_t is used. The module defines one type, mp_type, that is the type used for reliability values. The public functions of this module are (see the GMP documentation for further description of these functions): gmp_init(X): Initialises a variable X of type mp_type. This function (or gmp_init_set_ui) is invoked before X is utilised for the first time. For the GMP library mpq_t type this is done to allocate extra memory needed to store variables of this type. gmpi_clear(X): This function is invoked after the last utilisation of its parameter, a variable of type mp_type, to deallocate memory used by this variable etc. gmpi_set(ROP, OP): Given two arguments ROP and OP of type mp_type, ROP is set equal to the value of OP. gmpi_set_ui(ROP, OP): Given a variable ROP, of type mp_type, and an unsigned long integer OP, ROP is set equal to OP. gmpi_set_si(ROP, OP1, OP2): Given a variable ROP, of type mp_type, a signed long integer OP1, and an unsigned long integer OP2, ROP is set equal to OP1 / OP2. gmpi_set_d(ROP, OP): Given a variable ROP, of type mp_type, and a double OP, ROP is set equal to OP. gmpi_init_set_ui(ROP, OP): Given a variable ROP, of type mp_type, and an unsigned long integer OP, ROP is first initialised and then set equal to OP. gmpi_add(SUM, OP1, OP2): Given three variables SUM, OP1 and OP2, all of type mp_type, SUM is set equal to the sum of OP1 and OP2. gmpi_inc(OP): Given a variable OP, of type mp_type, OP is incremented, i.e. it is set equal to one plus its old value. gmpi_add(SUM, OP1, OP2): Given three variables SUM, OP1 and OP2, all of type mp_type, SUM is set equal to the difference between OP1 and OP2. gmpi_inv(mp_type OP): Given a variable OP, of type mp_type, OP is inverted, i.e. it is set equal to one divided by its old value. gmpi_mul(ROP, OP1, OP2): Given three variables ROP, OP1 and OP2, all of type mp_type, ROP is set equal to the product of OP1 and OP2. gmpi_div(ROP, OP1, OP2): Given three variables ROP, OP1 and OP2, all of type mp_type, ROP is set equal to OP1 divided by OP2. gmpi_get_d(OP): Given a variable OP, of type mp_type, a double approximating its value is returned. gmpi_print(OP): Given a variable OP, of type mp_type, it is output to stdout. Module dependencies: wrappers; the module further needs the GMP library to be installed if USE_GMP is defined random_interface.c/random_interface.h It is well known that the pseudo-random functions that are part of the C standard library are not always adequate. We doubt that this has caused any problems in our experiments, but for people who want to supply their own (pseudo-)random functions to the differential expression simulation of the diff_exp module we have made sure that all randomisations are through this module. Hence, to replace the standard library pseudo-random function, all that is required is to rewrite this module. The public funcions of the module are: ri_uniformint(a, b): Given integers a and b, an integer at least as large as a and smaller than b is chosen uniformly at random and returned as an integer. If b is not larger than a, a is returned. ri_uniformdouble(a, b): Given doubles a and b, a double at least as large as a and smaller than b is chosen uniformly at random and returned as a double. If b is not larger than a, a is returned. No module dependencies wrappers.c/wrappers.h The malloc function of the C standard library just returns a NULL pointer if the memory allocation was unsuccessful. In our applications of the modules described here, the only sensible thing to do in the situation of an unsuccessful memory allocation attept is to exit the program. So to avoid having to check for a non-NULL return value from malloc numerous places an interface wmalloc to malloc that checks the return value and exits on NULL is implemented in wrappers. Furthermore, during development it a light weight memory profiling turned out to be useful. If the macro MPROFILING is defined all addresses returned from wmalloc are also inserted in a search tree and removed again when the memory is deallocated using the wfree function. At any point all remaining allocated and not yet deallocated addresses can be listed. The public functions of the module are: wmalloc(n): Tries to allocate n, where n is of type size_t, bytes of memory and, if successful, return a void pointer to the first byte of the block of memory allocated. If unsuccessful the program is exited with exit value 1. If MPROFILING is defined the type is wmalloc(n, id); n should still be of type size_t and denote the number of bytes that should be allocated, but the functionality is extended with inserting the allocation address together with id, that should be a string, in a search tree containing all allocated and not yet deallocated addresses. wfree(adr): Deallocates the memory allocated starting at address adr, that should be a void pointer. If MPROFILING is defined it further checks whether the address does indeed refer to memory that has been allocated using wmalloc and not yet been deallocated using wfree. This makes it easy to catch e.g. repeated deallocations of the same memory. print_allocated(): This function is only defines if MPROFILING is defined. It outputs a comma separated list of the id strings of all memory blocks allocated and not yet deallocated stored in the search tree, and thus makes it easy to pinpoint the source of a memory leak, i.e. allocated memory not being deallocated when it is no longer needed. The modules of this library have all been tested with this light weight memory profiling to eliminate memory leaks and multiple deallocations. Module dependency if MPROFILING is defined: splay Simple Front End ---------------- To make the code useful for non-developers as well, a simple front end program supplying a command line interface to the various invariant set and normalisation function inference modules has been implemented. It is the default choice of the Makefile. Once the program has been compiled, running it wiht option ? or h prints a short usasge description. It can also be seen as a small example of how to use the modules of this package. istvan.c/istvan.h This module ties the other modules together in a command line program for gene expression data normalisation. The program contains its own usage description that is printed when running the programs with options ? or h. Three macros, _ISTVAN_SMOOTHWEIGHT, _ISTVAN_CONVERGENCETHRESHOLD, and _ISTVAN_LOESSINTERVALS controls some details of the program behaviour. When inferring a polynomial function, a term equal to the integral of the second derivative squared in a specified interval can be added to penalise large changes in slope. _ISTVAN_SMOOTHWEIGHT is weight assigned to this term with 1 - _ISTVAN_SMOOTHWEIGHT assigned to the sum of squared deviations. The other two macors are related to convergence determination for iterated global normalisation. When the integral of squared differences between the two normalising function inferences of two consecutive iterations is smaller than _ISTVAN_CONVERGENCETHRESHOLD, convergence is deemed attained. There is no straightforward way to compute the integral of squared differences between two loess fits, so for loess fits the integral is approximated by converting them to approximate loess fits using _ISTVAN_LOESSINTERVALS intervals and computing the integral of squared differences between these. The module declares no public functions, but defines the main function that is compiled into the command line program. Module dependencies: polynomial, loess, iploess, absolute_rank_selection, iterated_ars, relative_rank_selection, reliability, iterated_global_normalisation, llist, splay, wrappers, gmp_interface istvan.py The C front end assumes that the data exists in a simple white space separated format. Usually expression array is available in a format that either is, or can be easily converted to, a table like format, where each gene is represented by a line consisting of multiple columns. Apart from columns holding the measured intensities, columns containing flags, background intensities, etc. are also present. This Python script allows expression data in table based format to be parsed, genes to be selected based on the contents of specific columns, genes to be grouped according to the contents of other columns, etc. The intensities of the selected genes are then passed groupwise to the simple C front end for normalisation, and the normalised values are output to the specified output file in a table format identical to the original. Note that running this script can take considerable more time than running the simple C front end, as Python is (usually) an interpreted language.