This file lists projects suitable for volunteers. Please see the tasks file for smaller tasks.
If you want to work on any of the projects below, please let gmp-devel@swox.com know. If you want to help with a project that already somebody else is working on, you will get in touch through gmp-devel@swox.com. (There are no email addresses of volunteers below, due to spamming problems.)
The current multiplication code uses Karatsuba, 3-way Toom, and Fermat FFT. Several new developments are desirable:
mpf_mul now, a low half partial product might be of use
         in a future sub-quadratic REDC.  On small sizes a partial product
         will be faster simply through fewer cross-products, similar to the
         way squaring is faster.  But work by Thom Mulders shows that for
         Karatsuba and higher order algorithms the advantage is progressively
         lost, so for large sizes partial products turn out to be no faster.
  Another possibility would be an optimized cube. In the basecase that should definitely be able to save cross-products in a similar fashion to squaring, but some investigation might be needed for how best to adapt the higher-order algorithms. Not sure whether cubing or further small powers have any particularly important uses though.
Write new and improve existing assembly routines. The tests/devel programs and the tune/speed.c and tune/many.pl programs are useful for testing and timing the routines you write. See the README files in those directories for more information.
Please make sure your new routines are fast for these three situations:
The most important routines are mpn_addmul_1, mpn_mul_basecase and mpn_sqr_basecase. The latter two don't exist for all machines, while mpn_addmul_1 exists for almost all machines.
Standard techniques for these routines are unrolling, software pipelining, and specialization for common operand values. For machines with poor integer multiplication, it is often possible to improve the performance using floating-point operations, or SIMD operations such as MMX or Sun's VIS.
Using floating-point operations is interesting but somewhat tricky. Since IEEE double has 53 bit of mantissa, one has to split the operands in small prices, so that no result is greater than 2^53. For 32-bit computers, splitting one operand into 16-bit pieces works. For 64-bit machines, one operand can be split into 21-bit pieces and the other into 32-bit pieces. (A 64-bit operand can be split into just three 21-bit pieces if one allows the split operands to be negative!)
We currently have extended GCD based on Lehmer's method. But the binary method can quite easily be improved for bignums in a similar way Lehmer improved Euclid's algorithm. The result should beat Lehmer's method by about a factor of 2. Furthermore, the multiprecision step of Lehmer's method and the binary method will be identical, so the current code can mostly be reused. It should be possible to share code between GCD and GCDEXT, and probably with Jacobi symbols too.
Work on Schöhage GCDEXT for large numbers is in progress.
Implement the functions of math.h for the GMP mpf layer! Check the book "Pi and the AGM" by Borwein and Borwein for ideas how to do this. These functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2, cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
These are implemented in mpfr, redoing them in mpf might not be worth bothering with, if the long term plan is to bring mpfr in as the new mpf.
The current code uses divisions, which are reasonably fast, but it'd be possible to use only multiplications by computing 1/sqrt(A) using this formula:
                                    2
                   x   = x  (3 - A x )/2.
                    i+1   i         i  
  And the final result
  
                     sqrt(A) = A x .
                                  n  
  That final multiply might be the full size of the input (though it might
  only need the high half of that), so there may or may not be any speedup
  overall.
Improve mpn_rootrem. The current code is really naive, using full precision from the first iteration. Also, calling mpn_pow_1 isn't very clever, as only 1/n of the result bits will be used; truncation after each multiplication would be better. Avoiding division might also be possible.
If the routine becomes fast enough, perhaps square roots could be computed using this function.
Some work can be saved when only the quotient is required in a division, basically the necessary correction -0, -1 or -2 to the estimated quotient can almost always be determined from only a few limbs of multiply and subtract, rather than forming a complete remainder. The greatest savings are when the quotient is small compared to the dividend and divisor.
 Some code along these lines can be found in the current
      mpn_tdiv_qr, though perhaps calculating bigger chunks of
      remainder than might be strictly necessary.  That function in its
      current form actually then always goes on to calculate a full remainder.
      Burnikel and Zeigler describe a similar approach for the divide and
      conquer case.
 mpn_bdivmod and the redc in
      mpz_powm should use some sort of divide and conquer
      algorithm.  This would benefit mpz_divexact, and
      mpn_gcd on large unequal size operands.  See "Exact
      Division with Karatsuba Complexity" by Jebelean for a (brief)
      description.
  
 Failing that, some sort of DIVEXACT_THRESHOLD could be
      added to control whether mpz_divexact uses
      mpn_bdivmod or mpn_tdiv_qr, since the latter
      is faster on large divisors.
  
For the REDC, basically all that's needed is Montgomery's algorithm done in multi-limb integers. R is multiple limbs, and the inverse and operands are multi-precision.
 For exact division the time to calculate a multi-limb inverse is not
      amortized across many modular operations, but instead will probably
      create a threshold below which the current style
      mpn_bdivmod is best.  There's also Krandick and Jebelean,
      "Bidirectional Exact Integer Division" to basically use a low to high
      exact division for the low half quotient, and a quotient-only division
      for the high half.
  
It will be noted that low-half and high-half multiplies, and a low-half square, can be used. These ought to each take as little as half the time of a full multiply, or square, though work by Thom Mulders shows the advantage is progressively lost as Karatsuba and higher algorithms are applied.
 Some sort of scheme for exceptions handling would be desirable.
      Presently the only thing documented is that divide by zero in GMP
      functions provokes a deliberate machine divide by zero (on those systems
      where such a thing exists at least).  The global gmp_errno
      is not actually documented, except for the old gmp_randinit
      function.  Being currently just a plain global means it's not
      thread-safe.
  
 The basic choices for exceptions are returning an error code or having
      a handler function to be called.  The disadvantage of error returns is
      they have to be checked, leading to tedious and rarely executed code,
      and strictly speaking such a scheme wouldn't be source or binary
      compatible.  The disadvantage of a handler function is that a
      longjmp or similar recovery from it may be difficult.  A
      combination would be possible, for instance by allowing the handler to
      return an error code.
  
Divide-by-zero, sqrt-of-negative, and similar operand range errors can normally be detected at the start of functions, so exception handling would have a clean state. What's worth considering though is that the GMP function detecting the exception may have been called via some third party library or self contained application module, and hence have various bits of state to be cleaned up above it. It'd be highly desirable for an exceptions scheme to allow for such cleanups.
The C++ destructor mechanism could help with cleanups both internally and externally, but being a plain C library we don't want to depend on that.
 A C++ throw might be a good optional extra exceptions
      mechanism, perhaps under a build option.  For GCC
      -fexceptions will add the necessary frame information to
      plain C code, or GMP could be compiled as C++.
  
 Out-of-memory exceptions are expected to be handled by the
      mp_set_memory_functions routines, rather than being a
      prospective part of divide-by-zero etc.  Some similar considerations
      apply but what differs is that out-of-memory can arise deep within GMP
      internals.  Even fundamental routines like mpn_add_n and
      mpn_addmul_1 can use temporary memory (for instance on Cray
      vector systems).  Allowing for an error code return would require an
      awful lot of checking internally.  Perhaps it'd still be worthwhile, but
      it'd be a lot of changes and the extra code would probably be rather
      rarely executed in normal usages.
  
 A longjmp recovery for out-of-memory will currently, in
      general, lead to memory leaks and may leave GMP variables operated on in
      inconsistent states.  Maybe it'd be possible to record recovery
      information for use by the relevant allocate or reallocate function, but
      that too would be a lot of changes.
  
 One scheme for out-of-memory would be to note that all GMP allocations
      go through the mp_set_memory_functions routines.  So if the
      application has an intended setjmp recovery point it can
      record memory activity by GMP and abandon space allocated and variables
      initialized after that point.  This might be as simple as directing the
      allocation functions to a separate pool, but in general would have the
      disadvantage of needing application-level bookkeeping on top of the
      normal system malloc.  An advantage however is that it
      needs nothing from GMP itself and on that basis doesn't burden
      applications not needing recovery.  Note that there's probably some
      details to be worked out here about reallocs of existing variables, and
      perhaps about copying or swapping between "permanent" and "temporary"
      variables.
  
 Applications desiring a fine-grained error control, for instance a
      language interpreter, would very possibly not be well served by a scheme
      requiring longjmp.  Wrapping every GMP function call with a
      setjmp would be very inconvenient.
  
 Another option would be to let mpz_t etc hold a sort of
      NaN, a special value indicating an out-of-memory or other failure.  This
      would be similar to NaNs in MPFR.  Unfortunately such a scheme could
      only be used by programs prepared to handle such special values, since
      for instance a program waiting for some condition to be satisfied could
      become an infinite loop if it wasn't also watching for NaNs.  The work
      to implement this would be significant too, lots of checking of inputs
      and intermediate results.  And if mpn routines were to
      participate in this (which they would have to internally) a lot of new
      return values would need to be added, since of course there's no
      mpz_t etc structure for them to indicate failure in.
  
 Stack overflow is another possible exception, but perhaps not one that
      can be easily detected in general.  On i386 GNU/Linux for instance GCC
      normally doesn't generate stack probes for an alloca, but
      merely adjusts %esp.  A big enough alloca can
      miss the stack redzone and hit arbitrary data.  GMP stack usage is
      normally a function of operand size, which might be enough for some
      applications to know they'll be safe.  Otherwise a fixed maximum usage
      can probably be obtained by building with
      --enable-alloca=malloc-reentrant (or
      notreentrant).  Arranging the default to be
      alloca only on blocks up to a certain size and
      malloc thereafter might be a better approach and would have
      the advantage of not having calculations limited by available stack.
  
 Actually recovering from stack overflow is of course another problem.
      It might be possible to catch a SIGSEGV in the stack
      redzone and do something in a sigaltstack, on systems which
      have that, but recovery might otherwise not be possible.  This is worth
      bearing in mind because there's no point worrying about tight and
      careful out-of-memory recovery if an out-of-stack is fatal.
  
 Operand overflow is another exception to be addressed.  It's easy for
      instance to ask mpz_pow_ui for a result bigger than an
      mpz_t can possibly represent.  Currently overflows in limb
      or byte count calculations will go undetected.  Often they'll still end
      up asking the memory functions for blocks bigger than available memory,
      but that's by no means certain and results are unpredictable in general.
      It'd be desirable to tighten up such size calculations.  Probably only
      selected routines would need checks, if it's assumed say that no input
      will be more than half of all memory and hence size additions like say
      mpz_mul won't overflow.
Add a test suite for old bugs. These tests wouldn't loop or use random numbers, but just test for specific bugs found in the past.
More importantly, improve the random number controls for test collections:
With this in place, it should be possible to rid GMP of practically all bugs by having some dedicated GMP test machines running "make check" with ever increasing seeds (and periodically updating to the latest GMP).
If a few more ASSERTs were sprinkled through the code, running some tests with --enable-assert might be useful, though not a substitute for tests on the normal build.
An important feature of any random tests will be to record the seeds used, and perhaps to snapshot operands before performing each test, so any problem exposed can be reproduced.
It'd be nice to have some sort of tool for getting an overview of performance. Clearly a great many things could be done, but some primary uses would be,
A combination of measuring some fundamental routines and some representative application routines might satisfy these.
 The tune/time.c routines would be the easiest way to get good
      accurate measurements on lots of different systems.  The high level
      speed_measure may or may not suit, but the basic
      speed_starttime and speed_endtime would cover
      lots of portability and accuracy questions.
restrict
   There might be some value in judicious use of C99 style
      restrict on various pointers, but this would need some
      careful thought about what it implies for the various operand overlaps
      permitted in GMP.
  
 Rumour has it some pre-C99 compilers had restrict, but
      expressing tighter (or perhaps looser) requirements.  Might be worth
      investigating that before using restrict unconditionally.
  
 Loops are presumably where the greatest benefit would be had, by 
      allowing the compiler to advance reads ahead of writes, perhaps as part
      of loop unrolling.  However critical loops are generally coded in
      assembler, so there might not be very much to gain.  And on Cray systems
      the explicit use of _Pragma gives an equivalent effect.
  
 One thing to note is that Microsoft C headers (on ia64 at least) contain
      __declspec(restrict), so a #define of
      restrict should be avoided.  It might be wisest to setup a
      gmp_restrict.
The limb-by-limb dependencies in the existing Nx1 division (and remainder) code means that chips with multiple execution units or pipelined multipliers are not fully utilized.
 One possibility is to follow the current preinv method but taking two
      limbs at a time.  That means a 2x2->4 and a 2x1->2 multiply for
      each two limbs processed, and because the 2x2 and 2x1 can each be done
      in parallel the latency will be not much more than 2 multiplies for two
      limbs, whereas the single limb method has a 2 multiply latency for just
      one limb.  A version of mpn_divrem_1 doing this has been
      written in C, but not yet tested on likely chips.  Clearly this scheme
      would extend to 3x3->9 and 3x1->3 etc, though with diminishing
      returns.
  
 For mpn_mod_1, Peter L. Montgomery proposes the following
      scheme.  For a limb R=2^bits_per_mp_limb, pre-calculate
      values R mod N, R^2 mod N, R^3 mod N, R^4 mod N.  Then take dividend
      limbs and multiply them by those values, thereby reducing them (moving
      them down) by the corresponding factor.  The products can be added to
      produce an intermediate remainder of 2 or 3 limbs to be similarly
      included in the next step.  The point is that such multiplies can be
      done in parallel, meaning as little as 1 multiply worth of latency for 4
      limbs.  If the modulus N is less than R/4 (or is it R/5?) the summed
      products will fit in 2 limbs, otherwise 3 will be required, but with the
      high only being small.  Clearly this extends to as many factors of R as
      a chip can efficiently apply.
  
The logical conculsion for powers R^i is a whole array "p[i] = R^i mod N" for i up to k, the size of the dividend. This could then be applied at multiplier throughput speed like an inner product. If the powers took roughly k divide steps to calculate then there'd be an advantage any time the same N was used three or more times. Suggested by Victor Shoup in connection with chinese-remainder style decompositions, but perhaps with other uses.
     The removal of twos in the current code could be extended to factors of
      3 or 5.  Taking this to its logical conclusion would be a
      complete decomposition into powers of primes.  The power for a prime p
      is of course floor(n/p)+floor(n/p^2)+...  Conrad Curry found this is
      quite fast (using simultaneous powering as per Handbook of Applied
      Cryptography algorithm 14.88).
    A difficulty with using all primes is that quite large n can be
      calculated on a system with enough memory, larger than we'd probably
      want for a table of primes, so some sort of sieving would be wanted.
      Perhaps just taking out the factors of 3 and 5 would give most of the
      speedup that a prime decomposition can offer.
    An obvious improvement to the current code would be to strip factors of
      2 from each multiplier and divisor and count them separately, to be
      applied with a bit shift at the end.  Factors of 3 and perhaps 5 could
      even be handled similarly.
    Conrad Curry reports a big speedup for binomial coefficients using a
      prime powering scheme, at least for k near n/2.  Of course this is only
      practical for moderate size n since again it requires primes up to n.
    When k is small the current (n-k+1)...n/1...k will be fastest.  Some
      sort of rule would be needed for when to use this or when to use prime
      powering.  Such a rule will be a function of both n and k.  Some
      investigation is needed to see what sort of shape the crossover line
      will have, the usual parameter tuning can of course find machine
      dependent constants to fill in where necessary.
    An easier possibility also reported by Conrad Curry is that it may be
      faster not to divide out the denominator (1...k) one-limb at a time, but
      do one big division at the end.  Is this because a big divisor in
        Another reason a big divisor might help is that
            Non-powers can be quickly identified by checking for Nth power residues
      modulo small primes, like   Any zero remainders found in residue testing reveal factors which can be
      divided out, with the multiplicity restricting the powers that need to
      be considered, as per the current code.  Further prime dividing should
      be grouped into limbs like       GMP is not really a number theory library and probably shouldn't have
      large amounts of code dedicated to sophisticated prime testing
      algorithms, but basic things well-implemented would suit.  Tests
      offering certainty are probably all too big or too slow (or both!) to
      justify inclusion in the main library.  Demo programs showing some
      possibilities would be good though.
    The present "repetitions" argument to       The trial divisions are done with primes generated and grouped at
      runtime.  This could instead be a table of data, with pre-calculated
      inverses too.  Storing deltas, ie. amounts to add, rather than actual
      primes would save space.    Jason Moxham points out that if a sqrt(-1) mod N exists then any factor
      of N must be == 1 mod 4, saving half the work in trial dividing.  (If
      x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the
      jacobi symbol (-1/p)=1.  But also (-1/p)=(-1)^((p-1)/2), hence must have
      p==1 mod 4.)  But knowing whether sqrt(-1) mod N exists is not too easy.
      A strong pseudoprime test can reveal one, so perhaps such a test could
      be inserted part way though the dividing.
    Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a
      quadratic pseudoprime test taking about 3x longer than a plain test, but
      with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests
      would offer only (1/4)^3 == 1/64).  Such a test needs completely random
      parameters to satisfy the theory, though single-limb values would run
      faster.  It's probably best to do at least one plain Miller-Rabin before
      any quadratic tests, since that can identify composites in less total
      time.
    Some thought needs to be given to the structure of which tests (trial
      division, Miller-Rabin, quadratic) and how many are done, based on what
      sort of inputs we expect, with a view to minimizing average time.
    It might be a good idea to break out subroutines for the various tests,
      so that an application can combine them in ways it prefers, if sensible
      defaults in   For small inputs, combinations of theory and explicit search make it
      relatively easy to offer certainty.  For instance numbers up to 2^32
      could be handled with a strong pseudoprime test and table lookup.  But
      it's rather doubtful whether a smallnum prime test belongs in a bignum
      library.  Perhaps if it had other internal uses.
    An     On various systems, calls within libgmp still go through the PLT, TOC or
      other mechanism, which makes the code bigger and slower than it needs to
      be.
    The theory would be to have all GMP intra-library calls resolved
      directly to the routines in the library.  An application wouldn't be
      able to replace a routine, the way it can normally, but there seems no
      good reason to do that, in normal circumstances.
    The   Unfortunately, on i386 it seems   The linker can be told directly (with a link script, or options) to do
      the same sort of thing.  This doesn't change the code emitted by gcc of
      course, but it does mean calls are resolved directly to their targets,
      avoiding a PLT entry.
    Keeping symbols private to libgmp.so is probably a good thing in general
      too, to stop anyone even attempting to access them.  But some
      undocumented things will need or want to be kept visibible, for use by
      mpfr, or the test and tune programs.  Libtool has a standard option for
      selecting public symbols (used now for libmp).
mpn_modexact_1_odd calculates an x in the range 0<=xmpn_mod_1.  modexact_1 is
      simpler and on some chips can run noticably faster than plain
      mod_1, on Athlon for instance 11 cycles/limb instead of 17.
      Such a difference could soon overcome the time to calculate b^n.  The
      requirement for an odd divisor in modexact can be handled
      by some shifting on-the-fly, or perhaps by an extra partial-limb step at
      the end.
mpn_bdivmod trades the latency of
      mpn_divexact_1 for the throughput of
      mpn_submul_1?  Overheads must hurt though.
  mpn_divexact_1 won't be getting a full limb in
      mpz_bin_uiui.  It's called when the n accumulator is full
      but the k may be far from full.  Perhaps the two could be decoupled so k
      is applied when full.  It'd be necessary to delay consideration of k
      terms until the corresponding n terms had been applied though, since
      otherwise the division won't be exact.
mpz_perfect_power_p could be improved in a number of ways,
      for instance p-adic arithmetic to find possible roots.
  mpn_perfect_square_p does for
      squares.  The residues to each power N for a given remainder could be
      grouped into a bit mask, the masks for the remainders to each divisor
      would then be "and"ed together to hopefully leave only a few candidate
      powers.  Need to think about how wide to make such masks, ie. how many
      powers to examine in this way.
  PP.  Need to think about how
      much dividing to do like that, probably more for bigger inputs, less for
      smaller inputs.
  mpn_gcd_1 would probably be better than the current private
      GCD routine.  The use it's put to isn't time-critical, and it might help
      ensure correctness to just use the main GCD routine.
mpz_probab_prime_p is
      rather specific to the Miller-Rabin tests of the current implementation.
      Better would be some sort of parameter asking perhaps for a maximum
      chance 1/2^x of a probable prime in fact being composite.  If
      applications follow the advice that the present reps gives 1/4^reps
      chance then perhaps such a change is unnecessary, but an explicitly
      described 1/2^x would allow for changes in the implementation or even
      for new proofs about the theory.
  mpz_probab_prime_p always initializes a new
      gmp_randstate_t for randomized tests, which unfortunately
      means it's not really very random and in particular always runs the same
      tests for a given input.  Perhaps a new interface could accept an rstate
      to use, so successive tests could increase confidence in the result.
  mpn_mod_34lsub1 is an obvious and easy improvement to the
      trial divisions.  And since the various prime factors are constants, the
      remainder can be tested with something like
#define MP_LIMB_DIVISIBLE_7_P(n) \
  ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
      Which would help compilers that don't know how to optimize divisions by
      constants, and is even an improvement on current gcc 3.2 code.  This
      technique works for any modulus, see Granlund and Montgomery "Division
      by Invariant Integers" section 9.
  udiv_qrnnd_preinv style inverses
      can be made to exist by adding dummy factors of 2 if necessary.  Some
      thought needs to be given as to how big such a table should be, based on
      how much dividing would be profitable for what sort of size inputs.  The
      data could be shared by the perfect power testing.
  mpz_probab_prime_p don't suit.  In particular
      this would let applications skip tests it knew would be unprofitable,
      like trial dividing when an input is already known to have no small
      factors.
  mpz_nthprime might be cute, but is almost certainly
      impractical for anything but small n.
visibility attribute in recent gcc is good for this,
      because it lets gcc omit unnecessary GOT pointer setups or whatever if
      it finds all calls are local and there's no global data references.
      Documented entrypoints would be protected, and purely
      internal things not wanted by test programs or anything can be
      internal.
  protected ends up causing
      text segment relocations within libgmp.so, meaning the library code
      can't be shared between processes, defeating the purpose of a shared
      library.  Perhaps this is just a gremlin in binutils (debian packaged
      2.13.90.0.16-1).