#!/usr/bin/env python

""" string_matrix -- approximate string matching across a set of strings

    This pair of functions allows approximate matching of one string, called
    a key,  to a set of strings, called a dictionary, as opposed to a
    traditional approach which approximately compares or matches against a
    single other string.  The method used was an original invention of mine,
    and may or may not be original in the larger world.
    
    The original motivation was to reduce the time in finding likenesses
    in paired sets of strings where the sets are of size n.  The basic
    approach is an O(n^2) problem, while this yields something close to an
    O(n) problem.

    METHOD

    The approach is based on building a large memory-intensive lookup
    datastructure or dictionary for a set of strings, which is retained over
    multiple key lookups.  The lookups return a set of scored matches which
    exceed an arbitrary cutoff.

    The key idea is considering a word as a set of letter pairs, or
    wordlets, which are positionally located in the word.   An arbitrary
    number of quantized positional values are chosen for implementation.
    Positions are determined by distance through the word divided by word
    length.  Comparisons are based on these letter-position pairs.
    
    Quantization is handled in _both_ directions, rounding both up and down
    simultaneously, to ensure that differently lengthed words will not
    fail to match due to rounding errors.

    For example the word 'ham', two wordlets exist: 'ha', and 'am'.  The
    word is three characters long, and the wordlets begin at locations 1 and
    2.  In my implementation I chose a set of 10 positional values, and thus
    they are slotted to location 1*10/3 and 2*10/3 -- yeilding 3.33333 and
    6.66666.  As no such slots exist, they are rounded both up and down,
    locating 'ha' at 3, and 4, while 'am' is placed in slots 6 and 7.

    This data is stored in a dictionary structure, keyed by wordlet, and
    containing one list of words per quantized location.  This means that
    adding 'ham' to the dictionary results in the following:
    dict = { 'ha': [[], [], [], ['ham'], ['ham'], [], [], [], [], []]
             'am': [[], [], [], [], [], ['ham'], ['ham'], [], [], []] }
    
    Later, a lookup can be performed similarly.  The key string is split
    into wordlets.  For each wordlet the matching words at both quantized
    locations are collected.  Words are scored based on their percentage of
    matches as compared to the total number of wordlets in the key string.

    MEMORY USE

    Python lists consume 4 bytes per entry, and thus 4 bytes per wordlet
    in the dictionary will be consumed.  In addition, approximately 188
    bytes are used per unique wordlet for the creation of the lists and the
    dictionary entry.  My working dataset consisted of around 100,000
    strings around 10 characters long, primarily text.  This equated to
    approximately 3.6MB of memory in list entries, and 18.8MB in structure.

    PERFORMANCE

    The lookup of wordlets is an (extremely)  simple hash and is thus
    performed in constant time, as is the selection of the quantized
    location.  However, the number of words to collect and process from
    these wordlet lookups grows linearly with the number of words in the
    dictionary.  This would seem to cause a single lookup to grow linearly
    with the size of the dataset, the same as a stupid search.  Fortunately,
    the constant multiplier for word processing is very small.  The work for
    a lookup is thus something like

        O(1 + kN) 

    where N is the size of the list and k is very small.  Attempts to
    optimize for this kN operation did not break even for data sets several
    orders of magnitude greater than the expected case.

    Converting from a linear search to this code when attempting to find
    near matches in significantly sized sets of strings -- 100,000 or so --
    reduced the execution time by a factor of over 500.
"""

def word_bits_hash(list):
    """ Create a word dicationary from a list of strings 
        
        arguments: list - list of strings
        returns:   dictionary for use with string_find
    """
    word_elements = {}
    for str in list:
        strlen = len(str) 
        for index in range(1, strlen):
            wordlet = str[index - 1: index + 1]
            (wordloc, rem) = divmod((index * 10), strlen)
            try:
                bucket = word_elements[wordlet][wordloc - 1]
                if not str in bucket:
                    bucket.append(str)
            except KeyError:
                word_elements[wordlet] = [[], [], [], [], [], [], [], [], [], []]
                word_elements[wordlet][wordloc - 1].append(str)
            if rem:
                try:
                    bucket = word_elements[wordlet][wordloc]
                    if not str in bucket:
                        bucket.append(str)
                except KeyError:
                    word_elements[wordlet] = [[], [], [], [], [], [], [], [], [], []]
                    word_elements[wordlet][wordloc].append(str)
    return word_elements

def string_find(str, word_hash):
    """ Find approximate matches to a string within a dictionary

        arguments: str -       string to approximately match
                   word_hash - dictionary containing words to match against
        returns:               tuple continaing:
                                    most closely matched word
                                    list of word/score pairs
    """
    match_candidates = {}

    strlen = len(str)
    for index in range(1, strlen):
        wordlet = str[index - 1: index + 1]
        (wordloc, rem) = divmod((index * 10), strlen)
        try:
            bucket = word_hash[wordlet][wordloc - 1]
        except KeyError:
            pass
        else:
            for match in bucket:
                if match_candidates.has_key(match):
                    match_candidates[match] += 1
                else: 
                    match_candidates[match] = 1
            continue
        if rem:
            try:
                bucket = word_hash[wordlet][wordloc]
                for match in bucket:
                    if match_candidates.has_key(match):
                        match_candidates[match] += 1
                    else: 
                        match_candidates[match] = 1
            except KeyError:
                pass

    required = 0.6
    matches = []
    for candidate in match_candidates.keys():
        mlen = len(candidate)
        if mlen < strlen:
            mlen = strlen
        value = match_candidates[candidate] / float(mlen - 1)
        if value >= required:
            matches.append((value, candidate))
    matches.sort()
    best = 0
    if matches:
        best = matches[-1][0]
    return (best, matches)