# HG changeset patch
# User Paul Boddie
# Date 1251573347 -7200
# Node ID b81c00a48c4984c230312085b7234684f520fb98
# Parent fe7ed6b96612e2f0fd90cb13690e9b317f0383fe
Introduced conditional compression for fields using bzip2 and zlib compression.
Added an IndexReader class to encapsulate all reading operations (using term and field dictionaries).
Added field-related file operations to the IndexWriter class.
Added field-related file initialisation to the Index class.
Changed the field index format to use offset deltas.

diff -r fe7ed6b96612 -r b81c00a48c49 iixr.py
--- a/iixr.py Sat Aug 29 02:15:29 2009 +0200
+++ b/iixr.py Sat Aug 29 21:15:47 2009 +0200
@@ -22,12 +22,15 @@
 from os.path import exists, join
 from os.path import commonprefix # to find common string prefixes
 from bisect import bisect_right # to find terms in the dictionary index
-import bz2 # for field compression
+import bz2, zlib # for field compression
 
 # Constants.
 
 INTERVAL = 100
 
+compressors = [("b", bz2.compress), ("z", zlib.compress)]
+decompressors = {"b" : bz2.decompress, "z" : zlib.decompress}
+
 # Foundation classes.
 
 class File:
@@ -94,7 +97,20 @@
         # Compress the string if requested.
 
         if compress:
-            s = bz2.compress(s)
+            for flag, fn in compressors:
+                cs = fn(s)
+
+                # Take the first string shorter than the original.
+
+                if len(cs) < len(s):
+                    s = cs
+                    break
+            else:
+                flag = "-"
+
+            # Record whether compression was used.
+
+            self.f.write(flag)
 
         # Write the length of the data before the data itself.
 
@@ -137,13 +153,21 @@
         'decompress' is set to a true value.
         """
 
+        # Decompress the data if requested.
+
+        if decompress:
+            flag = self.f.read(1)
+        else:
+            flag = "-"
+
         length = self.read_number()
         s = self.f.read(length)
 
-        # Decompress the data if requested.
+        # Perform decompression if applicable.
 
-        if decompress:
-            s = bz2.decompress(s)
+        if flag != "-":
+            fn = decompressors[flag]
+            s = fn(s)
 
         # Convert strings to Unicode objects.
 
@@ -532,7 +556,7 @@
         # Write the fields themselves.
 
         for field in fields:
-            self.write_string(field, 0) # compress
+            self.write_string(field, 1) # compress
 
         self.last_docnum = docnum
         return offset
@@ -565,7 +589,7 @@
 
         i = 0
         while i < nfields:
-            fields.append(self.read_string(0)) # decompress
+            fields.append(self.read_string(1)) # decompress
             i += 1
 
         return self.last_docnum, fields
@@ -589,6 +613,7 @@
 
     def reset(self):
         self.last_docnum = 0
+        self.last_offset = 0
 
     def write_document(self, docnum, offset):
 
@@ -597,12 +622,13 @@
         document are stored in the fields file.
         """
 
-        # Write the document number delta and offset.
+        # Write the document number and offset deltas.
 
         self.write_number(docnum - self.last_docnum)
-        self.write_number(offset)
+        self.write_number(offset - self.last_offset)
 
         self.last_docnum = docnum
+        self.last_offset = offset
 
 class FieldIndexReader(FileReader):
 
@@ -610,6 +636,7 @@
 
     def reset(self):
         self.last_docnum = 0
+        self.last_offset = 0
 
     def read_document(self):
 
@@ -618,9 +645,9 @@
         # Read the document number delta and offset.
 
         self.last_docnum += self.read_number()
-        offset = self.read_number()
+        self.last_offset += self.read_number()
 
-        return self.last_docnum, offset
+        return self.last_docnum, self.last_offset
 
 class FieldDictionaryWriter:
 
@@ -706,11 +733,15 @@
 
 class IndexWriter:
 
-    "Building term information and writing it to the term dictionary."
+    """
+    Building term information and writing it to the term and field dictionaries.
+    """
 
-    def __init__(self, dict_writer):
+    def __init__(self, dict_writer, field_dict_writer):
         self.dict_writer = dict_writer
+        self.field_dict_writer = field_dict_writer
         self.terms = {}
+        self.docs = {}
 
     def add_position(self, term, docnum, position):
 
@@ -731,6 +762,15 @@
 
         doc.append(position)
 
+    def add_fields(self, docnum, fields):
+
+        "Add for the document with the given 'docnum' a list of 'fields'."
+
+        if not self.docs.has_key(docnum):
+            doc_fields = self.docs[docnum] = fields
+        else:
+            self.docs[docnum] += fields
+
     def close(self):
         if self.dict_writer is None:
             return
@@ -748,6 +788,35 @@
         self.dict_writer.close()
         self.dict_writer = None
 
+        # Get the documents in order.
+
+        docs = self.docs.items()
+        docs.sort()
+
+        for docnum, fields in docs:
+            self.field_dict_writer.write_fields(docnum, fields)
+
+        self.field_dict_writer.close()
+        self.field_dict_writer = None
+
+class IndexReader:
+
+    "Accessing the term and field dictionaries."
+
+    def __init__(self, dict_reader, field_dict_reader):
+        self.dict_reader = dict_reader
+        self.field_dict_reader = field_dict_reader
+
+    def find_positions(self, term):
+        return self.dict_reader.find_positions(term)
+
+    def get_fields(self, docnum):
+        return self.field_dict_reader.read_fields(docnum)
+
+    def close(self):
+        self.dict_reader.close()
+        self.field_dict_reader.close()
+
 class Index:
 
     "An inverted index solution encapsulating the various components."
@@ -775,7 +844,15 @@
 
         dict_writer = TermDictionaryWriter(info_writer, index_writer, positions_writer, interval)
 
-        self.writer = IndexWriter(dict_writer)
+        ff = open(join(self.pathname, "fields"), "wb")
+        field_writer = FieldWriter(ff)
+
+        fif = open(join(self.pathname, "fields_index"), "wb")
+        field_index_writer = FieldIndexWriter(fif)
+
+        field_dict_writer = FieldDictionaryWriter(field_writer, field_index_writer, interval)
+
+        self.writer = IndexWriter(dict_writer, field_dict_writer)
         return self.writer
 
     def get_reader(self):
@@ -794,7 +871,17 @@
         tpf = open(join(self.pathname, "positions"), "rb")
         positions_reader = PositionReader(tpf)
 
-        self.reader = TermDictionaryReader(info_reader, index_reader, positions_reader)
+        dict_reader = TermDictionaryReader(info_reader, index_reader, positions_reader)
+
+        ff = open(join(self.pathname, "fields"), "rb")
+        field_reader = FieldReader(ff)
+
+        fif = open(join(self.pathname, "fields_index"), "rb")
+        field_index_reader = FieldIndexReader(fif)
+
+        field_dict_reader = FieldDictionaryReader(field_reader, field_index_reader)
+
+        self.reader = IndexReader(dict_reader, field_dict_reader)
         return self.reader
 
     def close(self):
diff -r fe7ed6b96612 -r b81c00a48c49 test.py
--- a/test.py Sat Aug 29 02:15:29 2009 +0200
+++ b/test.py Sat Aug 29 21:15:47 2009 +0200
@@ -285,12 +285,16 @@
 for docnum, text in docs:
     for position, term in enumerate(text.split()):
         wi.add_position(term, docnum, position)
+    wi.add_fields(docnum, [text])
 wi.close()
 
 rd = index.get_reader()
 for term, doc_positions in doc_tests:
     dp = rd.find_positions(term)
     print doc_positions == dp, doc_positions, dp
+for docnum, text in docs:
+    df = rd.get_fields(docnum)
+    print text == df[0], text, df[0]
 index.close()
 
 # vim: tabstop=4 expandtab shiftwidth=4
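
The two format changes in this changeset (a one-character compression flag written before each field, and offsets stored as deltas in the field index) can be sketched in isolation roughly as follows. This is an illustrative Python 3 sketch rather than code from iixr.py: the names pack, unpack, to_deltas and from_deltas are hypothetical, byte strings stand in for the Python 2 strings used by the patch, and the length prefix that write_string emits between the flag and the data is left out.

```python
import bz2
import zlib

# Same order as the patch: bzip2 is tried first, then zlib.
COMPRESSORS = [(b"b", bz2.compress), (b"z", zlib.compress)]
DECOMPRESSORS = {b"b": bz2.decompress, b"z": zlib.decompress}

def pack(data):
    """Return a flag byte followed by 'data', compressed only if that helps."""
    for flag, compress in COMPRESSORS:
        candidate = compress(data)
        # Take the first result that is shorter than the original.
        if len(candidate) < len(data):
            return flag + candidate
    # No compressor produced a shorter string, so store the data unchanged.
    return b"-" + data

def unpack(record):
    """Reverse pack(): inspect the flag byte and decompress if needed."""
    flag, payload = record[:1], record[1:]
    if flag == b"-":
        return payload
    return DECOMPRESSORS[flag](payload)

def to_deltas(values):
    """Encode increasing numbers (such as file offsets) as differences."""
    last = 0
    deltas = []
    for value in values:
        deltas.append(value - last)
        last = value
    return deltas

def from_deltas(deltas):
    """Rebuild the original numbers by accumulating the differences."""
    last = 0
    values = []
    for delta in deltas:
        last += delta
        values.append(last)
    return values
```

Very short fields are stored with the "-" flag because both compressors add more header bytes than they save, which is the case the conditional scheme is meant to handle. Deltas keep the numbers in the field index small as the fields file grows.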