一种基于Bigram二级哈希的中文索引结构

1. 基础知识：

1.1 中文GB2312编码

GB2312标准共收录6763个汉字，其中一级汉字3755个，二级汉字3008个，汉字覆盖99.75%的使用频率。同时采用区位码编码，每个汉字/符号用两个字节表示。

1.2 倒排索引

由词汇表和倒排列表组成。其中倒排列表包含该词汇在文本库中出现的文档号，文档内出现的次数，和绝对位置。

在中文倒排索引中，主要分为单字倒排索引，词倒排索引，和n-gram倒排索引。

其中，单字倒排索引优点是查全率高，但查找速度较慢，存储空间消耗较大，查准率较低。

词倒排索引，通常采用分词技术提取文本中的词。优点是检索速度快，查准率较高，存储空间消耗小。

n-gram倒排索引以固定长度为q的字符串做索引项。特点是不需要分词，容忍错误率高，具有语言独立性，不需用停用词，检索速率快，查准率较高。但其存储占用空间较大。

2. 中文Bigram二级哈希索引结构及创建。

2.1 中文Bigram二级哈希索引

自定义哈希函数：

可以把GB2312编码中每个汉字的区位码都映射到一个一维的整数空间。公式中，Q代表汉字的区码，W代表汉字的位码。计算出的哈希值的范围在0~6763。

在bigram项中，以第一个汉字的哈希值做为一级索引，以第二个汉字的哈希值作为二级索引。此结构能在O（2）时间内找到任何bigram项在文本库中的集合。

2.2 索引创建

class Linearmap(object):
    def __init__(self, size=6763):
        self.second_index = [[] for i in range(size)]

class HashTable(object):
    def __init__(self, size=6763):
        self.first_index = []
        for i in range(size):
            self.first_index.append(Linearmap())

    def add(self, first_hash_value, second_hash_value, inverse_table):
        m=self.first_index[first_hash_value]
        m.second_index[second_hash_value].append(inverse_table)

    def update(self, first_hash_value, second_hash_value, inverse_table):
        m = self.first_index[first_hash_value]
        m.second_index[second_hash_value] = inverse_table

def get_hash_value(bi_words):
    encode_word = bi_words.encode('gb2312')
    hash_word = ''.join(map(lambda x: str(ord(chr(x))-160), encode_word))
    Q = int(hash_word[0:2])
    W = int(hash_word[2:4])
    if 1 <= Q<= 55:
        hash_value = (Q-16)*94+W-1
    elif Q> 55:
        hash_value = (Q-16)*94+W-6
    return hash_value

def text_handlr(text, document_ID):
    text =text.replace('?', '').replace('。', '')
    step = 0
    len_text = len(text)
    for i in range(len(text)-1):
        absolute_address = []
        bigram_words = text[i:i + 2]
        first_hash_value = get_hash_value(bigram_words[0])
        second_hash_value = get_hash_value(bigram_words[1])
        inverse_table = hash_table.first_index[first_hash_value].second_index[second_hash_value]
        if inverse_table:
            d_x = inverse_table[-1][0]
        else:
            d_x = -1
        if d_x != document_ID:
            absolute_address.append(step)
            inverse_table.append([document_ID, len_text, absolute_address])
            hash_table.add(first_hash_value, second_hash_value, inverse_table)
            hash_table.add(first_hash_value, second_hash_value, text)
        elif d_x == document_ID:
            inverse_table[-1][2].append(step)
            hash_table.update(first_hash_value, second_hash_value, inverse_table)
        i += 2
        step += 1

示例：

3. 实验分析

对数据库中每一条问句语料，我们构建一个文本地址，共166250个文本。构建bigram二级哈希索引。hashtable的大小约290MB。实际上，表是很稀疏的。

在索引过程中，获得测试句子的bigram项，计算bigram项的一级哈希值和二级哈希值，根据hashtable索引到包含这一bigram项的所有倒排列表。获取倒排列表中文档编号。实际上，这是一个很宽泛的范围。接下来，我们再对相邻bigram项的文档编号做交集，来获得包含trigram项的集合。这样，就会把范围进一步缩小很多。且索引速度在1~20毫秒之间。

用约1600多条在es搜索上表现良好的（结果相似度在0.9以上）句子做测试。最终通过es搜索获得的最佳结果包含在通过bigram检索的集合中的概率大概在82%。

4. 思考

1. 通过bigram二级哈希索引，我们既能做到模糊匹配，也能做到精确匹配。且耗时较短。

一般一个短句在1~2毫秒左右。

2. 对于模糊匹配我们如何对结果做rank（如果不考虑相似度计算）。

3. 由于每个txt文本很短，实际上倒排列表我们用的信息很少。

4. 对于文本中出现的英文符号和数字等其他符号无法处理。

5. 模型的更新需要重新训练。

5. 改进

1. 直接用原文本代替倒排列表存储在hashtable中，table的大小变为270MB(变小了？)。

2. 针对索引结果，考虑做两轮交集。

如果集合存在：

精确匹配；

如果集合不存在或者没有精确匹配：

计算原句与结果中句子的长度的差，如果差值小于3，输出结果。最终结果的范围较小。

最后的测试结果在73%左右。（不受标点符号影响结果应该在80%以上）

3. 考虑以这种方式存入数据库，不用训练模型。方便更新。

def match_sentences(sent):
    text = sent.replace('？', '').replace('。', '').replace('，', '')
    len_text = len(text)
    match_scent = []
    res = []
    for i in range(len(text) - 1):
        bigram_words = text[i:i + 2]
        first_hash_value = cbp.get_hash_value(bigram_words[0])
        second_hash_value = cbp.get_hash_value(bigram_words[1])
        # inverse_table = model2.first_index[first_hash_value].second_index[second_hash_value]
        # document_id = [table[0] for table in inverse_table]
        # res.append({bigram_words: document_id})
        res_result = model1.first_index[first_hash_value].second_index[second_hash_value]
        # print(res_result)
        res.append(res_result)

    return res
    
def get_best_result(sent, res):
    result = res.copy()
    count = 1
    # print(lenth)
    while count < 3:
        two_intersection = []
        for i in range(len(result) - 1):
            intersection = list(set(result[i]).intersection(set(result[i+1])))
            if intersection:
                two_intersection.append(intersection)
        # print(two_intersection)
        count += 1
        if not two_intersection:
            break
        else:
            result = two_intersection

    rres = []
    for ress in result:
        if sent in ress:
            result = sent
            return  result
        for item in ress:
            l = abs(len(sent) - len(item))
            if l < 6:
                rres.append(item)
    result = rres
    return result

gdp1368@outlook.com 's Blog

1. 基础知识：

1.1 中文GB2312编码

1.2 倒排索引

2. 中文Bigram二级哈希索引结构及创建。

2.1 中文Bigram二级哈希索引

2.2 索引创建

3. 实验分析

4. 思考

5. 改进

导航

最近发表

友情链接

gdp1368@outlook.com 's Blog

1. 基础知识：

1.1 中文GB2312编码

1.2 倒排索引

2. 中文Bigram二级哈希索引结构及创建。

2.1 中文Bigram二级哈希索引

2.2 索引创建﻿﻿﻿﻿﻿

3. 实验分析

4. 思考

5. 改进

导航

最近发表

友情链接

2.2 索引创建