大规模中英文单词模糊搜索问题的分析

deltamaster posted @ 14 年前 in 人工智能 with tags 索引人工智能匹配模糊搜索模式识别 , 6049 阅读

　　最近有一个关于在大规模的中英文词库中的模糊搜索问题。

　　这个问题具有几个特点：

每条记录都比较短。
记录的数量多，大约600万条记录。
对解的数量要求高于对解的质量要求。
有对中英文支持的要求。

　　对于单一的字符串模糊匹配问题，实际上是单纯的人工智能问题，站在模式识别的角度，就是计算与目标单词的距离，我们实际上需要通过求最一般合一来算出单词之间的编辑距离，这个编辑距离越大，匹配度就越低。如果想更进一步，那么借助遗传算法的思想，随着用户的操作来动态调整每一步编辑的权值（默认所有编辑行为的权值都为1），如果用户认可某次匹配，则对此次匹配所用到的所有编辑行为降低权值。

　　而对于这样的一个实际问题，首先我们不可能将600万条记录全部取出逐一检测匹配度，势必要进行一个事先的筛选。因此，大致的流程应该是这样：

针对所有数据建立倒排索引。
从索引查找到疑似的匹配项。
对疑似匹配项的匹配度进行逐个检查，去掉匹配度低的结果。
按匹配度从高到低输出结果。

　　针对此问题，还有以下一些问题需要解决。

　　而求编辑距离必须使用基于代价函数的广度优先搜索，否则，无法保证最优解，在权值全部相同的情况下，就等同于一般的广度优先搜索，修改步数增加以后，内存开销明显增加。

　　为了减小索引，加快索引速度，必须要合并类似的索引项，对于中文，根据拼音-汉字库将汉字一律合并至为对应拼音是可行的。

　　但是对于英文，不可能以字符为单位进行索引，而如果以单词为单位进行索引，拼写错误就无法从索引中找出相应项。如果对无法找到的项进行广度优先的编辑，试图从索引中找出最近似的条目，对于索引查询的开销也是极高的。

　　单词拆分对索引也是一个考验，因为如果某个单词无法匹配，其实则可能可以进行拆分，比如helloworld，有9种方法分成两个单词，那么需要对9种拆分方法逐个进行查询，开销也是很大的。如果在这个基础上再考虑拼写错误，那么开销就是平方级的增长了。

* 本文在CC BY-SA（署名-相同方式共享）协议下发布。

相关文章
全局相关文章

[回复]

JSC Result Comilla B 说:
3 年前

Comilla board is another education board working under Secondary and Higher Secondary Education, Bangladesh, and the education board is also successfully completed the Grade 8 terminal examinations at all selected examination test centers at Comilla division, and the Junior School Certificate and Junior Dakhil Certificate terminal examination is a second largest examination in the country. JSC Result Comilla Board The Comilla Board is also completed those subject wise exams between 1st to 3rd week of November as per schedule announced by School Education department, and there are a huge number of students are participated like as all other educational boards from all districts of the division, right now they are waiting to check JSC & JDC Result 2022 Comilla Board with subject wise marks and total marksheet with final CGPA grade of the student.

[回复]

BSNL Broadband Plan 说:
2 年前

ISP provides unlimited calls to any network round the clock in all the BSNL broadband plans over fiber and DSL networks, and Here we update the latest BSNL broadband unlimited plans daily across India as and when the update released for home and business tariff, BSNL Broadband Plan So check each plan in detail to opt for best tariff. BSNL broadband plans over DSL and Bharat Fiber (FTTH) technologies are available, we categorize each plan and provide the updated information of all the circles with new plans and tariff.

Deltamaster Tech Center

大规模中英文单词模糊搜索问题的分析

deltamaster

搜索

分类

标签云

RSS

功能

链接

最新评论

最新留言