A Tool for Extracting Word Lists from StarDict-Format Dictionaries

Tech 2022-05-20

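Today a friend built a word-completion feature for Emacs. It works really well, but the word list that ships with it is fairly small, so I wrote a word extraction tool that can pull the words out of StarDict-format dictionaries.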

StarDict stores all of its words in dictname.idx, in the following format:

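Each word is followed by a '\0' that marks the end of the word, and then by 8 bytes that describe the word's offset in the dictionary data and the length of its definition (4 bytes each, big-endian).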

Take the first word, a, as an example:

61 00 00 00 00 00 00 00 03 E4

All we need is the word itself, so the 8 bytes after it can simply be skipped.

The code is as follows:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_WORD 256   /* StarDict words are expected to be shorter than 256 bytes */

    int main(int argc, char *argv[])
    {
        FILE *fp;
        int c;
        size_t word_pos = 0;
        char word[MAX_WORD];

        if (argc != 2) {
            fprintf(stderr, "usage: ew <dictname.idx>\n");
            exit(1);
        }

        fp = fopen(argv[1], "rb");   /* read-only; the file is never modified */
        if (fp == NULL) {
            perror(argv[1]);
            exit(1);
        }

        while ((c = fgetc(fp)) != EOF) {
            if (c != 0) {
                /* Collect word bytes, dropping any that would overflow the buffer. */
                if (word_pos < MAX_WORD - 1)
                    word[word_pos++] = (char)c;
                continue;
            }

            /* A '\0' terminates the current word. */
            word[word_pos] = '\0';
            if (word_pos > 0) {
                printf("%s\n", word);
                /* Skip the 8 bytes of offset and length. */
                if (fseek(fp, 8, SEEK_CUR) != 0)
                    break;
            }
            word_pos = 0;
        }

        fclose(fp);
        return 0;
    }

Using it is simple. For example, to extract the words from cdict-gb.idx:

    $ ./a.out cdict-gb.idx > wordlist.txt

This produces the word list, with one word per line.
