一次脚本语言的性能比拼(下)--C++, C, Unix Shell

简介：文章描述了一个统计不同单词出现次数的常见面试题目，并使用了不同的语言实现并且比较它们之间的性能。对于每一种语言，均包含实现的简单版本与优化版本。

C++

我已经很久没有使用 C++ 去编程了：显然 C++14, 17, 以及 20 增加了许多新的特性，而且更简洁，但是错误提示仍然还是一团糟。C++ 实现如下：

simple.cpp

int main() {
    std::string word;
    std::unordered_map<std::string, int> counts;
    while (std::cin >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
            [](unsigned char c){ return std::tolower(c); });
        ++counts[word];
    }
    if (std::cin.bad()) {
        std::cerr << "error reading stdin\n";
        return 1;
    }

    std::vector<std::pair<std::string, int>> ordered(counts.begin(),
        counts.end());
    std::sort(ordered.begin(), ordered.end(),
        [](auto const& a, auto const& b) { return a.second>b.second; });

    for (auto const& count : ordered) {
        std::cout << count.first << " " << count.second << "\n";
    }
}

当开始着手优化，第一件应该做的事情是开启编译优化，确认将 -O2 参数加入到编译命令中。在 Go 语言中则不用关心这点—因为 Go 总是在默默优化您的代码。

我留意到 I/O 比较慢，实际上，可以在每次 I/O 操作后禁用与 C stdio 函数的同步，仅做这一点修改，就让程序运行几乎快了两倍：

ios::sync_with_stdio(false);

GCC 可以使用 gprof 生成性能报告。生成的报告大概像下面这样（我真的没有骗你）：

index % time    self  children    called     name
                                  13             frame_dummy [1]
[1]    100.0    0.01    0.00       0+13      frame_dummy [1]
                                  13             frame_dummy [1]
-----------------------------------------------
                0.00    0.00   32187/32187       std::vector<std::pair\
<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat\
or<char> >, int>, std::allocator<std::pair<std::__cxx11::basic_string<\
char, std::char_traits<char>, std::allocator<char> >, int> > >::vector\
<std::__detail::_Node_iterator<std::pair<std::__cxx11::basic_string<ch\
ar, std::char_traits<char>, std::allocator<char> > const, int>, false,\
 true>, void>(std::__detail::_Node_iterator<std::pair<std::__cxx11::ba\
sic_string<char, std::char_traits<char>, std::allocator<char> > const,\
 int>, false, true>, std::__detail::_Node_iterator<std::pair<std::__cx\
x11::basic_string<char, std::char_traits<char>, std::allocator<char> >\
 const, int>, false, true>, std::allocator<std::pair<std::__cxx11::bas\
ic_string<char, std::char_traits<char>, std::allocator<char> >, int> >\
 const&) [11]
[8]      0.0    0.00    0.00   32187         void std::__cxx11::basic_\
string<char, std::char_traits<char>, std::allocator<char> >::_M_constr\
uct<char*>(char*, char*, std::forward_iterator_tag) [8]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [17]
[9]      0.0    0.00    0.00       1         _GLOBAL__sub_I_main [9]
...

真让人头疼！我更喜欢去了解 malloc 以及 scanf 而不是 std::basic_istream >& std::operator>>, std::allocator >(std::basic_istream >&, std::__cxx11::basic_string, std::allocator >&).

我真不想破译这个输出，所以我放弃了，而把精力花在优化的 C 版本上。显然，你可以用 C++ 做更多的事情。但是，我怀疑它最终会变得越来越低级和更像 C（至少在我对现代 C++ 的有限知识的情况下），如果想了解更多，直接看下面的 C 优化后的代码吧。

我在 C 版本中使用了 Valgrind 分析工具（Callgrind）——请参阅下面的部分以获取有关说明。Andrew Gallant 指出我可以尝试 Linux perf 工具（特别是 perf record 和 perf report ）——它看起来确实比 gprof 更好一些。

更新: Jussi Pakkanen 和 Adev 等人优化了 C++ 实现。感谢！

C

C 是一头永不消亡的美丽野兽：快速、灵活且简单。我喜欢它，因为（与 C++ 不同）我可以理解它，并且我可以随心所欲地去修改任何东西。它也无处不在（Linux 内核、Redis、PostgreSQL、SQLite、许多库……不胜枚举），而且它不会很快就被淘汰。

C 在其标准库中没有哈希表数据结构。但是，它可以使用 libc ，它具有 hcreate 和 hsearch 哈希表函数，所以我们这里允许一个小例外，并使用那些 libc-but-not-stdlib 函数。在优化版本中，我们将实现自己的哈希表。

一个小问题是，调用 hcreate 必须预先指定最大可能的表大小。我知道唯一词的数量大约是 30,000，所以我将其设为 60,000。

simple.c

#define MAX_UNIQUES 60000

typedef struct {
    char* word;
    int count;
} count;

// Comparison function for qsort() ordering by count descending.
int cmp_count(const void* p1, const void* p2) {
    int c1 = ((count*)p1)->count;
    int c2 = ((count*)p2)->count;
    if (c1 == c2) return 0;
    if (c1 < c2) return 1;
    return -1;
}

int main() {
    // The hcreate hash table doesn't provide a way to iterate, so
    // store the words in an array too (also used for sorting).
    count* words = calloc(MAX_UNIQUES, sizeof(count));
    int num_words = 0;

    // Allocate hash table.
    if (hcreate(MAX_UNIQUES) == 0) {
        fprintf(stderr, "error creating hash table\n");
        return 1;
    }

    char word[101]; // 100-char word plus NUL byte
    while (scanf("%100s", word) != EOF) {
        // Convert word to lower case in place.
        for (char* p = word; *p; p++) {
            *p = tolower(*p);
        }

        // Search for word in hash table.
        ENTRY item = {word, NULL};
        ENTRY* found = hsearch(item, FIND);
        if (found != NULL) {
            // Word already in table, increment count.
            int* pn = (int*)found->data;
            (*pn)++;
        } else {
            // Word not in table, insert it with count 1.
            item.key = strdup(word); // need to copy word
            if (item.key == NULL) {
                fprintf(stderr, "out of memory in strdup\n");
                return 1;
            }
            int* pn = malloc(sizeof(int));
            if (pn == NULL) {
                fprintf(stderr, "out of memory in malloc\n");
                return 1;
            }
            *pn = 1;
            item.data = pn;
            ENTRY* entered = hsearch(item, ENTER);
            if (entered == NULL) {
                fprintf(stderr, "table full, increase MAX_UNIQUES\n");
                return 1;
            }

            // And add to words list for iterating.
            words[num_words].word = item.key;
            num_words++;
        }
    }

    // Iterate once to add counts to words list, then sort.
    for (int i = 0; i < num_words; i++) {
        ENTRY item = {words[i].word, NULL};
        ENTRY* found = hsearch(item, FIND);
        if (found == NULL) { // shouldn't happen
            fprintf(stderr, "key not found: %s\n", item.key);
            return 1;
        }
        words[i].count = *(int*)found->data;
    }
    qsort(&words[0], num_words, sizeof(count), cmp_count); 

    // Iterate again to print output.
    for (int i = 0; i < num_words; i++) {
        printf("%s %d\n", words[i].word, words[i].count);
    }

    return 0;
}

有很多看起来并不怎么优雅的空间分配以及错误检查代码，但是这并不是什么大问题。值得关心的问题应该在 scanf 背后的拆词逻辑，hsearch 背后的哈希表操作。值得一提的是，代码开箱即用，并且在 Linux 上的可执行文件仅有 17KB 。

在性能分析阶段，我尝试使用 gprof ，但是貌似没有提供给我太多有用的信息，可能是我采样的频率不够？因此我转而使用 Valgrind 分析工具—Callgrind—进行分析。这是我第一次使用，但不影响它是一个神奇而强大的工具。

使用 gcc -g 构建后，执行下面命令生成性能分析结果：

valgrind --tool=callgrind ./simple-c <kjvbible_x10.txt >/dev/null

结果显示，scanf 是罪魁祸首，其次是 hsearch 。下面是我们要优化的三个地方：

按块读取输入，就像 Go 和 Python 那样以避免使用 scanf。
尽可能对于输出只处理一遍：在分割单词时就完成转换小写以及计算哈希的操作。
使用 FNV-1 hash function 自定义哈希表。

optimized.c

#define BUF_SIZE 65536
#define HASH_LEN 65536  // must be a power of 2
#define FNV_OFFSET 14695981039346656037UL
#define FNV_PRIME 1099511628211UL

// Used both for hash table buckets and array for sorting.
typedef struct {
    char* word;
    int word_len;
    int count;
} count;

// Comparison function for qsort() ordering by count descending.
int cmp_count(const void* p1, const void* p2) {
    int c1 = ((count*)p1)->count;
    int c2 = ((count*)p2)->count;
    if (c1 == c2) return 0;
    if (c1 < c2) return 1;
    return -1;
}

count* table;
int num_unique = 0;

// Increment count of word in hash table (or insert new word).
void increment(char* word, int word_len, uint64_t hash) {
    // Make 64-bit hash in range for items slice.
    int index = (int)(hash & (uint64_t)(HASH_LEN-1));

    // Look up key, using direct match and linear probing if not found.
    while (1) {
        if (table[index].word == NULL) {
            // Found empty slot, add new item (copying key).
            char* word_copy = malloc(word_len);
            if (word_copy == NULL) {
                fprintf(stderr, "out of memory\n");
                exit(1);
            }
            memmove(word_copy, word, word_len);
            table[index].word = word_copy;
            table[index].word_len = word_len;
            table[index].count = 1;
            num_unique++;
            return;
        }
        if (table[index].word_len == word_len &&
                memcmp(table[index].word, word, word_len) == 0) {
            // Found matching slot, increment existing count.
            table[index].count++;
            return;
        }
        // Slot already holds another key, try next slot (linear probe).
        index++;
        if (index >= HASH_LEN) {
            index = 0;
        }
    }
}

int main() {
    // Allocate hash table buckets.
    table = calloc(HASH_LEN, sizeof(count));
    if (table == NULL) {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    char buf[BUF_SIZE];
    int offset = 0;
    while (1) {
        // Read file in chunks, processing one chunk at a time.
        size_t num_read = fread(buf+offset, 1, BUF_SIZE-offset, stdin);
        if (num_read+offset == 0) {
            break;
        }

        // Find last space or linefeed in buf and process up to there.
        int space;
        for (space = offset+num_read-1; space>=0; space--) {
            char c = buf[space];
            if (c <= ' ') {
                break;
            }
        }
        int num_process = (space >= 0) ? space : (int)num_read+offset;

        // Scan chars to process: tokenize, lowercase, and hash as we go.
        int i = 0;
        while (1) {
            // Skip whitespace before word.
            for (; i < num_process; i++) {
                char c = buf[i];
                if (c > ' ') {
                    break;
                }
            }
            // Look for end of word, lowercase and hash as we go.
            uint64_t hash = FNV_OFFSET;
            int start = i;
            for (; i < num_process; i++) {
                char c = buf[i];
                if (c <= ' ') {
                    break;
                }
                if (c >= 'A' && c <= 'Z') {
                    c += ('a' - 'A');
                    buf[i] = c;
                }
                hash *= FNV_PRIME;
                hash ^= (uint64_t)c;
            }
            if (i <= start) {
                break;
            }
            // Got a word, increment count in hash table.
            increment(buf+start, i-start, hash);
        }

        // Move down remaining partial word.
        if (space >= 0) {
            offset = (offset+num_read-1) - space;
            memmove(buf, buf+space+1, offset);
        } else {
            offset = 0;
        }
    }

    count* ordered = calloc(num_unique, sizeof(count));
    for (int i=0, i_unique=0; i<HASH_LEN; i++) {
        if (table[i].word != NULL) {
            ordered[i_unique++] = table[i];
        }
    }
    qsort(ordered, num_unique, sizeof(count), cmp_count);
    for (int i=0; i<num_unique; i++) {
        printf("%.*s %d\n",
                ordered[i].word_len, ordered[i].word, ordered[i].count);
    }

    return 0;
}

上面大约有 150 行（包括空白和注释），它绝对是迄今为止最大的程序！哈希表中使用线性滚动查询并没有很多代码。另外，有一个固定大小的哈希表可能不是很好，实现动态调整哈希表大小也不是很容易，并且不会显着减少运行时间，所以我把它作为练习留给你了。

它仍然只有 17KB 的可执行文件（这就是我喜欢 C 的地方）。而且，不出所料，这是最快的版本——比优化的 Go 版本快一点，因为我们对于哈希表自己实现了线性滚动查询，这让我们处理的字节更少。

如后面所要总结的结果所示，这个版本仅比 wc -w 在相同输入上慢 15% 左右。wc 所做的是统计单词数量，而没有统计唯一单词出现的数目，因此它不需要哈希表。

当然，这个程序比令人难以置信的 GNU grep 慢得多。GNU grep 的原作者 Mike Haertel 在 2010 年有一条著名的邮件列表消息，关于为什么 GNU grep 速度快。这是一篇引人入胜的读物——然而，正如 Andrew Gallant（ripgrep的作者）指出的那样，这篇文章有些过时了。Mike 的建议是好的，但是现在 GNU grep 的速度很快并不是因为使用 Boyer-Moore 算法，而是因为它使用了 memchrglibc 中的快速矢量实现。

我相信你还可以写出更优化的版本，考虑内存映射 I/O，用更高级的数据结构进行计数等等。但这已经足够了！

Unix shell

Doug McIlroy 提供了一个只有基本 Unix 命令行工具的解决方案：

tr 'A-Z' 'a-z' | tr -s ' ' '\n' | sort | uniq -c | sort -nr

它非常慢，部分原因是它必须一次对整个文件进行排序，而不是使用哈希表进行计数（将整个文件读入内存实际上违反了我预先设置的约束）。但是，如果启用 LC_ALL=C 标志（仅限 ASCII），sort 的速度会提高 5 倍，这比许多其他语言中优化过的实现都要快！

我们还可以通过使用空间换时间的策略，给它一个更大的缓冲区来获得更多的性能 sort -S 2GB。有趣的是，sort 的 —parallel 选项在这个问题上没有帮助。所以优化后的版本如下：

tr 'A-Z' 'a-z' | tr -s ' ' '\n' | LC_ALL=C sort -S 2G | uniq -c | \
    sort -nr

对于相对较小的文件来说，这个方案已经足够好了，但是如果想处理大文件，而不是将整个文件读入内存进行排序，使用 AWK 或 Python 版本的程序会更好。

这个方案的输出与其他程序的输出不太一样，如下所示：

      4 the
      2 foo
      1 defenestration

我们可以用 awk ‘{ print $2, $1 }’ 命令来解决这个问题，不过既然要使用 awk ，不妨直接使用代码库中提供的优化后的 awk 程序。

其他语言

许多读者为 benhoyt/countwords 做出了许多贡献，在此感谢！

注意，我不再接受新的代码贡献了。

其他语言实现请查看作者仓库 benhoyt/countwords ，暂不在此罗列。

性能分析结果与总结

下面是在我的笔记本电脑上运行这些程序的性能数据（带有 SSD 的 64 位 Linux ）。我将每个测试运行了五次，并以最短时间作为结果。每次运行基本等同于执行以下命令：

time $PROGRAM <kjvbible_x10.txt >/dev/null

时间以秒为单位，下面列表按简单实现的版本的执行时间排序，最快的在前。注意，实际上 grep 以及 wc 并没有解决单词统计问题，列在这里只是为了比较。

Language	Simple	Optimized	Notes
grep	0.04	0.04	grep baseline; optimized sets LC_ALL=C
wc -w	0.28	0.20	wc baseline; optimized sets LC_ALL=C
Zig	0.55	0.24	by ifreund and matu3ba and ansingh
Nim	0.77	0.49	by csterritt and euantorano
C	0.96	0.23
Go	1.12	0.40
OCaml	1.18		by Nate Dobbins and Pavlo Khrystenko
Crystal	1.29		by Andrea Manzini
Rust	1.38	0.43	by Andrew Gallant
Java	1.40	1.34	by Iulian Plesoianu
PHP	1.40		by Max Semenik
C#	1.50	0.82	by J Taylor, Y Ostapenko, O Turan
C++	1.69	0.27	optimized by Jussi P, Adev, Nathan M
Perl	1.81		by Charles Randall
Kotlin	1.81		by Kazik Pogoda
F#	1.81	1.60	by Yuriy Ostapenko
JavaScript	1.88	1.10	by Dani Biro and Flo Hinze
D	2.05	0.74	by Ross Lonstein
Python	2.21	1.33
Lua	2.50	2.00	by themadsens; runs under luajit
Ruby	3.17	2.47	by Bill Mill
AWK	3.55	1.13	optimized uses mawk
Pascal	3.67		by Osman Turan
Forth	4.22	1.45
Swift	4.23		by Daniel Muellenborn
Common Lisp	4.97		by Brad Svercl
Tcl	6.82		by William Ross
Haskell	12.81		by Adrien Glauser
Shell	14.81	1.83	optimized does LC_ALL=C sort -S 2G

我们可以从中学到什么呢？这里有一些想法同您分享：

未经优化的简单代码最具解释性，在现实任务中程序员更倾向书写这样的代码。
几乎可以肯定，您不会特意去实现 C 的优化版本，除非您要写一个新的 GNU wordfreq - 工具或者类似的东西。书写程序的过程中太容易出错了，如果您想要一个在安全语言下的快速程序，推荐使用 Go 或者 Rust。
如果您只需要一个快速的解决方案（很可能），Python 和 AWK 对这种文本处理来说非常棒。
C++ 模板在性能分析工具中产生几乎不可读的错误消息和函数名称。
我仍然认为这是个很好的面试问题，但显然我不会强制要求面试者能够在白板上写出优化的解决方案之一。
我们通常认为 I/O 很消耗大部分时间，但 I/O 并不是这里的瓶颈。在基准测试的情况下，文件可能已被缓存，即使没有，如今硬盘读取速度也已经非常快了。整体来看，程序的瓶颈在于单词分割与哈希操作。

{测试窝原创译文，译者：lukeaxu}