In Rust We Thrust

A few clever tricks for the Rust language.

Slicing strings by Unicode characters

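Rust strings (&str, String) are, at bottom, sequences of 8-bit unsigned integers encoded as UTF-8. Because UTF-8 is a variable-length encoding, the Unicode character at a given character position cannot be located in $O(1)$ time, and The Book points out that the index operator [] in Rust is expected to be $O(1)$. String therefore does not support reading a character by index (a String is not a sequence of char, so it could not return a &char as the signature of ops::Index::index requires anyway), and slicing likewise takes UTF-8 byte offsets as its indices.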
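To make the byte-versus-character distinction concrete, here is a minimal sketch. The helper names byte_and_char_len and byte_slice are hypothetical and exist only for illustration; the standard-library calls they wrap (str::len, str::chars, str::get) are real.

```rust
/// Returns (length in bytes, number of Unicode scalar values) for `s`.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // len() is O(1) and counts bytes; chars().count() walks the string.
    (s.len(), s.chars().count())
}

/// Slices `s` by byte offsets. Returns None instead of panicking when an
/// offset is out of range or falls inside a multi-byte character.
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For "你好 Rust!" the first helper reports 12 bytes but only 8 characters, and byte_slice(s, 0, 1) yields None because byte 1 lies in the middle of '你'. Plain indexing with &s[0..1] would panic in the same situation.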

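However, str provides the .chars() and .char_indices() methods, which return iterators over char and (usize, char) respectively, where the usize is the byte offset at which the character starts. Both yield Unicode characters (char), so these two iterators can be used for indexing and slicing.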

let index: usize = 1;
let my_string = "你好 Rust!";
let ch: Option<char> = my_string.chars().nth(index); // Some('好')
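The same idea works with .char_indices(), which is what connects character positions to byte offsets. The following is only a sketch, and byte_offset_of_char is a hypothetical name chosen for illustration:

```rust
/// Byte offset at which the `n`-th character (0-based) of `s` starts,
/// or None if `s` has `n` or fewer characters.
fn byte_offset_of_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}
```

In "你好 Rust!" the character at position 1 ('好') starts at byte 3, because '你' occupies three bytes in UTF-8.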

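Slicing is a bit more involved. Here index is a range of character positions (with start and end fields) rather than a single usize: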

assert!(
  index.end >= index.start,
  "Start index should not be greater than end index, but {} is greater than {}",
  index.start,
  index.end
);

let mut it = my_string.char_indices().skip(index.start).peekable(); // peek to read the byte offset at start without consuming it

let start = match it.peek() {
  Some((idx, _)) => *idx,
  None => panic!("Start index {} is out of bounds", index.start),
};

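// nth() returns the character at position index.end, or None if the string
// ends first; take(..).last() would return the final character instead and
// silently drop it from the slice.
let end = match it.nth(index.end - index.start) {
  Some((idx, _)) => idx,
  None => my_string.len(),
};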

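// safe: indexing checks that both offsets lie on char boundaries
let safe_slice = &my_string[start..end];
// unsafe: skips those checks; both offsets come from char_indices() or len()
let unsafe_slice = unsafe { my_string.get_unchecked(start..end) };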
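For reference, the steps above can be wrapped into one function. This is a sketch under stated assumptions, not the original snippet verbatim: the name slice_chars is hypothetical, it locates the end offset with nth(), and it clamps out-of-range positions to the end of the string instead of panicking.

```rust
use std::ops::Range;

/// Hypothetical helper: slices `s` by character positions, not byte offsets.
/// Positions past the last character are clamped to the end of the string.
fn slice_chars(s: &str, range: Range<usize>) -> &str {
    assert!(
        range.end >= range.start,
        "start {} is greater than end {}",
        range.start,
        range.end
    );

    // A single pass over the (byte offset, char) pairs, beginning at `start`.
    let mut it = s.char_indices().skip(range.start).peekable();

    // Peek so that the character at `start` is not consumed yet.
    let start = match it.peek() {
        Some(&(idx, _)) => idx,
        None => s.len(), // assumption: clamp instead of panicking
    };

    // The character at position `end` marks the exclusive byte bound.
    let end = match it.nth(range.end - range.start) {
        Some((idx, _)) => idx,
        None => s.len(),
    };

    &s[start..end]
}
```

With this sketch, slice_chars("你好 Rust!", 0..2) returns "你好" and slice_chars("你好 Rust!", 3..7) returns "Rust". Whether to clamp or panic on out-of-range positions is a design choice; clamping is used here only to keep the example short.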

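This approach calls .char_indices() only once, allocates nothing on the heap, and does not have to traverse the whole string, so it is somewhat faster than the implementation in the utf8_slice crate in the benchmark below.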

test slice_my_snippet            ... bench:          12 ns/iter (+/- 1)
test slice_with_byte_index       ... bench:           0 ns/iter (+/- 0)
test slice_with_utf8_slice_crate ... bench:         121 ns/iter (+/- 16)