Item 1: Use the type system to express your data structures
"who called them programers and not type writers" – @thingskatedid
This Item provides a quick tour of Rust's type system, starting with the fundamental types that the compiler makes available, then moving on to the various ways that values can be combined into data structures.
Rust's enum type then takes a starring role. Although the basic version is equivalent to what other languages provide, the ability to combine enum variants with data fields allows for enhanced flexibility and expressivity.
Fundamental Types
The basics of Rust's type system are pretty familiar to anyone coming
from another statically typed programming language (such as C++, Go, or Java).
There's a collection of integer types with specific sizes, both signed (i8, i16, i32, i64, i128) and unsigned (u8, u16, u32, u64, u128).
There are also signed (isize) and unsigned (usize) integers whose sizes match the pointer size on the target system. You won't be doing much in the way of converting between pointers and integers with Rust, so that size equivalence isn't really relevant. However, standard collections return their size as a usize (from .len()), so collection indexing means that usize values are quite common. That is fine from a capacity perspective, as there can't be more items in an in-memory collection than there are memory addresses on the system.
The integral types do give us the first hint that Rust is a stricter world than C++. In Rust, attempting to put a larger integer type (i32) into a smaller integer type (i16) generates a compile-time error:
let x: i32 = 42;
let y: i16 = x;
error[E0308]: mismatched types
  --> src/main.rs:18:18
   |
18 |     let y: i16 = x;
   |            ---   ^ expected `i16`, found `i32`
   |            |
   |            expected due to this
   |
help: you can convert an `i32` to an `i16` and panic if the converted value
      doesn't fit
   |
18 |     let y: i16 = x.try_into().unwrap();
   |                   ++++++++++++++++++++
This is reassuring: Rust is not going to sit there quietly while the programmer does things that are risky. Although we can see that the values involved in this particular conversion would be just fine, the compiler has to allow for the possibility of values where the conversion is not fine:
let x: i32 = 66_000;
let y: i16 = x; // What would this value be?
The error output also gives an early indication that while Rust has stronger rules, it also has helpful compiler messages that point the way to how to comply with the rules. The suggested solution raises the question of how to handle situations where the conversion would have to alter the value to fit, and we'll have more to say on both error handling (Item 4) and using panic! (Item 18) later.
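As a minimal sketch of the two explicit ways to narrow an integer (the helper names checked_narrow and wrapping_narrow are hypothetical, chosen only for this illustration): the standard TryFrom trait reports a value that doesn't fit, whereas an as cast silently keeps only the low-order bits.

```rust
use std::convert::TryFrom; // already in the prelude from the 2021 edition onward

/// Hypothetical helper: narrow an `i32` to an `i16`, returning `None`
/// when the value does not fit.
fn checked_narrow(x: i32) -> Option<i16> {
    // `i16::try_from` returns an `Err` for out-of-range values.
    i16::try_from(x).ok()
}

/// Hypothetical helper: narrow with an `as` cast, which silently
/// truncates to the low 16 bits.
fn wrapping_narrow(x: i32) -> i16 {
    x as i16
}
```

With these definitions, checked_narrow(42) yields Some(42) and checked_narrow(66_000) yields None, whereas wrapping_narrow(66_000) quietly yields 464 (66,000 minus 65,536).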
Rust also doesn't allow some things that might appear "safe", such as putting a value from a smaller integer type into a larger integer type:
let x = 42i32; // Integer literal with type suffix
let y: i64 = x;
error[E0308]: mismatched types
  --> src/main.rs:36:18
   |
36 |     let y: i64 = x;
   |            ---   ^ expected `i64`, found `i32`
   |            |
   |            expected due to this
   |
help: you can convert an `i32` to an `i64`
   |
36 |     let y: i64 = x.into();
   |                   +++++++
Here, the suggested solution doesn't raise the specter of error handling, but the conversion does still need to be explicit. We'll discuss type conversions in more detail later (Item 5).
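Because widening can never lose information, the standard library implements From<i32> for i64, so the explicit conversion needs no error handling. A minimal sketch (the function name widen is hypothetical):

```rust
/// Hypothetical helper: widen an `i32` to an `i64`. This cannot fail,
/// so the return type is a plain `i64` rather than an `Option` or `Result`.
fn widen(x: i32) -> i64 {
    // Equivalent to `x.into()` when the target type is known from context.
    i64::from(x)
}
```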
Continuing with the unsurprising primitive types, Rust has a bool type, floating point types (f32, f64), and a unit type () (like C's void).
More interesting is the char character type, which holds a Unicode value (similar to Go's rune type). Although this is stored as four bytes internally, there are again no silent conversions to or from a 32-bit integer.
This precision in the type system forces you to be explicit about what you're trying to express: a u32 value is different from a char, which in turn is different from a sequence of UTF-8 bytes, which in turn is different from a sequence of arbitrary bytes, and it's up to you to specify exactly which you mean.[1] Joel Spolsky's famous blog post can help you understand which you need.
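To make the distinction between these representations concrete, here is a small sketch using standard library calls (the function names bytes_to_text and byte_and_char_counts are hypothetical). Arbitrary bytes only become text if they pass UTF-8 validation, and the number of bytes in a string need not match its number of char values.

```rust
/// Hypothetical helper: interpret arbitrary bytes as text, returning
/// `None` if they are not valid UTF-8.
fn bytes_to_text(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Hypothetical helper: return (length in UTF-8 bytes, number of `char`s).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For instance, the single character U+00E9 occupies two bytes in UTF-8, so byte_and_char_counts("\u{e9}") gives (2, 1), and the byte 0xFF can never appear in valid UTF-8, so bytes_to_text(vec![0xff]) gives None.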
Of course, there are helper methods that allow you to convert between these different types, but their signatures force you to handle (or explicitly ignore) the possibility of failure. For example, a Unicode code point can always be represented in 32 bits,[2] so 'a' as u32 is allowed, but the other direction is trickier (as there are some u32 values that are not valid Unicode code points):
- char::from_u32: Returns an Option<char>, forcing the caller to handle the failure case.
- char::from_u32_unchecked: Makes the assumption of validity but has the potential to result in undefined behavior if that assumption turns out not to be true. The function is marked unsafe as a result, forcing the caller to use unsafe too (Item 16).
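A short sketch of the safe direction, assuming a caller that prefers a placeholder to a failure (the helper name to_char_lossy is hypothetical; char::from_u32 and char::REPLACEMENT_CHARACTER are standard library items):

```rust
/// Hypothetical helper: convert a `u32` to a `char`, substituting
/// U+FFFD (the Unicode replacement character) for invalid values.
fn to_char_lossy(value: u32) -> char {
    // `char::from_u32` returns `None` for surrogates (0xD800..=0xDFFF)
    // and for values above 0x10FFFF.
    char::from_u32(value).unwrap_or(char::REPLACEMENT_CHARACTER)
}
```

Here to_char_lossy(0x61) gives 'a', while a surrogate value such as 0xD800 gives the replacement character.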
Aggregate Types
Moving on to aggregate types, Rust has a variety of ways to combine related values. Most of these are familiar equivalents to the aggregation mechanisms available in other languages:
- Arrays: Hold multiple instances of a single type, where the number of instances is known at compile time. For example, [u32; 4] is four 4-byte integers in a row.
- Tuples: Hold instances of multiple heterogeneous types, where the number of elements and their types are known at compile time, for example, (WidgetOffset, WidgetSize, WidgetColor). If the types in the tuple aren't distinctive (for example, (i32, i32, &'static str, bool)) it's better to give each element a name and use a struct.
- Structs: Also hold instances of heterogeneous types known at compile time but allow both the overall type and the individual fields to be referred to by name.
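The three aggregate forms can be sketched side by side. The Label struct and the function names below are hypothetical, invented to show how naming the fields clarifies an otherwise indistinct (i32, i32, &'static str, bool) tuple.

```rust
/// Hypothetical struct that names each element of what could have been
/// an `(i32, i32, &'static str, bool)` tuple.
struct Label {
    x: i32,
    y: i32,
    text: &'static str,
    visible: bool,
}

/// Struct fields are accessed by name.
fn describe(label: &Label) -> String {
    format!(
        "{} at ({}, {}), visible: {}",
        label.text, label.x, label.y, label.visible
    )
}

/// An array has a length that is part of its type.
fn array_sum(values: [u32; 4]) -> u32 {
    values.iter().sum()
}

/// Tuple elements are accessed by position.
fn first_of(pair: (i32, &'static str)) -> i32 {
    pair.0
}
```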
Rust also includes the tuple struct, which is a crossbreed of a struct and a tuple: there's a name for the overall type but no names for the individual fields, which are referred to by number instead: s.0, s.1, and so on:
/// Struct with two unnamed fields.
struct TextMatch(usize, String);

// Construct by providing the contents in order.
let m = TextMatch(12, "needle".to_owned());

// Access by field number.
assert_eq!(m.0, 12);
enums
This brings us to the jewel in the crown of Rust's type system, the enum.
With the basic form of an enum, it's hard to see what there is to get excited about. As with other languages, the enum allows you to specify a set of mutually exclusive values, possibly with a numeric value attached:
enum HttpResultCode {
    Ok = 200,
    NotFound = 404,
    Teapot = 418,
}

let code = HttpResultCode::NotFound;
assert_eq!(code as i32, 404);
Because each enum definition creates a distinct type, this can be used to improve readability and maintainability of functions that take bool arguments. Instead of:
print_page(/* both_sides= */ true, /* color= */ false);
a version that uses a pair of enums:
pub enum Sides {
    Both,
    Single,
}

pub enum Output {
    BlackAndWhite,
    Color,
}

pub fn print_page(sides: Sides, color: Output) {
    // ...
}
is more type-safe and easier to read at the point of invocation:
print_page(Sides::Both, Output::BlackAndWhite);
Unlike the bool version, if a library user were to accidentally flip the order of the arguments, the compiler would immediately complain:
error[E0308]: arguments to this function are incorrect
   --> src/main.rs:104:9
    |
104 |         print_page(Output::BlackAndWhite, Sides::Single);
    |         ^^^^^^^^^^ ---------------------  ------------- expected `enums::Output`,
    |                    |                                    found `enums::Sides`
    |                    |
    |                    expected `enums::Sides`, found `enums::Output`
    |
note: function defined here
   --> src/main.rs:145:12
    |
145 |     pub fn print_page(sides: Sides, color: Output) {
    |            ^^^^^^^^^^ ------------  -------------
help: swap these arguments
    |
104 |         print_page(Sides::Single, Output::BlackAndWhite);
    |                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the newtype pattern (see Item 6) to wrap a bool also achieves type safety and maintainability; it's generally best to use the newtype pattern if the semantics will always be Boolean, and to use an enum if there's a chance that a new alternative (e.g., Sides::BothAlternateOrientation) could arise in the future.
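Item 6 covers the newtype pattern properly; as a rough sketch of what wrapping a bool might look like for this example, the following uses two hypothetical wrapper types (DoubleSided and ColorOutput are names invented here, and this print_page variant returns a description only so that its behavior is visible):

```rust
/// Hypothetical newtype: `true` means print on both sides of the paper.
pub struct DoubleSided(pub bool);

/// Hypothetical newtype: `true` means print in color.
pub struct ColorOutput(pub bool);

/// Describe the requested print job. Swapping the two arguments at a
/// call site is a type error, just as with the enum version.
pub fn print_page(sides: DoubleSided, color: ColorOutput) -> String {
    let sides = if sides.0 { "both sides" } else { "single side" };
    let color = if color.0 { "color" } else { "black and white" };
    format!("{}, {}", sides, color)
}
```

A call then reads print_page(DoubleSided(true), ColorOutput(false)), which documents the meaning of each Boolean at the point of use.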
The type safety of Rust's enums continues with the match expression:
let msg = match code {
    HttpResultCode::Ok => "Ok",
    HttpResultCode::NotFound => "Not found",
    // forgot to deal with the all-important "I'm a teapot" code
};
error[E0004]: non-exhaustive patterns: `HttpResultCode::Teapot` not covered
  --> src/main.rs:44:21
   |
44 |     let msg = match code {
   |                     ^^^^ pattern `HttpResultCode::Teapot` not covered
   |
note: `HttpResultCode` defined here
  --> src/main.rs:10:5
   |
7  | enum HttpResultCode {
   |      --------------
...
10 |     Teapot = 418,
   |     ^^^^^^ not covered
   = note: the matched value is of type `HttpResultCode`
help: ensure that all possible cases are being handled by adding a match arm
      with a wildcard pattern or an explicit pattern as shown
   |
46 ~         HttpResultCode::NotFound => "Not found",
47 ~         HttpResultCode::Teapot => todo!(),
   |
The compiler forces the programmer to consider all of the possibilities that are represented by the enum,[3] even if the result is just to add a default arm _ => {}. (Note that modern C++ compilers can and do warn about missing switch arms for enums as well.)
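For completeness, here is a sketch of a version that satisfies the compiler by handling every variant explicitly. The wrapper function message and the text chosen for the Teapot arm are assumptions made for this illustration.

```rust
enum HttpResultCode {
    Ok = 200,
    NotFound = 404,
    Teapot = 418,
}

/// Hypothetical helper: map each result code to a human-readable message.
fn message(code: HttpResultCode) -> &'static str {
    // Every variant has an arm, so no wildcard `_` arm is needed. If a new
    // variant is added later, this match stops compiling until it is handled.
    match code {
        HttpResultCode::Ok => "Ok",
        HttpResultCode::NotFound => "Not found",
        HttpResultCode::Teapot => "I'm a teapot",
    }
}
```

Listing the variants explicitly, rather than using a wildcard arm, keeps the compiler's exhaustiveness check working when the enum grows.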
enums with Fields
The true power of Rust's enum feature comes from the fact that each variant can have data that comes along with it, making it an aggregate type that acts as an algebraic data type (ADT). This is less familiar to programmers of mainstream languages; in C/C++ terms, it's like a combination of an enum with a union, only type-safe.
This means that the invariants of the program's data structures can be encoded into Rust's type system; states that don't comply with those invariants won't even compile. A well-designed enum makes the creator's intent clear to humans as well as to the compiler:
use std::collections::{HashMap, HashSet};

pub enum SchedulerState {
    Inert,
    Pending(HashSet<Job>),
    Running(HashMap<CpuId, Vec<Job>>),
}
Just from the type definition, it's reasonable to guess that Jobs get queued up in the Pending state until the scheduler is fully active, at which point they're assigned to some per-CPU pool.
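To show how code consumes such a type, the sketch below fills in placeholder definitions for Job and CpuId (both are assumptions; the original leaves them undefined) and adds a hypothetical job_count function. Each match arm can only see the data that exists in that state.

```rust
use std::collections::{HashMap, HashSet};

// Placeholder types, assumed for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Job(pub u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct CpuId(pub u8);

pub enum SchedulerState {
    Inert,
    Pending(HashSet<Job>),
    Running(HashMap<CpuId, Vec<Job>>),
}

/// Hypothetical helper: count the jobs the scheduler currently holds.
pub fn job_count(state: &SchedulerState) -> usize {
    match state {
        // No data is attached to `Inert`, so there is nothing to count.
        SchedulerState::Inert => 0,
        // Only the `Pending` state has a set of queued jobs.
        SchedulerState::Pending(jobs) => jobs.len(),
        // Only the `Running` state has per-CPU pools.
        SchedulerState::Running(pools) => pools.values().map(Vec::len).sum::<usize>(),
    }
}
```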
This highlights the central theme of this Item, which is to use Rust's type system to express the concepts that are associated with the design of your software.
A dead giveaway for when this is not happening is a comment that explains when some field or parameter is valid:
pub struct DisplayProps {
    pub x: u32,
    pub y: u32,
    pub monochrome: bool,
    // `fg_color` must be (0, 0, 0) if `monochrome` is true.
    pub fg_color: RgbColor,
}
This is a prime candidate for replacement with an enum holding data:
pub enum Color {
    Monochrome,
    Foreground(RgbColor),
}

pub struct DisplayProps {
    pub x: u32,
    pub y: u32,
    pub color: Color,
}
This small example illustrates a key piece of advice: make invalid states inexpressible in your types. Types that support only valid combinations of values mean that whole classes of errors are rejected by the compiler, leading to smaller and safer code.
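As a sketch of the payoff, the code below assumes a simple RgbColor tuple struct (the original does not show its definition) and a hypothetical fg_color function. The rule that monochrome implies black now lives in one match arm rather than in a comment that callers must remember.

```rust
// Assumed definition for this sketch; the original does not show `RgbColor`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RgbColor(pub u8, pub u8, pub u8);

pub enum Color {
    Monochrome,
    Foreground(RgbColor),
}

pub struct DisplayProps {
    pub x: u32,
    pub y: u32,
    pub color: Color,
}

/// Hypothetical helper: the color to draw with. A monochrome display
/// cannot carry a stray foreground color, because the variant has no field.
pub fn fg_color(props: &DisplayProps) -> RgbColor {
    match props.color {
        Color::Monochrome => RgbColor(0, 0, 0),
        Color::Foreground(rgb) => rgb,
    }
}
```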
Ubiquitous enum Types
Returning to the power of the enum, there are two concepts that are so common that Rust's standard library includes built-in enum types to express them; these types are ubiquitous in Rust code.
Option<T>
The first concept is that of an Option: either there's a value of a particular type (Some(T)) or there isn't (None). Always use Option for values that can be absent; never fall back to using sentinel values (-1, nullptr, …) to try to express the same concept in-band.
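A minimal sketch of the difference, using a hypothetical find_index function: where a C-style API might return -1 for "not found", the Option return type makes the absent case impossible to mistake for a valid index.

```rust
/// Hypothetical helper: position of the first occurrence of `needle`,
/// or `None` if it does not occur (rather than a sentinel such as -1).
fn find_index(haystack: &[i32], needle: i32) -> Option<usize> {
    haystack.iter().position(|&item| item == needle)
}
```

A caller has to acknowledge the None case (for example, with match or if let) before it can use the index.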
There is one subtle point to consider, though. If you're dealing with a collection of things, you need to decide whether having zero things in the collection is the same as not having a collection. For most situations, the distinction doesn't arise and you can go ahead and use (say) Vec<Thing>: a count of zero things implies an absence of things.
However, there are rarer scenarios where the two cases need to be distinguished with Option<Vec<Thing>>; for example, a cryptographic system might need to distinguish between "payload transported separately" and "empty payload provided". (This is related to the debates around the NULL marker for columns in SQL.)
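The payload example could be sketched as follows; the Message type and describe_payload function are hypothetical, introduced only to show the three distinguishable cases.

```rust
/// Hypothetical message type: `None` means the payload travels separately,
/// while `Some(vec![])` means an empty payload was explicitly provided.
pub struct Message {
    pub payload: Option<Vec<u8>>,
}

/// Hypothetical helper: describe which of the cases applies.
pub fn describe_payload(msg: &Message) -> &'static str {
    match &msg.payload {
        None => "payload transported separately",
        Some(bytes) if bytes.is_empty() => "empty payload provided",
        Some(_) => "payload provided",
    }
}
```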
Similarly, what's the best choice for a String that might be absent? Does "" or None make more sense to indicate the absence of a value? Either way works, but Option<String> clearly communicates the possibility that this value may be absent.
Result<T, E>
The second common concept arises from error processing: if a function fails, how should that failure be reported? Historically, special sentinel values (e.g., -errno return values from Linux system calls) or global variables (errno for POSIX systems) were used. More recently, languages that support multiple or tuple return values from functions (such as Go) may have a convention of returning a (result, error) pair, assuming the existence of some suitable "zero" value for the result when the error is non-"zero".
In Rust, there's an enum for just this purpose: always encode the result of an operation that might fail as a Result<T, E>. The T type holds the successful result (in the Ok variant), and the E type holds error details (in the Err variant) on failure.
Using the standard type makes the intent of the design clear. It also allows the use of standard transformations (Item 3) and error processing (Item 4), which in turn makes it possible to streamline error processing with the ? operator as well.
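A small sketch of Result combined with the ? operator, using a hypothetical parse_and_add function built on the standard str::parse method. Each ? either unwraps the Ok value or returns the Err to the caller immediately.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse two decimal integers and add them.
/// The sum is computed as `i64` so that adding two `i32` values cannot overflow.
fn parse_and_add(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x: i32 = a.trim().parse()?; // returns early with `Err` on bad input
    let y: i32 = b.trim().parse()?;
    Ok(i64::from(x) + i64::from(y))
}
```

The success and failure cases are both visible in the signature, so a caller cannot use the sum without first dealing with the possibility of a parse error.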
[1] The situation gets muddier still if the filesystem is involved, since filenames on popular platforms are somewhere in between arbitrary bytes and UTF-8 sequences: see the std::ffi::OsString documentation.
[2] Technically, a Unicode scalar value rather than a code point.
[3] The need to consider all possibilities also means that adding a new variant to an existing enum in a library is a breaking change (Item 21): library clients will need to change their code to cope with the new variant. If an enum is really just a C-like list of related numerical values, this behavior can be avoided by marking it as a non_exhaustive enum; see Item 21.