diff --git a/preamble.tex b/preamble.tex index d74d278e2bc121f47b4684293c501019fe4a546a..2fbdace738944358fc7304a19e07ba6f9d996e2f 100644 --- a/preamble.tex +++ b/preamble.tex @@ -21,6 +21,8 @@ \setdefaultlanguage{english} \setotherlanguage{german} +\usepackage{subfiles} + %%% % Styling %%% @@ -51,7 +53,7 @@ ]{markdown} \usepackage{xcolor} -\usepackage{minted} +\usepackage[chapter]{minted} \setminted{autogobble, mathescape} @@ -87,9 +89,9 @@ morekeywords=[2]{bool,i8,i16,i32,i64,i128,isize,u8,u16,u32,u64,u128,usize,f32,f6 \lstset{style=color} % We use `~' as delimiter for the listing because braces don’t -% behave well if there are also braces in th source code. I hope +% behave well if there are also braces in the source code. I hope % that `~' is sufficiently rare in Rust to never encounter it. -\newcommand{{\inline}}[1]{{{\mintinline{rust}~#1~}}} +\newcommand{{\inline}}[2][rust]{{{\mintinline{#1}~#2~}}} % Caption formatting; I can’t get it to align the whole % caption to the left, only the paragraphs within the caption. diff --git a/thesis/1-1-1-ownership.tex b/thesis/1-1-1-ownership.tex new file mode 100644 index 0000000000000000000000000000000000000000..73dda184e7efeb188929f1a1b997df7e67fc36ba --- /dev/null +++ b/thesis/1-1-1-ownership.tex @@ -0,0 +1,183 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Ownership} + +In most programming languages, one can freely create and delete resources. For example, in C the programmer allocates memory on the heap with \inline[c]{malloc} and frees the memory when they don’t need the memory anymore using \inline[c]{free}. + +\begin{listing}[H] + \begin{minted}{c} + // Allocate space for an `int`. + int *ptr = (int*) malloc(sizeof(int)); + + // ... + + // Free the memory. + free(ptr); + \end{minted} + \caption{Memory management in C} +\end{listing} + +Even in languages without manual memory management, the programmer still must manage other resources. For example, to work with a file in Python one could write: + +\begin{listing}[H] + \begin{minted}{python} + # Open the file and store the file handle in a variable. + file = open("path/to/some.file", "r") + + # ... + + # Close the file + file.close() + \end{minted} + \caption{Manual resource management in Python} +\end{listing} + +\subsubsection{Resource Handling Strategies} + +This manual resource management comes with some difficulties. The programmer has to watch out to always free the resources after using them, because otherwise the program could leak memory or unnecessarily inhibit other processes from accessing the resources. On the other hand, one may also not free the resources too early, which would lead to a use after free, and also not free resources multiple times. This can be very tricky in complex programs where resources are shared between threads or over \acs{API} boundaries. + +That is why several programming languages have ways to make handling resources easier. Python has context managers that close resources automatically. + +\begin{listing}[H] + \begin{minted}{python} + with open("path/to/some.file", "r") as file: + #... + + # At this point `file` is closed. + \end{minted} + \caption{Context managers in Python} +\end{listing} + +This code is equivalent to the previous one, but even better: It also closes the file in case of an exception. Programmers can also define custom context managers for their own resources. + +Other languages have different strategies. Go supports the \inline{defer} statement which takes an operation that is not to be run immediately but at the end of the scope. + +\begin{listing}[H] + \begin{minted}{go} + { // Block starts here. + file, err := os.Open("path/to/some.file") + defer file.Close() + + // ... + + } // Block ends here; run the deferred close file + \end{minted} + \caption{The \inline[go]{defer} statement in Go} +\end{listing} + +These prevent double frees and use after free errors, but they can still be forgotten. A programmer who wants to use a resource must be aware that it has to be cleaned up +in the end and changes in one line of the code have to be mirrored in the other. It is not as bad as in manual resource management because the programmer has to change only one additional line and that line is usually close. + +\subsubsection{\acs{RAII} and Drop Responsibilities} + +A third variant is \acfi{RAII}, also called \acfi{SBRM}, used by C++ and Rust. This ensures that the destructor of an object is run when it goes out of scope, which will clean up automatically. An example is the \inline{std::unique_ptr} in C++. The pointer \emph{owns} the memory it points to, meaning when it is created, it automatically allocates memory on the heap and when it goes out of scope, it frees the memory. + +\begin{listing}[H] + \begin{minted}{cpp} + { // Block starts here. + std::unique_ptr<int> ptr = std::make_unique<int>(42); + + // ... + + } // Block ends here; run the destructor and free the memory + \end{minted} + \caption{\ac{RAII} in C++} +\end{listing} + +\ac{RAII} ensures that resources are always freed when they are not live (that is, may not be used) anymore. The programmer can’t forget to free the resources nor cause use after free or double free. While in C++ only certain types conform to \ac{RAII}, in Rust every value is owned by a variable. For example, the Rust analog to an owning pointer is \inline{Box}. + +\begin{listing}[H] + \begin{minted}{rust} + { // Block starts here. + let ptr = Box::new(42) + + // ... + + } // Block ends here; drop value and free memory + \end{minted} + \caption{Heap allocation in Rust} +\end{listing} + +As soon as a variable falls out of scope, it is \emph{dropped}, meaning all +associated resources are freed. It is said that a variable has a \emph{drop responsibility}. A programmer can define custom drop behavior for their own resources by implementing the \inline{Drop} trait for their type, but normally the compiler does it for us. + +\subsubsection{Move Semantics} + +\ac{RAII} comes with a few disadvantages, though. Since a value is dropped when its owner goes out of scope, there must always be exactly one owner for every value. This means that the compiler will transfer ownership sometimes, like in the following example. + +\begin{listing}[H] + \begin{minted}{rust} + fn print_value(ptr: Box<i32>) { + println!("{}", *ptr); + } + + fn main() { + let mut ptr = Box::new(42); + + print_value(ptr); + + *ptr += 1; + } + \end{minted} + \caption{Ownership transfer} + \label{lst:ownership-transfer} +\end{listing} + +We create a pointer and use it as an argument to a function that prints the pointee’s value. After that we increment the value. But if we try to compile this, we get an error: + +\begin{verbatim} +error[E0382]: use of moved value: `*ptr` +\end{verbatim} + +Because we supplied \inline{ptr} as an argument to \inline{print_value}, it took ownership of that value, and now has the drop responsibility. As soon as the call to \inline{print_value} is finished, the memory is freed and accessing it afterwards is a use after free. We say that the value was \emph{moved} out of \inline{ptr} and into \inline{print_value}. + +\subsubsection{Clone} + +We can still make it work by \emph{cloning} \inline{ptr} and supply the clone as an argument. \inline{ptr.clone()} creates a new \inline{Box}, allocates new memory, and initializes it with the same value as \inline{ptr}’s: + +\begin{listing}[H] + \begin{minted}{rust} + fn print_value(ptr: Box<i32>) { + println!("{}", *ptr); + } + + fn main() { + let ptr = Box::new(42); + + print_value(ptr.clone()); + + *ptr += 1; + } + \end{minted} + \caption{Usage of \inline{Clone}} +\end{listing} + +Now everything works but keep in mind that the clone is completely independent from the original value. If we changed the value of the clone in \inline{print_value}, it would not be visible to the outside. + +\subsubsection{Copy} + +We previously noted that ownership, and therefore move semantics, apply to every value in Rust. But if we change the code to use an \inline{i32} directly instead of putting it on the heap, we don’t need to clone anything: + +\begin{lstlisting}[language=Rust, caption={Number types are \inline{Copy}}] +fn print_value(number: i32) { + println!("{}", number); + } + +fn main() { + let mut number = 42; + + print_value(number); + + number += 1; +} +\end{lstlisting} + +This compiles and works. That is not because \inline{i32} breaks move semantics but because the type implements the \inline{Copy} trait. Normally, if a value is moved, the physical bits making up that value are moved to a new location, e. g. the new stack frame, and are not available at the previous position anymore. If a type implements \inline{Copy}, the bit pattern is instead copied over and retained. Because \inline{i32} doesn’t allocate any heap or handles any resources, such a copy is valid. A \inline{Box} is not \inline{Copy}, because then two owners for the same resource would exist which would violate drop responsibility. If we want to duplicate a \inline{Box}, we need to allocate new memory on the heap and initialize it properly, which is what \inline{clone} does. + +\subsubsection{ManuallyDrop} + +Sometimes we don’t want \ac{RAII} to happen, we want to free resources ourselves. If that is the case, we can wrap a value in a \inline{ManuallyDrop} which tells the compiler to not drop it for us. This is a general trend in Rust: The correct way should be the easiest to do, and all potentially unsafe constructs are opt-in. More information on \inline{ManuallyDrop} can be found in the documentation for the type.\footnote{\url{https://doc.rust-lang.org/stable/std/mem/struct.ManuallyDrop.html}} + +\end{document} \ No newline at end of file diff --git a/thesis/1-1-2-borrowing.tex b/thesis/1-1-2-borrowing.tex new file mode 100644 index 0000000000000000000000000000000000000000..f43ef7b318fff3bada18961228a20626957fb2a5 --- /dev/null +++ b/thesis/1-1-2-borrowing.tex @@ -0,0 +1,147 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Borrowing} + +As we have seen, ownership and move semantics ensure memory safety. But they are not easy to use, \inline{clone}ing data has a runtime overhead, and many correct programs are rejected. For example, \autoref{lst:ownership-transfer} describes a program that would work if it were not for move semantics. + +That is where references and borrowing come in. Instead of taking ownership of a value, a function can only borrow it through a reference. Then the drop responsibility stays with the caller. References, of course, can not be used for everything, but for our case it is sufficient. We mark the argument to \inline{print_value} as a reference using \inline{&}, and creating a reference from a value works the same. + +\begin{lstlisting}[language=Rust, caption={Borrow using references to prevent a move}, label={lst:borrow}] +fn print_value(ptr: &Box<i32>) { + println!("{}", *ptr); +} + +fn main() { + let mut ptr = Box::new(42); + + print_value(&ptr); + + *ptr += 1; +} +\end{lstlisting} + +\autoref{lst:borrow} compiles without an error and does what we would expect. There are two kinds of references. We just looked at \emph{shared} or \emph{immutable references}. The other kind is \emph{exclusive} or \emph{mutable references}, and they are denoted with \inline{&mut}. The different kinds of references have a different semantics attached to them. While both are used to access values without taking ownership, there are specific rules for the creation and guarantees associated with each: + +\begin{itemize} + \item A shared reference to a value can always be created, as long as there is no exclusive reference to the same value. It is not possible to mutate the pointee through a shared reference (which is why it is also called immutable). + \item An exclusive reference can only be created if there is no other reference to the value at all and the referenced value is declared mutable. An exclusive reference can do everything the owner can, except dropping. +\end{itemize} + +The part of the compiler that enforces these rules is the \emph{borrow checker}. The reasoning behind this is again resource safety, namely preventing \emph{unguarded mutable aliasing}. Having multiple readers of the same data doesn’t cause issues, as long as the data cannot be mutated. Mutating data is fine, as long as no one else can read and/or mutate the data at the same time. Using these two kinds of references enforces, at compile time, that we will always stay on the happy path. But there are some programs that are correct even though they violate these rules. We will be concerned with extending the borrow checker to accept more correct programs in later sections. + +\subsubsection{The Owner of a Borrowed Value} + +References have to agree to these rules, but the owner has to as well. While shared references exist, the owner may not mutate, e. g. \inline{let mut x = 42; let xref = &x; x += 1} is forbidden. Similarly, the owner can’t access a value at all as long as there is an exclusive reference around. + +\todo[inline]{ + \textbf{This doesn’t not work, I have to think about this some more. I want something to throw Error E0507.} + + In particular, a value may not be moved, as long as there is a reference to it because that would create a dangling pointer. This means that \inline{let y = &x; let z = *y;} doesn’t work. + + %\begin{lstlisting}[language=Rust, caption={Moving out of borrowed value}] + % let x = 42; + % let xref = &x; + % let y = x; + %\end{lstlisting} + +} + +\subsubsection{Lexical Lifetimes and Lifetime Analysis} + +The parts of the program in which a reference is valid is called its \emph{lifetime}. The borrow checker keeps track of all references’ lifetimes and their so-called \emph{provenances}, that is, who the original owner of the referenced value is. References are values like all others and so they are dropped at the end of the scope they are defined in. In other words: Their lifetime starts at their creation and ends at the end of the scope. This is called a \emph{lexical lifetime}. Therefore the code in \autoref{lst:lexical-lifetimes} does not compile. + +\begin{lstlisting}[language=Rust, caption={Lexical Lifetimes prevent usage}, label={lst:lexical-lifetimes}] +let mut x = 0; +{ + let xref = &x; // `xref` is created + println!("{xref}"); + x += 1; // error +} // `xref` is dropped +x += 1; // OK, no references exist anymore +\end{lstlisting} + +The program in \autoref{lst:lexical-lifetimes} is rejected by our borrow checker since it registers a mutation to \inline{x} while a reference to it still exists. Only after the block ends, the borrow is returned to the owner and it can be used again. Similarly, we would not be able to create an exclusive reference while a shared one exists and vice versa. + +But we can see that the program is correct since \inline{xref} is never accessed after \inline{x} is changed and we can save the code by introducing an additional scope. + +\begin{lstlisting}[language=Rust, caption={More scopes make the code check}, label={lst:saving-lexical-lifetimes}] +let mut x = 0; +{ + { + let xref = &x; // `xref` is created + println!("{xref}"); + } // `xref` is dropped + x += 1; +} +x += 1; +\end{lstlisting} + +This is tedious, though, and we would like the borrow checker to recognize patterns like this automatically. Luckily, Rust implements an improved borrow checker since version 1.31. Because it is not based on lexical scope for lifetime analysis anymore, these new lifetimes are called \acfi{NLL}. We will take a closer look at them in \autoref{subsec:non-lexical-lifetimes}. + +\subsubsection{Interior Mutability and Access Guards} + +The borrow rules are enforced at compile time, but sometimes the compiler can’t know if a piece of code follows the rules. Because of this, there is a runtime-checked version of reference with \inline{RefCell} which causes a panic if the borrow rules are broken. Also, sometimes we need access to a resource from multiple places at the same time, for example when sharing data between threads. For this, Rust provides the \inline{Mutex} container type. References to a mutex can be shared freely, but to change the value in the container, one has to acquire a lock, therefore making the access \emph{guarded}. While the mutex is locked, no other thread can access the data at all, as if the thread held a \inline{&mut}. Alternatively, \inline{RwLock} can give out both read and write locks which have behavior analogous to \inline{&} and \inline{&mut}, respectively. There are more constructs for similar use cases in the standard library, like \inline{Arc} and \inline{Cow}. + +These types are implemented using the \inline{Cell} type which can change a value even if it is not marked as mutable. This is called \emph{interior mutability} and uses \emph{unsafe Rust} internally, a superset of standard, safe Rust. The main feature of unsafe Rust is access to so-called \emph{raw pointers}. Raw pointers are, like pointers in C, not borrow checked by the compiler. Since borrow checking is undecidable, the compiler sometimes can’t prove that, for example, two references don’t alias. In these cases, the programmer can step in, prove the non-aliasing manually, implement a feature using raw pointers and provide a safe abstraction for consumers of the code.\todo{Pointers to Strict Provenance, Miri, …?} + +\subsubsection{Returning references and Borrow-through} + +Functions can receive references as arguments, but they can also return references. One has to be a bit careful when doing this, though, since all resources created in the scope of a function are freed as soon as the function returns. Consider the following: + +\begin{lstlisting}[language=Rust, caption={Try to return a reference}] +fn to_ref(number: i32) -> &i32 { + &number +} +\end{lstlisting} + +This fails because \inline{number} is owned by \inline{to_ref} and is dropped as soon as the function returns. The reference would already be invalid when the caller gets access to it. But if the function takes a reference as an argument, it can pass another reference back to the caller. This is called a \emph{borrow-through} and looks like the following: + +\begin{lstlisting}[language=Rust, caption={Borrow-through}] +fn to_ref(number: &i32) -> &i32 { + &number +} +\end{lstlisting} + +Now Rust can couple both references to each other. The returned reference has the same provenance as the input one, and the borrow checker can check that the input reference’s lifetime is at least as long as the returned one. + +\subsubsection{Lifetime Polymorphism} + +Consider a function that takes two references and returns one reference itself. Imagine for example, we have two slices and want to create an iterator over all elements of both slices. Then we could try to write a function like in \autoref{lst:two-refs}. The \inline{impl Iterator<Item=&T>} is a so-called existential type. It just means that we don’t care about what type exactly it is as long as it is an iterator. + +\begin{lstlisting}[language=Rust, caption={Iterator over two slices}, label={lst:two-refs}] +fn both(first: &[i32], second: &[i32]) + -> impl Iterator<Item=&i32> +{ + first.iter().chain(second.iter()) +} +\end{lstlisting} + +This does not work however. + +\begin{lstlisting} +error[E0106]: missing lifetime specifier +\end{lstlisting} + +The compiler is confused because it does not know which provenance and lifetime to assign to the returned references. We as programmers can see that the return value depends on both input references, so we can help Rust by providing lifetime hints.\footnote{Actually, the compiler could infer these lifetimes but it would rely on global program analysis to do so. It is also possible to infer function types, but Rust chose to always be explicit about the types and lifetimes you use and have them be explicit parts of the API of a function.} + +\begin{lstlisting}[language=Rust, caption={Iterator over two slices}, label={lst:lifetime-parameters}] +fn both<'a>(first: &'a [i32], second: &'a [i32]) + -> impl Iterator<Item=&'a i32> +{ + first.iter().chain(second.iter()) +} +\end{lstlisting} + +\inline{'a} is a \emph{polymorphic lifetime parameter}. We tell Rust that the output relies on both inputs being valid for at least as long as itself. If the output relied on only one of the arguments, we would mark only one of them with a lifetime specifier. This would for example be the case if we want to search for an element in a slice and, if it exists, return a mutable reference to that element. We implement such a function in \autoref{lst:two-refs-one-dependent}. + +\begin{lstlisting}[language=Rust, caption={The output lifetime depends only on one of the inputs}, label={lst:two-refs-one-dependent}] +fn find_mut<'a>(haystack: &'a mut [i32], needle: &i32) + -> Option<&'a mut i32> +{ + haystack.iter_mut().find(|x| **x == *needle) +} +\end{lstlisting} + +\end{document} \ No newline at end of file diff --git a/thesis/1-1-3-reference-conversion.tex b/thesis/1-1-3-reference-conversion.tex new file mode 100644 index 0000000000000000000000000000000000000000..0345cc4c9c6d70b74beb942af1b700c7c467b34b --- /dev/null +++ b/thesis/1-1-3-reference-conversion.tex @@ -0,0 +1,45 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Reference Conversions} + +Sometimes, we have a reference to one type but need a reference to another, similar type. For example, a vector of some type \inline{Vec<T>} is conceptually the same as a slice of that same type \inline{[T]}, except that a \inline{Vec} can grow and shrink and a slice can not. This means that all functions which operate on slices should also work with vectors, and in fact there is a library method \inline{Vec::as_slice} that takes a reference to a vector and provides a reference to a slice. Similarly there is \inline{String::as_str} which transforms a \inline{&String} into a \inline{&str}. + +\subsubsection{AsRef} + +Generally, there are many types that can act as a substitute for another. The common interface for this behavior is the \inline{AsRef} trait. There can be many implementations of this trait for a given type. for example, \inline{String} can stand in for \inline{str}, \inline{[u8]}, \inline{OsStr} and \inline{Path}. Every time a reference to one of these types are needed, we can use a \inline{String} instead, if we first call \inline{as_ref} on it: + +\begin{lstlisting}[language=Rust, caption={Use \inline{String} and \inline{as_ref} in place of \inline{[u8]}}, label={lst:as_ref_bytes}] +fn needs_bytes(x: &[u8]) { /*...*/ } + +// ... + +let s = String::from("Hello Bytes"); +needs_bytes(s.as_ref()); +\end{lstlisting} + +\subsubsection{Deref} + +Always calling \inline{as_ref} is a bit cumbersome, especially if it is clear which type the target should be. For example, if we have a \inline{Box<T>} there is really only one reasonable type we want to get out of it, namely a \inline{&T}. The trait \inline{Deref} can be used for that. It specifies a single type which Rust automatically converts into if it is needed. For example, \inline{String} has \inline{str} as its \inline{Deref} target which allows us to omit the \inline{as_ref} in that case and leave the conversion implicit: + +\begin{lstlisting}[language=Rust, caption={Use \inline{String} in place of \inline{[u8]}}] +fn needs_str(x: &str) { /*...*/ } + +// ... + +let s = String::from("Hello Bytes"); +needs_bytes(s); +\end{lstlisting} + +We couldn’t do the same in \autoref{lst:as_ref_bytes} because \inline{[u8]} is not \inline{Deref} target of \inline{String}. + +\subsubsection{Borrow} + +Sometimes we want to express an even stronger connection between two types. For example, a \inline{HashMap} needs to take ownership of its entries but we want to be able to do a lookup even if we only have a reference. But that requires that the owned and the borrowed value behave exactly the same. This is what the \inline{Borrow} trait signals. In particular, \inline{x.borrow() == y.borrow()} if and only if \inline{x == y} and \inline{x.hash() == x.borrow().hash()}. + +\subsubsection{Mutable reference conversions} + +All three of the aforementioned traits work for shared references only, but their variants \inline{AsMut}, \inline{DerefMut} and \inline{BorrowMut} all take and provide an exclusive reference. + +\end{document} \ No newline at end of file diff --git a/thesis/1-1-4-non-lexical-lifetimes.tex b/thesis/1-1-4-non-lexical-lifetimes.tex new file mode 100644 index 0000000000000000000000000000000000000000..597011ec9472d5281ebc03e0440146e373e96db4 --- /dev/null +++ b/thesis/1-1-4-non-lexical-lifetimes.tex @@ -0,0 +1,75 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Non-Lexical Lifetimes}\label{subsec:non-lexical-lifetimes} + +In \autoref{lst:lexical-lifetimes} we saw how lexical lifetimes stood in our way and we had to manually wrap a reference in an additional block just to make our correct program pass the borrow checker. This was because the lifetime of any value is bound to its scope, at the end of which it is dropped. \acl{NLL} are a different approach to lifetime analysis. A reference is live for as long as it may be used later. The borrow checker determines the points of the program in which the reference is live by building a \acfi{CFG}. All nodes of the \ac{CFG} that can be reached from the point of creation until the last use of the reference are where it is live. Consider the program in \autoref{lst:non-lexical-lifetimes}. + +\begin{lstlisting}[language=Rust, caption={Simple \ac{NLL} example}, label={lst:non-lexical-lifetimes}] +let mut x = 0; +let xref = &x; +println!("{xref}"); +x += 1; +\end{lstlisting} + +Then the (simplified) \ac{CFG} looks something like \autoref{fig:cfg-non-lexical-lifetimes}. It is quite boring because the control flow is linear. The nodes with thick borders are where \inline{xref} is live. The first node is the one in which the reference is created and the second one is where it is last used. Because \inline{x} is not modified in that section of the \ac{CFG}, the borrow checker doesn’t complain. + +\begin{figure}[!ht] + \centering + \begin{tikzpicture} + \begin{scope}[every node/.style={draw, rectangle}] + \node (A) at (0,4.5) {\inline{let mut x = 0;}}; + \node[ultra thick] (B) at (0,3) {\inline{let xref = &x;}}; + \node[ultra thick] (C) at (0,1.5) {\inline{println!("\{xref\}");}}; + \node (D) at (0,0) {\inline{x += 1;}}; + \end{scope} + \path [->] (A) edge node[left] {} (B); + \path [->, ultra thick] (B) edge node[left] {} (C); + \path [->] (C) edge node[left] {} (D); + \end{tikzpicture} + \caption{Control-flow diagram for \autoref{lst:non-lexical-lifetimes}} + \label{fig:cfg-non-lexical-lifetimes} +\end{figure} + +Consider a more complicated example that includes branching in \autoref{lst:non-lexical-lifetimes-branch}. + +\begin{lstlisting}[language=Rust, caption={\ac{NLL} example with braching}, label={lst:non-lexical-lifetimes-branch}] +let mut x = 0; +let xref = &x; +if xref != &0 { + x += 1; + println!("xref equals {xref}"); +} else { + println!("xref is 0"); +} +x = 42; +\end{lstlisting} + +Now the \ac{CFG} (\autoref{fig:cfg-non-lexical-lifetimes-branch}) splits up to accommodate both possible paths the program could take during execution. At the \inline{if}, the graph splits into two, but in the last node, both paths join up again. Here we can see that in the \inline{false} branch no problem occurs, but in the \inline{true} branch we try to modify \inline{x} even though \inline{xref} is used later in the same path. Modifying \inline{x} on the last line does not pose any problems since xref is not live anymore on any path that leads to this point. + +\begin{figure}[!ht] + \centering + \begin{tikzpicture} + \begin{scope}[every node/.style={draw, rectangle}] + \node (A) at (0,7.5) {\inline{let mut x = 0;}}; + \node[ultra thick] (B) at (0,6) {\inline{let xref = &x;}}; + \node[ultra thick] (C) at (0,4.5) {\inline{if xref != &0 }}; + \node[ultra thick, draw=red] (D) at (-3,3) {\inline{x += 1;}}; + \node[ultra thick] (E) at (-3,1.5) {\inline{println!("xref equals \{xref\}");}}; + \node (G)[ultra thick] at (3,2.5) {\inline{println!("xref is 0");}}; + \node (H) at (0,0) {\inline{x = 42;}}; + \end{scope} + \path [->] (A) edge node[left] {} (B); + \path [->, ultra thick] (B) edge node[left] {} (C); + \path [->, ultra thick] (C) edge node[left] {\inline{true}} (D); + \path [->, ultra thick] (D) edge node[left] {} (E); + \path [->, ultra thick] (C) edge node[right] {\inline{false}} (G); + \path [->] (G) edge node[left] {} (H); + \path [->] (E) edge node[left] {} (H); + \end{tikzpicture} + \caption{Control-flow diagram for \autoref{lst:non-lexical-lifetimes-branch}} + \label{fig:cfg-non-lexical-lifetimes-branch} +\end{figure} + +\end{document} \ No newline at end of file diff --git a/thesis/1-1-5-reborrows.tex b/thesis/1-1-5-reborrows.tex new file mode 100644 index 0000000000000000000000000000000000000000..974aff481c6a32f1eee2834328efc5c3f2dab098 --- /dev/null +++ b/thesis/1-1-5-reborrows.tex @@ -0,0 +1,80 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Reborrows and Two-Phase Borrowing} + +\ac{NLL} give us a lot more freedom when using references, but there are still programs that are clearly correct but don’t pass the borrow checker, especially if exclusive references are in play. It is not possible to create a new reference of some provenance as long as an exclusive reference with the same provenance is live. This means the code in \autoref{lst:conflicting-borrows} will not compile. + +\begin{lstlisting}[language=Rust, caption={Cannot create a shared reference while an exclusive one exists}, label={lst:conflicting-borrows}] +let mut x = 1; +let ref1 = &mut x; +let ref2 = &x; // Error: Cannot borrow `x` +println!("{ref2}"); +*ref1 = 2; +\end{lstlisting} + +But since \inline{xs} is not live anymore when \inline{xu} is used, aliasing is never occurring, and if we hadn’t used an exclusive reference but modified the value directly, Rust would have been able to see this. Not being able to create references in this manner is a big thing. Consider the code in \autoref{lst:moved-borrow}. Here, we create a reference to an empty \inline{Vec} and then use it to \inline{push} a value. This moves the reference into the \inline{push} method so that it is dropped when the method returns. It is no longer valid to use it again in the next line since it has already been dropped. But in fact this code compiles because Rust implicitly inserts a \emph{reborrow} for us. + +\begin{lstlisting}[language=Rust, caption={Move a borrow into a function}, label={lst:moved-borrow}] +let mut v = Vec::new(); +let vref = &mut v; +vref.push(1); +vref.last(); // Error: Use of moved value `vref` +\end{lstlisting} + +\subsubsection{Reborrows} + +Let’s get back to the example in \autoref{lst:conflicting-borrows} for now. The compiler complains because we try to create two conflicting references with the same provenance. But we can tell the compiler to temporarily deactivate a reference by borrowing \emph{through this reference}. This is done in \autoref{lst:simple-reborrow}. + +\begin{lstlisting}[language=Rust, caption={Reborrow through an exclusive reference}, label={lst:simple-reborrow}] +let mut x = 1; +let ref1 = &mut x; +let ref2 = & *ref1; // Reborrow +println!("{ref2}"); +*ref1 = 2; +\end{lstlisting} + +Now we can create and use the \inline{ref2} as long as it is not live anymore when we use \inline{ref1} again. If the last two lines were swapped, the borrow checker would correctly find the interleaved use and throw an error. It is possible to nest reborrows, as long as they are never used interleaved. These reborrows can be imagined as a stack: when creating a reborrow, the new reference is pushed to the stack, and when using a reference, the stack is popped until it is on the topmost position. Trying to access a reference that has already been popped indicates an interleaved use and is a borrow error. The owner of the value can then be thought of as the bottom element of the stack, and when it goes out of scope, the whole stack is unwound so that the \inline{drop} function can access it. In fact, this is exactly how reborrows are modelled in Miri, which is an interpreter for Rusts \acfi{MIR}. \todo{Source, e. g. Rustonomicon} Miri even has borrow stacks for raw pointers. + +But why does the code in \autoref{lst:moved-borrow} work? The compiler can insert borrows and reborrows in function calls, so-called \emph{autorefs}. This is because it can be sure that the function returns before the reference is accessed again, meaning that the borrow stack is kept intact. In fact, if the compiler can’t proof that the borrow stack is valid, it will complain. This can for example happen when using \inline{std::thread::spawn} which creates a new thread. Because the child thread could outlive the main thread, Rust can’t verify that all references are used in the correct order, in particular that the owner is not dropped too early. + +\subsubsection{Two-Phase Borrowing} + +Reborrows won’t solve every problem, though. Consider the code in \autoref{lst:two-phase-borrow}. Here we create a vector and then push its length onto it. + +\begin{lstlisting}[language=Rust, caption={Borrow twice in one method call}, label={lst:two-phase-borrow}] +let mut v = Vec::new(); +v.push(v.len()); +\end{lstlisting} + +The problem lies in the second line. The first argument to \inline{push} is an \inline{&mut self}, and so the compiler implicitly borrows from \inline{v}. But then another (shared) reference is created for the call to \inline{len}. This is not allowed, and a reborrow doesn’t help either since the \inline{&mut self} is currently used. On the other hand, it is clear that the call to \inline{len} will definitely return before the exclusive reference is ever accessed. In fact, we can work around this problem if we save the result of \inline{v.len()} in a temporary variable, as shown in \autoref{lst:two-phase-borrow-workaround}. + +\begin{lstlisting}[language=Rust, caption={Workaround for \autoref{lst:two-phase-borrow}}, label={lst:two-phase-borrow-workaround}] +let mut v = Vec::new(); +let vlen = v.len(); +v.push(vlen); +\end{lstlisting} + +This pattern—calling a method with an exclusive reference but also using shared references in the arguments—is very common and the workaround is unwieldy. Therefore \acs{RFC} 2025 \cite{rfc2025nestedmethodcalls} introduces the concept of a \ac{TPB}. In it, the lifetime of an exclusive reference is split up into two phases. During the \emph{reservation phase}, it is treated like a shared reference, meaning more shared references can be created and reads are possible. The \emph{activation} happens as soon as the reference is first used to perform a mutation. From this point on, it is treated as a full exclusive reference. + +\subsubsection{Generalized \ac{TPB}} + +Right now, a \ac{TPB} may only happen in a few specific places, namely when calling a method with \inline{&mut self} as first argument and with assignment operators (\inline{+=}, \inline{*=}, …), and the other references can only be shared. There are several ways in which the requirements can be relaxed and which are discussed in the RFC and Issue \#49434\footnote{\url{https://github.com/rust-lang/rust/issues/49434}}. + +Currently, the only place an exclusive reference is allowed is in \inline{&mut self} position. This means that it is not possible to use an exclusive reference twice, and a reborrow is also not possible because the references have no names. One could relax this and allow exclusive references everywhere, and also mix shared and exclusive references. This poses the difficulty that a shared reference (or one during reservation) can observe change. Resolving this makes the model more complicated. + +In a similar vein, \ac{TPB} could be expanded to all function calls or even in inline code, as shown in \autoref{lst:inline-tpb}. Here, creating a shared reference is not allowed because an exclusive one already exists and is live. With generalized \ac{TPB} it would be okay, though, since \inline{xm} is only in reservation phase at that point. + +\begin{lstlisting}[language=Rust, caption={Example of inline \ac{TPB}}, label={lst:inline-tpb}] +let mut x = 1; +let xm = &mut x; +let xr = &x; // Error +*xm = 2; +\end{lstlisting} + +Even further, \emph{discontinuous borrows} would make it possible that a borrow is only active at exactly the points at which it is used and deactivated in between. Also, if the mutation capabilities of the reference is not used anymore, it could be \emph{downgraded} to a shared reference: For example, a function could take an exclusive reference to a value but then returns only a shared one. + +All of these changes would entail a big step up in complexity and surface area of the language without actually having much practical use since relevant situations are rare. They are discussed in detail in \cite{rfc2025nestedmethodcalls}. + +\end{document} \ No newline at end of file diff --git a/thesis/1-1-6-loans-and-regions.tex b/thesis/1-1-6-loans-and-regions.tex new file mode 100644 index 0000000000000000000000000000000000000000..475e7f5ca4d961a31e79520c7a9caabea913ee8c --- /dev/null +++ b/thesis/1-1-6-loans-and-regions.tex @@ -0,0 +1,126 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Loans and Regions} + +Previously, we looked at the advantages of \acl{NLL} over standard lexical ones. But lifetimes have a general defect and are supposed to be replaced by an approach using \emph{regions} employed by the \emph{Polonius borrow checker}, which is currently in an experimental stage. Here, we will explore how \ac{NLL} and regions work under the hood, and what the advantage of using regions is. + +% \begin{lstlisting}[language=Rust, caption={Simple \ac{NLL} example}, label={lst:non-lexical-lifetimes}] +% let mut x = 0; +% let xref = &x; +% println!("{xref}"); +% x += 1; +% \end{lstlisting} + +\subsubsection{Borrow Errors} + +We need to define some vocabulary first: A \emph{path} is an identifier like \inline{x}, or is built from a path by a field access \inline{x.f}, a dereference \inline{*x}, or an index \inline{x[i]}. Those can be freely combined, so that for example \inline{(*x)[5].f} is a valid path.\footnote{Paths have a rough equivalent in C and C++ lvalues.} + +A \emph{loan} is the result of a borrow expression. It consists of the path which is borrowed from and a \emph{mode}, that is shared or exclusive. + +A loan $L$ is \emph{violated} if its path $P$ is accessed in an incompatible way, that is, $P$ is mutated when the $L$ is shared, or $P$ is accessed at all when $L$ is exclusive. Note that an access can also be \emph{indirect} if $P$ is shows up somewhere in the expression. For example, \inline{(*x)[5].f} accesses \inline{(*x)[5]}, \inline{*x}, and \inline{x} indirectly. Note also that a loan to an index ignores the index variable, that is \inline{x[5]} and \inline{x[4]} produce the same loan to \inline{x[_]}. This is because Rust can generally not know at compile time if two index operations alias. It means that it is impossible to have two exclusive references to different parts of a data structure like in \autoref{lst:two-indices}. + +\begin{lstlisting}[language=Rust, caption={Indexing twice into a vector is illlegal}, label={lst:two-indices}] +let mut v = vec![1, 2]; +two_refs(&mut v[0], &mut v[1]); +\end{lstlisting} + +Now we can define when a borrow error should occur. There are three conditions which all have to be met: + +\begin{enumerate} + \item A path $P$ is accessed at some node $N$ of the \ac{CFG}, + \item accessing $P$ at $N$ would violate some loan $L$, and + \item $L$ is live at $N$. +\end{enumerate} + +Different approaches to borrow checking only differ in determining when $L$ is live. For example, with lexical lifetimes a loan is simply live from its creation until the end of the lexical scope. We are now prepared to dive into liveness analysis in \ac{NLL} and Polonius. + +\subsubsection{Classic \ac{NLL}} + +Under \ac{NLL} the liveness of a loan is derived from the lifetime of a reference. As discussed in \autoref{subsec:non-lexical-lifetimes}, a reference is live in a node of the \ac{CFG} if it may be used later. This means that we walk forward along the \ac{CFG} to determine the liveness of the reference and the corresponding loan is live exactly when the reference is. Crucially, if a function returns a reference, it is live for the whole body of the function. \todo{lifetime subtyping and inferred lifetimes/constraints, “outlive†relationship, \inline{'a : 'b}} + +\begin{lstlisting}[language=Rust, caption={\ac{NLL} reject correct program}, label={lst:nll-reject-correct}] +fn first_or_insert(v: &mut Vec<i32>) -> &i32 { + let fst = v.first(); + match fst { + Some(x) => x, + None => {v.push(1); &v[0]}, + } +} +\end{lstlisting} + +\autoref{lst:nll-reject-correct} shows an example. In the \inline{Some} branch, \inline{x} is returned from the function, which in turn depends on the reference produced by \inline{fst}. Because \inline{x} is returned from the function, it needs to be live at least until the end of the body of \inline{first_or_insert}. But since \inline{x} is derived from \inline{fst}, that reference must outlive \inline{x}, hence being live for the whole function body as well. In the \inline{None} branch, an exclusive reference to \inline{v} is created for the call to \inline{push}. This produces an error because that node lies on a path between the creation of \inline{fst} and the return point.\todo{Do the listing again with lifetime annotations?} + +This should not happen. We can see that \inline{fst} is not actually used when we go through the \inline{None} arm because a different reference is returned in that case. However, \ac{NLL} can’t accommodate this situation because \inline{fst} \emph{may} be used later, see \autoref{fig:cfg-nll-reject-correct}. A similar problem for map data structures has led to the \inline{Entry} \ac{API} which uses unsafe Rust to work around this limitation. + +\begin{figure}[!ht] + \centering + \begin{tikzpicture} + \begin{scope}[every node/.style={draw, rectangle}] + \node[ultra thick] (Z) at (0,4) {\inline{let fst = v.first();}}; + \node[ultra thick] (A) at (0,3) {\inline{match fst}}; + \node[ultra thick] (B) at (-1,1.5) {\inline{x}}; + \node[ultra thick, draw=red] (C) at (1,2) {\inline{v.push(1);}}; + \node[ultra thick] (D) at (1,1) {\inline{&v[0]}}; + \node[ultra thick] (E) at (0,0) {\inline{return}}; + \end{scope} + \path [->, ultra thick] (Z) edge node[left] {} (A); + \path [->, ultra thick] (A) edge node[left] {\inline{Some}} (B); + \path [->, ultra thick] (A) edge node[right] {\inline{None}} (C); + \path [->, ultra thick] (C) edge node[left] {} (D); + \path [->, ultra thick] (B) edge node[left] {} (E); + \path [->, ultra thick] (D) edge node[left] {} (E); + \end{tikzpicture} + \caption{Control-flow diagram for \autoref{lst:nll-reject-correct}.} + \label{fig:cfg-nll-reject-correct} +\end{figure} + +\subsubsection{Polonius} + +With lifetimes, we looked \emph{forwards} from the borrow expression to see how long a reference (and therefore the loan) is live. Polonius goes \emph{backwards} from a point of use to see if there is still a live reference. Polonius doesn’t use lifetimes but \emph{regions} to determine the liveness of a loan. + +A \emph{region}\footnote{Polonius calls regions \emph{origins}, which is a more telling name, but regions is the more standard term.} is a set of loans. Each reference has a region consisting of all loans it may depend on. A fresh reference created from an owned value has a region consisting of just one loan, but a reference, e. g. if returned by a function, could depend on several inputs and therefore have a region of several loans. Note that in this step we don’t care about how long a reference is valid, we don’t go forward in the \ac{CFG}. Instead, we only consider previous nodes to determine regions. + +Now, a loan $L$ is live at some node $N$, if there is some variable which is live at $N$ and contains $L$ in its region. This difference means that different paths through the \ac{CFG} are independent from each other, because a node in one path can’t see a node in the other one by walking back the \ac{CFG}. + +Let’s look at the example from \autoref{lst:nll-reject-correct} with all regions made explicit. \lstinline~x'{L1, L2}~ denotes that expression \inline{x} has a region consisting of the two loans \inline{L1} and \inline{L2}. + +\begin{lstlisting}[language=Rust, caption={Example with region annotations}, label={lst:regions}] +fn first_or_insert(v'{L0}: &mut Vec<i32>) + -> &'{L0, L1, L3} i32 +{ + let fst'{L0, L1} = v.first()'{L0, L1}; + match fst { + Some(x'{L0, L1}) => x, + None => {v.push(1)'{L0, L2}; &v[0]'{L0, L3}}, + } +} +\end{lstlisting} + +In \autoref{lst:regions} we can see that there are four relevant loans: \inline{L0} is the loan of the reference we got passed in. All references depend on it. \inline{first} creates a reference with loan \inline{L1} that is returned in the \inline{Some} branch, \inline{push} implicitly reborrows to push a value onto \inline{*v}. The final reference is created by the index operation and it may also be returned. Therefore, the return value has a region \lstinline~'{L0, L1, L2}~, because those three loans are what it may depend on. + +Under \ac{NLL}, the \inline{push} was not possible because \inline{x} being live and depending on \inline{fst} meant that \inline{fst} was live. With Polonius, we must check if there is any live variable that has a nonempty intersection with \lstinline~'{L0, L2}~. \inline{fst} and \inline{x} are not live, so they don’t pose a problem, even if the regions overlap. \inline{v} is live and there is a region overlap, but since the compiler inserts a reborrow, it is not a problem. There could still be an error if the borrow stack were invalidated at a later point, but since Polonius is only looking backwards, this is not something we have to consider here. There are no more live variables, so the node passes the borrow check. + +\subsubsection{Self-referential Structs} + +Sometimes we want to have structures that contain references to parts of itself.\footnote{A real-world example of this are futures which must store the point of execution they are in. Currently they are a special case in the compiler and can’t be expressed in user-code.} Consider the struct in \autoref{lst:self-referential}, in which we store some data and additionally provide a view into a part of the data. The \inline{window} field contains a reference and it must be tied to soe value on the stack, but it is impossible to assign a lifetime to it. Putting a \inline{'data} is not valid Rust at the moment. + +With regions, however, this could work. Polonius could check that when creating a \inline{View}, the reference supplied to the constructor is pointing to the \inline{data} argument. + +\begin{lstlisting}[language=Rust, caption={Self-referential struct}, label={lst:self-referential}] +struct View<T> { + data: Vec<T>, + window: &'data [T] +} +\end{lstlisting} + +\subsubsection{Polonius the Crab} + +Since Polonius is still unstable and will be so for the forseeable future, there is a library\footnote{\url{https://docs.rs/polonius-the-crab/0.3.0/polonius_the_crab/}} defining a macro which implements the Polonius logic in stable Rust, using some unsafe code. + +\subsubsection{Cyclone} + +\emph{Cyclone} \cite{grossman2002region} is a dialect of C that introduces additional safe pointer types which are checked using regions. Instead of Polonius’ abstract regions, the regions in Cyclone represent concrete parts of memory. + +\end{document} \ No newline at end of file diff --git a/thesis/1-2-1-contract-programming.tex b/thesis/1-2-1-contract-programming.tex new file mode 100644 index 0000000000000000000000000000000000000000..e80ab5b4fab0d504a2ee484dfaf37d057f24ebac --- /dev/null +++ b/thesis/1-2-1-contract-programming.tex @@ -0,0 +1,17 @@ +\subsection{Contract Programming} + +\todo[inline]{Example: Assertions} + +Programming languages like Eiffel and D have native support for contract programming. There is also a library\footnote{Found under \url{https://gitlab.com/karroffel/contracts}} that brings design by contract to Rust. Contract programming enables three tools: + +\begin{enumerate} + \item \emph{Preconditions} are checked at the beginning of some piece of code, + \item \emph{Postconditions} are checked after some piece of code has run, and + \item \emph{Invariants} must always hold for some piece of data. +\end{enumerate} + +\todo[inline]{Example bank transfer: A bank account may never have a balance below lino of credit, a transfer precondition is that there is enough money in the first account, a postcondition is that the balancs are updated. Use Eiffel, D, or Rust/contracts?} + +\todo[inline]{loop invariants/conditions} + +\todo[inline]{Higher-order contracts: \href{https://www2.ccs.neu.edu/racket/pubs/icfp2002-ff.pdf}{Findler, Felleisen}} \ No newline at end of file diff --git a/thesis/1-2-2-refinement-types.tex b/thesis/1-2-2-refinement-types.tex new file mode 100644 index 0000000000000000000000000000000000000000..75847f268cb4acc35dd2d186bca525bd1b20386a --- /dev/null +++ b/thesis/1-2-2-refinement-types.tex @@ -0,0 +1,108 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + +\subsection{Refinement Types} + +Since the inception of set theory, mathematicians used \emph{set comprehension} to build sets. For example, + +\[ + \Set{ n | n \text{ is a prime number}} +\] + +\noindent denotes the set of all prime numbers. On the left-hand side of the $|$ is a comprehension variable and on the right-hand side is a predicate. The above is an example of \emph{unrestricted comprehension} because $n$ can stand for any mathematical object. Allowing unrestricted comprehension leads to Russel’s paradox, though, which is why it is not used in modern mathematics anymore. There is a weaker version of comprehension which works well, \emph{restricted comprehension}. In it, the variable has to be restricted to a set, like the following: + +\[ + \Set{ n ∈ ℕ | n \text{ is a prime number}} +\] + +\noindent This means that for every set $X$ and every predicate $P$, + +\[ + \Set{x ∈ X | P(x)} ⊆ X +\] + +\noindent which circumvents the paradox. \emph{Refinement types} are the type theoretic analogue to restricted comprehension. Some programming languages allow for refinement types. For example, in the Lean theorem prover, the introductory example would translate to + +\begin{listing}[H] + \begin{minted}[mathescape]{lean} + { n : $\BbbN$ // prime n } + -- Syntactic sugar for: + subtype nat prime + \end{minted} + \caption{Refinement types in Lean} + \label{lst:lean-subtype} +\end{listing} + +\inline{subtype} is a type constructor, that is a generic type, with two arguments. The first argument \inline{nat} is a type, but the second argument \inline{prime} is a value of function type. Rust has limited support for types depending on values with \emph{const generics}, but it does not support functions as const generics, which means that refinement types can’t be implemented in Rust this way.\todo{More on full dependent types, Pi and Sigma types, lambda cube?} There have been proposals for introducing more const generic capabilities to Rust\footnote{\url{https://github.com/ticki/rfcs/blob/pi-types/text/0000-pi-types.md}}, but they are put into cold storage indefinitely. We can still imagine what the equivalent Rust code would be if it could be written: + +\begin{listing} + \begin{minted}{rust} + struct Prop<const B: Bool>; + + impl<const B: bool> Prop<B> { + const fn new() -> Self { + if B { + Self + } else { + panic!() + } + } + } + + struct Subtype<T, const pred: fn(T) -> bool> { + val: T, + proof: Prop<pred(val)> + } + + impl<T, const pred: fn(T) -> bool> Subtype<T, pred> { + const fn new(val: T) -> Self { + Self { + val, + proof: Prop::new::<pred(val)>() + } + } + } + + // Also implement `Deref` and `DerefMut` so that a + // `Subtype<T, pred>` can be used in any place where a + // `T` is expected. + \end{minted} + \caption{The Rusty equivalent.} +\end{listing} + +\noindent\inline{Prop} is a helper type which can only be created from a true value. We need this so that we can proof that \inline{val} makes \inline{prop} true. This would now allow us to create constants that have certain proven properties like in \autoref{lst:proven-const}. + +\begin{lstlisting}[language=Rust, caption={Constants with proven properties}, label={lst:proven-const}] +const PRIME: Subtype<i32, is_prime> = Subtype::new(17); +const SMALL: Subtype<u32, |x| x <= 255> = Subtype::new(9); +\end{lstlisting} + +\noindent Because this is cumbersome, it would also be nice to extend the syntax for easier definition of a refinement type and let the compiler automatically insert the \inline{Subtype::new}. + +\begin{lstlisting}[language=Rust, caption={Constants with proven properties}, label={lst:proven-const-pretty}] +const PRIME: Subtype where is_prime = 17; +const SMALL: Subtype where |x| x <= 255 = 9; +\end{lstlisting} + +\noindent Refinements could of course only be checked at compiletime for constants. If a user wants to create a runtime value, the checks are also at runtime. + +\subsubsection{Refinement Types in Functions} + +In the previous example, the predicate only took one input, namely the comprehension variable itself. And as long as only simple types are refined, this is the only possibility. But refinement types really shine when they depend on multiple inputs. The prime example is a function concatenating two arrays: + +\begin{lstlisting}[language=Rust, caption={Concatenate two arrays, with a deprendent output type}, label={lst:concat-refined}] +fn concat<T, const N: usize, const M: usize>( + first: &[T; N], + second: &[T; M], +) -> [T; N + M] { + // ... +} +\end{lstlisting} + +\noindent Here, the type of the return value depends on the generic values of the inputs. This, again, only works if \inline{N} and \inline{M} are known at compile time. + +\todo[inline]{Runtime dependent function (e. g. reverse?)} +\todo[inline]{Intersection types} + +\end{document} \ No newline at end of file diff --git a/thesis/1-introduction.tex b/thesis/1-introduction.tex index fd446a9395576c35a521ac302e492c80c088dd31..952072fabe2ff5b1f37b68c28a3a41acb536acaf 100644 --- a/thesis/1-introduction.tex +++ b/thesis/1-introduction.tex @@ -1,3 +1,7 @@ +% arara: lualatex: {shell: true} +\documentclass[thesis]{subfiles} +\begin{document} + \chapter{Introduction} The goal of this thesis is to combine the concept of an \acl{OBS} that the Rust programming language supports with Refinement Types as found in some functional languages like ATS and Liquid Haskell and theorem provers like Lean. @@ -12,607 +16,17 @@ Therefore, most programming languages automate memory management with \ac{GC}. I A different approach is using \ac{RAII} like C++ does for some of its types and which does static analysis on the code to insert cleanup code at the correct places. Rust is a new language that embraces this kind of memory management in its \ac{OBS}. In this section we will explore the intricacies of \ac{OBS} as Rust uses it. -\subsection{Ownership} - -In most programming languages, one can freely create and delete resources. For example, in C the programmer allocates memory on the heap with \inline{malloc} and frees the memory when they don’t need the memory anymore using \inline{free}. - -\begin{lstlisting}[language=C, caption={Memory management in C}] - // Allocate space for an `int`. - int *ptr = (int*) malloc(sizeof(int)); - - // ... - - // Free the memory. - free(ptr); -\end{lstlisting} - -Even in languages without manual memory management, the programmer still must manage other resources. For example, to work with a file in Python one could write: - -\begin{lstlisting}[language=Python, caption={Manual resource management in Python}] - # Open the file and store the file handle in a variable. - file = open("path/to/some.file", "r") - -# ... - -# Close the file -file.close() -\end{lstlisting} - -\subsubsection{Resource Handling Strategies} - -This manual resource management comes with some difficulties. The programmer has to watch out to always free the resources after using them, because otherwise the program could leak memory or unnecessarily inhibit other processes from accessing the resources. On the other hand, one may also not free the resources too early, which would lead to a use after free, and also not free resources multiple times. This can be very tricky in complex programs where resources are shared between threads or over \acs{API} boundaries. - -That is why several programming languages have ways to make handling resources easier. Python has context managers that close resources automatically. - -\begin{lstlisting}[language=Python, caption={Context managers in Python}] -with open("path/to/some.file", "r") as file: - #... - -# At this point `file` is closed. -\end{lstlisting} - -This code is equivalent to the previous one, but even better: It also closes the file in case of an exception. Programmers can also define custom context managers for their own resources. - -Other languages have different strategies. Go supports the \inline{defer} statement which takes an operation that is not to be run immediately but at the end of the scope. - -\begin{lstlisting}[language=Go, caption={The \inline{defer} statement in Go}] -{ // Block starts here. - file, err := os.Open("path/to/some.file") - defer file.Close() - - // ... - -} // Block ends here; run the deferred close file -\end{lstlisting} - -These prevent double frees and use after free errors, but they can still be forgotten. A programmer who wants to use a resource must be aware that it has to be cleaned up -in the end and changes in one line of the code have to be mirrored in the other. It is not as bad as in manual resource management because the programmer has to change only one additional line and that line is usually close. - -\subsubsection{\acs{RAII} and Drop Responsibilities} - -A third variant is \acfi{RAII}, also called \acfi{SBRM}, used by C++ and Rust. This ensures that the destructor of an object is run when it goes out of scope, which will clean up automatically. An example is the \inline{std::unique_ptr} in C++. The pointer \emph{owns} the memory it points to, meaning when it is created, it automatically allocates memory on the heap and when it goes out of scope, it frees the memory. +\subfile{1-1-1-ownership} -\begin{lstlisting}[language=C++, caption={\ac{RAII} in C++}] -{ // Block starts here. - std::unique_ptr<int> ptr = std::make_unique<int>(42); - - // ... - -} // Block ends here; run the destructor and free the memory -\end{lstlisting} - -\ac{RAII} ensures that resources are always freed when they are not live (that is, may not be used) anymore. The programmer can’t forget to free the resources nor cause use after free or double free. While in C++ only certain types conform to \ac{RAII}, in Rust every value is owned by a variable. For example, the Rust analog to an owning pointer is \inline{Box}. - -\begin{lstlisting}[language=Rust,caption={Heap allocation in Rust}] -{ // Block starts here. - let ptr = Box::new(42) - - // ... - -} // Block ends here; drop value and free memory -\end{lstlisting} - -As soon as a variable falls out of scope, it is \emph{dropped}, meaning all -associated resources are freed. It is said that a variable has a \emph{drop responsibility}. A programmer can define custom drop behavior for their own resources by implementing the \inline{Drop} trait for their type, but normally the compiler does it for us. - -\subsubsection{Move Semantics} - -\ac{RAII} comes with a few disadvantages, though. Since a value is dropped when its owner goes out of scope, there must always be exactly one owner for every value. This means that the compiler will transfer ownership sometimes, like in the following example. - -\begin{lstlisting}[language=Rust,caption={Ownership transfer},label={lst:ownership-transfer}] -fn print_value(ptr: Box<i32>) { - println!("{}", *ptr); -} - -fn main() { - let mut ptr = Box::new(42); - - print_value(ptr); - - *ptr += 1; -} -\end{lstlisting} - -We create a pointer and use it as an argument to a function that prints the pointee’s value. After that we increment the value. But if we try to compile this, we get an error: - -\begin{lstlisting} -error[E0382]: use of moved value: `*ptr` -\end{lstlisting} - -Because we supplied \inline{ptr} as an argument to \inline{print_value}, it took ownership of that value, and now has the drop responsibility. As soon as the call to \inline{print_value} is finished, the memory is freed and accessing it afterwards is a use after free. We say that the value was \emph{moved} out of \inline{ptr} and into \inline{print_value}. - -\subsubsection{Clone} - -We can still make it work by \emph{cloning} \inline{ptr} and supply the clone as an argument. \inline{ptr.clone()} creates a new \inline{Box}, allocates new memory, and initializes it with the same value as \inline{ptr}’s: - -\begin{lstlisting}[language=Rust, caption={Usage of \inline{Clone}}] -fn print_value(ptr: Box<i32>) { - println!("{}", *ptr); - } - -fn main() { - let ptr = Box::new(42); - - print_value(ptr.clone()); - - *ptr += 1; -} -\end{lstlisting} - -Now everything works but keep in mind that the clone is completely independent from the original value. If we changed the value of the clone in \inline{print_value}, it would not be visible to the outside. - -\subsubsection{Copy} - -We previously noted that ownership, and therefore move semantics, apply to every value in Rust. But if we change the code to use an \inline{i32} directly instead of putting it on the heap, we don’t need to clone anything: - -\begin{lstlisting}[language=Rust, caption={Number types are \inline{Copy}}] -fn print_value(number: i32) { - println!("{}", number); - } - -fn main() { - let mut number = 42; - - print_value(number); - - number += 1; -} -\end{lstlisting} +\subfile{1-1-2-borrowing} -This compiles and works. That is not because \inline{i32} breaks move semantics but because the type implements the \inline{Copy} trait. Normally, if a value is moved, the physical bits making up that value are moved to a new location, e. g. the new stack frame, and are not available at the previous position anymore. If a type implements \inline{Copy}, the bit pattern is instead copied over and retained. Because \inline{i32} doesn’t allocate any heap or handles any resources, such a copy is valid. A \inline{Box} is not \inline{Copy}, because then two owners for the same resource would exist which would violate drop responsibility. If we want to duplicate a \inline{Box}, we need to allocate new memory on the heap and initialize it properly, which is what \inline{clone} does. +\subfile{1-1-3-reference-conversion} -\subsubsection{ManuallyDrop} - -Sometimes we don’t want \ac{RAII} to happen, we want to free resources ourselves. If that is the case, we can wrap a value in a \inline{ManuallyDrop} which tells the compiler to not drop it for us. This is a general trend in Rust: The correct way should be the easiest to do, and all potentially unsafe constructs are opt-in. More information on \inline{ManuallyDrop} can be found in the documentation for the type.\footnote{\url{https://doc.rust-lang.org/stable/std/mem/struct.ManuallyDrop.html}} - - -\subsection{Borrowing} - -As we have seen, ownership and move semantics ensure memory safety. But they are not easy to use, \inline{clone}ing data has a runtime overhead, and many correct programs are rejected. For example, \autoref{lst:ownership-transfer} describes a program that would work if it were not for move semantics. - -That is where references and borrowing come in. Instead of taking ownership of a value, a function can only borrow it through a reference. Then the drop responsibility stays with the caller. References, of course, can not be used for everything, but for our case it is sufficient. We mark the argument to \inline{print_value} as a reference using \inline{&}, and creating a reference from a value works the same. - -\begin{lstlisting}[language=Rust, caption={Borrow using references to prevent a move}, label={lst:borrow}] -fn print_value(ptr: &Box<i32>) { - println!("{}", *ptr); -} - -fn main() { - let mut ptr = Box::new(42); - - print_value(&ptr); - - *ptr += 1; -} -\end{lstlisting} - -\autoref{lst:borrow} compiles without an error and does what we would expect. There are two kinds of references. We just looked at \emph{shared} or \emph{immutable references}. The other kind is \emph{exclusive} or \emph{mutable references}, and they are denoted with \inline{&mut}. The different kinds of references have a different semantics attached to them. While both are used to access values without taking ownership, there are specific rules for the creation and guarantees associated with each: - -\begin{itemize} - \item A shared reference to a value can always be created, as long as there is no exclusive reference to the same value. It is not possible to mutate the pointee through a shared reference (which is why it is also called immutable). - \item An exclusive reference can only be created if there is no other reference to the value at all and the referenced value is declared mutable. An exclusive reference can do everything the owner can, except dropping. -\end{itemize} - -The part of the compiler that enforces these rules is the \emph{borrow checker}. The reasoning behind this is again resource safety, namely preventing \emph{unguarded mutable aliasing}. Having multiple readers of the same data doesn’t cause issues, as long as the data cannot be mutated. Mutating data is fine, as long as no one else can read and/or mutate the data at the same time. Using these two kinds of references enforces, at compile time, that we will always stay on the happy path. But there are some programs that are correct even though they violate these rules. We will be concerned with extending the borrow checker to accept more correct programs in later sections. - -\subsubsection{The Owner of a Borrowed Value} - -References have to agree to these rules, but the owner has to as well. While shared references exist, the owner may not mutate, e. g. \inline{let mut x = 42; let xref = &x; x += 1} is forbidden. Similarly, the owner can’t access a value at all as long as there is an exclusive reference around. - -\todo[inline]{ - \textbf{This doesn’t not work, I have to think about this some more. I want something to throw Error E0507.} - - In particular, a value may not be moved, as long as there is a reference to it because that would create a dangling pointer. This means that \inline{let y = &x; let z = *y;} doesn’t work. - - %\begin{lstlisting}[language=Rust, caption={Moving out of borrowed value}] - % let x = 42; - % let xref = &x; - % let y = x; - %\end{lstlisting} - -} - -\subsubsection{Lexical Lifetimes and Lifetime Analysis} - -The parts of the program in which a reference is valid is called its \emph{lifetime}. The borrow checker keeps track of all references’ lifetimes and their so-called \emph{provenances}, that is, who the original owner of the referenced value is. References are values like all others and so they are dropped at the end of the scope they are defined in. In other words: Their lifetime starts at their creation and ends at the end of the scope. This is called a \emph{lexical lifetime}. Therefore the code in \autoref{lst:lexical-lifetimes} does not compile. - -\begin{lstlisting}[language=Rust, caption={Lexical Lifetimes prevent usage}, label={lst:lexical-lifetimes}] -let mut x = 0; -{ - let xref = &x; // `xref` is created - println!("{xref}"); - x += 1; // error -} // `xref` is dropped -x += 1; // OK, no references exist anymore -\end{lstlisting} - -The program in \autoref{lst:lexical-lifetimes} is rejected by our borrow checker since it registers a mutation to \inline{x} while a reference to it still exists. Only after the block ends, the borrow is returned to the owner and it can be used again. Similarly, we would not be able to create an exclusive reference while a shared one exists and vice versa. - -But we can see that the program is correct since \inline{xref} is never accessed after \inline{x} is changed and we can save the code by introducing an additional scope. - -\begin{lstlisting}[language=Rust, caption={More scopes make the code check}, label={lst:saving-lexical-lifetimes}] -let mut x = 0; -{ - { - let xref = &x; // `xref` is created - println!("{xref}"); - } // `xref` is dropped - x += 1; -} -x += 1; -\end{lstlisting} - -This is tedious, though, and we would like the borrow checker to recognize patterns like this automatically. Luckily, Rust implements an improved borrow checker since version 1.31. Because it is not based on lexical scope for lifetime analysis anymore, these new lifetimes are called \acfi{NLL}. We will take a closer look at them in \autoref{subsec:non-lexical-lifetimes}. - -\subsubsection{Interior Mutability and Access Guards} - -The borrow rules are enforced at compile time, but sometimes the compiler can’t know if a piece of code follows the rules. Because of this, there is a runtime-checked version of reference with \inline{RefCell} which causes a panic if the borrow rules are broken. Also, sometimes we need access to a resource from multiple places at the same time, for example when sharing data between threads. For this, Rust provides the \inline{Mutex} container type. References to a mutex can be shared freely, but to change the value in the container, one has to acquire a lock, therefore making the access \emph{guarded}. While the mutex is locked, no other thread can access the data at all, as if the thread held a \inline{&mut}. Alternatively, \inline{RwLock} can give out both read and write locks which have behavior analogous to \inline{&} and \inline{&mut}, respectively. There are more constructs for similar use cases in the standard library, like \inline{Arc} and \inline{Cow}. - -These types are implemented using the \inline{Cell} type which can change a value even if it is not marked as mutable. This is called \emph{interior mutability} and uses \emph{unsafe Rust} internally, a superset of standard, safe Rust. The main feature of unsafe Rust is access to so-called \emph{raw pointers}. Raw pointers are, like pointers in C, not borrow checked by the compiler. Since borrow checking is undecidable, the compiler sometimes can’t prove that, for example, two references don’t alias. In these cases, the programmer can step in, prove the non-aliasing manually, implement a feature using raw pointers and provide a safe abstraction for consumers of the code.\todo{Pointers to Strict Provenance, Miri, …?} - -\subsubsection{Returning references and Borrow-through} - -Functions can receive references as arguments, but they can also return references. One has to be a bit careful when doing this, though, since all resources created in the scope of a function are freed as soon as the function returns. Consider the following: - -\begin{lstlisting}[language=Rust, caption={Try to return a reference}] -fn to_ref(number: i32) -> &i32 { - &number -} -\end{lstlisting} - -This fails because \inline{number} is owned by \inline{to_ref} and is dropped as soon as the function returns. The reference would already be invalid when the caller gets access to it. But if the function takes a reference as an argument, it can pass another reference back to the caller. This is called a \emph{borrow-through} and looks like the following: - -\begin{lstlisting}[language=Rust, caption={Borrow-through}] -fn to_ref(number: &i32) -> &i32 { - &number -} -\end{lstlisting} - -Now Rust can couple both references to each other. The returned reference has the same provenance as the input one, and the borrow checker can check that the input reference’s lifetime is at least as long as the returned one. - -\subsubsection{Lifetime Polymorphism} - -Consider a function that takes two references and returns one reference itself. Imagine for example, we have two slices and want to create an iterator over all elements of both slices. Then we could try to write a function like in \autoref{lst:two-refs}. The \inline{impl Iterator<Item=&T>} is a so-called existential type. It just means that we don’t care about what type exactly it is as long as it is an iterator. - -\begin{lstlisting}[language=Rust, caption={Iterator over two slices}, label={lst:two-refs}] -fn both(first: &[i32], second: &[i32]) - -> impl Iterator<Item=&i32> -{ - first.iter().chain(second.iter()) -} -\end{lstlisting} - -This does not work however. - -\begin{lstlisting} -error[E0106]: missing lifetime specifier -\end{lstlisting} - -The compiler is confused because it does not know which provenance and lifetime to assign to the returned references. We as programmers can see that the return value depends on both input references, so we can help Rust by providing lifetime hints.\footnote{Actually, the compiler could infer these lifetimes but it would rely on global program analysis to do so. It is also possible to infer function types, but Rust chose to always be explicit about the types and lifetimes you use and have them be explicit parts of the API of a function.} - -\begin{lstlisting}[language=Rust, caption={Iterator over two slices}, label={lst:lifetime-parameters}] -fn both<'a>(first: &'a [i32], second: &'a [i32]) - -> impl Iterator<Item=&'a i32> -{ - first.iter().chain(second.iter()) -} -\end{lstlisting} - -\inline{'a} is a \emph{polymorphic lifetime parameter}. We tell Rust that the output relies on both inputs being valid for at least as long as itself. If the output relied on only one of the arguments, we would mark only one of them with a lifetime specifier. This would for example be the case if we want to search for an element in a slice and, if it exists, return a mutable reference to that element. We implement such a function in \autoref{lst:two-refs-one-dependent}. - -\begin{lstlisting}[language=Rust, caption={The output lifetime depends only on one of the inputs}, label={lst:two-refs-one-dependent}] -fn find_mut<'a>(haystack: &'a mut [i32], needle: &i32) - -> Option<&'a mut i32> -{ - haystack.iter_mut().find(|x| **x == *needle) -} -\end{lstlisting} - -\subsection{Reference Conversions} - -Sometimes, we have a reference to one type but need a reference to another, similar type. For example, a vector of some type \inline{Vec<T>} is conceptually the same as a slice of that same type \inline{[T]}, except that a \inline{Vec} can grow and shrink and a slice can not. This means that all functions which operate on slices should also work with vectors, and in fact there is a library method \inline{Vec::as_slice} that takes a reference to a vector and provides a reference to a slice. Similarly there is \inline{String::as_str} which transforms a \inline{&String} into a \inline{&str}. - -\subsubsection{AsRef} - -Generally, there are many types that can act as a substitute for another. The common interface for this behavior is the \inline{AsRef} trait. There can be many implementations of this trait for a given type. for example, \inline{String} can stand in for \inline{str}, \inline{[u8]}, \inline{OsStr} and \inline{Path}. Every time a reference to one of these types are needed, we can use a \inline{String} instead, if we first call \inline{as_ref} on it: - -\begin{lstlisting}[language=Rust, caption={Use \inline{String} and \inline{as_ref} in place of \inline{[u8]}}, label={lst:as_ref_bytes}] -fn needs_bytes(x: &[u8]) { /*...*/ } - -// ... - -let s = String::from("Hello Bytes"); -needs_bytes(s.as_ref()); -\end{lstlisting} +\subfile{1-1-4-non-lexical-lifetimes} -\subsubsection{Deref} +\subfile{1-1-5-reborrows} -Always calling \inline{as_ref} is a bit cumbersome, especially if it is clear which type the target should be. For example, if we have a \inline{Box<T>} there is really only one reasonable type we want to get out of it, namely a \inline{&T}. The trait \inline{Deref} can be used for that. It specifies a single type which Rust automatically converts into if it is needed. For example, \inline{String} has \inline{str} as its \inline{Deref} target which allows us to omit the \inline{as_ref} in that case and leave the conversion implicit: - -\begin{lstlisting}[language=Rust, caption={Use \inline{String} in place of \inline{[u8]}}] -fn needs_str(x: &str) { /*...*/ } - -// ... - -let s = String::from("Hello Bytes"); -needs_bytes(s); -\end{lstlisting} - -We couldn’t do the same in \autoref{lst:as_ref_bytes} because \inline{[u8]} is not \inline{Deref} target of \inline{String}. - -\subsubsection{Borrow} - -Sometimes we want to express an even stronger connection between two types. For example, a \inline{HashMap} needs to take ownership of its entries but we want to be able to do a lookup even if we only have a reference. But that requires that the owned and the borrowed value behave exactly the same. This is what the \inline{Borrow} trait signals. In particular, \inline{x.borrow() == y.borrow()} if and only if \inline{x == y} and \inline{x.hash() == x.borrow().hash()}. - -\subsubsection{Mutable reference conversions} - -All three of the aforementioned traits work for shared references only, but their variants \inline{AsMut}, \inline{DerefMut} and \inline{BorrowMut} all take and provide an exclusive reference. - -\subsection{Non-Lexical Lifetimes}\label{subsec:non-lexical-lifetimes} - -In \autoref{lst:lexical-lifetimes} we saw how lexical lifetimes stood in our way and we had to manually wrap a reference in an additional block just to make our correct program pass the borrow checker. This was because the lifetime of any value is bound to its scope, at the end of which it is dropped. \acl{NLL} are a different approach to lifetime analysis. A reference is live for as long as it may be used later. The borrow checker determines the points of the program in which the reference is live by building a \acfi{CFG}. All nodes of the \ac{CFG} that can be reached from the point of creation until the last use of the reference are where it is live. Consider the program in \autoref{lst:non-lexical-lifetimes}. - -\begin{lstlisting}[language=Rust, caption={Simple \ac{NLL} example}, label={lst:non-lexical-lifetimes}] -let mut x = 0; -let xref = &x; -println!("{xref}"); -x += 1; -\end{lstlisting} - -Then the (simplified) \ac{CFG} looks something like \autoref{fig:cfg-non-lexical-lifetimes}. It is quite boring because the control flow is linear. The nodes with thick borders are where \inline{xref} is live. The first node is the one in which the reference is created and the second one is where it is last used. Because \inline{x} is not modified in that section of the \ac{CFG}, the borrow checker doesn’t complain. - -\begin{figure}[!ht] - \centering - \begin{tikzpicture} - \begin{scope}[every node/.style={draw, rectangle}] - \node (A) at (0,4.5) {\inline{let mut x = 0;}}; - \node[ultra thick] (B) at (0,3) {\inline{let xref = &x;}}; - \node[ultra thick] (C) at (0,1.5) {\inline{println!("\{xref\}");}}; - \node (D) at (0,0) {\inline{x += 1;}}; - \end{scope} - \path [->] (A) edge node[left] {} (B); - \path [->, ultra thick] (B) edge node[left] {} (C); - \path [->] (C) edge node[left] {} (D); - \end{tikzpicture} - \caption{Control-flow diagram for \autoref{lst:non-lexical-lifetimes}} - \label{fig:cfg-non-lexical-lifetimes} -\end{figure} - -Consider a more complicated example that includes branching in \autoref{lst:non-lexical-lifetimes-branch}. - -\begin{lstlisting}[language=Rust, caption={\ac{NLL} example with braching}, label={lst:non-lexical-lifetimes-branch}] -let mut x = 0; -let xref = &x; -if xref != &0 { - x += 1; - println!("xref equals {xref}"); -} else { - println!("xref is 0"); -} -x = 42; -\end{lstlisting} - -Now the \ac{CFG} (\autoref{fig:cfg-non-lexical-lifetimes-branch}) splits up to accommodate both possible paths the program could take during execution. At the \inline{if}, the graph splits into two, but in the last node, both paths join up again. Here we can see that in the \inline{false} branch no problem occurs, but in the \inline{true} branch we try to modify \inline{x} even though \inline{xref} is used later in the same path. Modifying \inline{x} on the last line does not pose any problems since xref is not live anymore on any path that leads to this point. - -\begin{figure}[!ht] - \centering - \begin{tikzpicture} - \begin{scope}[every node/.style={draw, rectangle}] - \node (A) at (0,7.5) {\inline{let mut x = 0;}}; - \node[ultra thick] (B) at (0,6) {\inline{let xref = &x;}}; - \node[ultra thick] (C) at (0,4.5) {\inline{if xref != &0 }}; - \node[ultra thick, draw=red] (D) at (-3,3) {\inline{x += 1;}}; - \node[ultra thick] (E) at (-3,1.5) {\inline{println!("xref equals \{xref\}");}}; - \node (G)[ultra thick] at (3,2.5) {\inline{println!("xref is 0");}}; - \node (H) at (0,0) {\inline{x = 42;}}; - \end{scope} - \path [->] (A) edge node[left] {} (B); - \path [->, ultra thick] (B) edge node[left] {} (C); - \path [->, ultra thick] (C) edge node[left] {\inline{true}} (D); - \path [->, ultra thick] (D) edge node[left] {} (E); - \path [->, ultra thick] (C) edge node[right] {\inline{false}} (G); - \path [->] (G) edge node[left] {} (H); - \path [->] (E) edge node[left] {} (H); - \end{tikzpicture} - \caption{Control-flow diagram for \autoref{lst:non-lexical-lifetimes-branch}} - \label{fig:cfg-non-lexical-lifetimes-branch} -\end{figure} - -\subsection{Reborrows and Two-Phase Borrowing} - -\ac{NLL} give us a lot more freedom when using references, but there are still programs that are clearly correct but don’t pass the borrow checker, especially if exclusive references are in play. It is not possible to create a new reference of some provenance as long as an exclusive reference with the same provenance is live. This means the code in \autoref{lst:conflicting-borrows} will not compile. - -\begin{lstlisting}[language=Rust, caption={Cannot create a shared reference while an exclusive one exists}, label={lst:conflicting-borrows}] -let mut x = 1; -let ref1 = &mut x; -let ref2 = &x; // Error: Cannot borrow `x` -println!("{ref2}"); -*ref1 = 2; -\end{lstlisting} - -But since \inline{xs} is not live anymore when \inline{xu} is used, aliasing is never occurring, and if we hadn’t used an exclusive reference but modified the value directly, Rust would have been able to see this. Not being able to create references in this manner is a big thing. Consider the code in \autoref{lst:moved-borrow}. Here, we create a reference to an empty \inline{Vec} and then use it to \inline{push} a value. This moves the reference into the \inline{push} method so that it is dropped when the method returns. It is no longer valid to use it again in the next line since it has already been dropped. But in fact this code compiles because Rust implicitly inserts a \emph{reborrow} for us. - -\begin{lstlisting}[language=Rust, caption={Move a borrow into a function}, label={lst:moved-borrow}] -let mut v = Vec::new(); -let vref = &mut v; -vref.push(1); -vref.last(); // Error: Use of moved value `vref` -\end{lstlisting} - -\subsubsection{Reborrows} - -Let’s get back to the example in \autoref{lst:conflicting-borrows} for now. The compiler complains because we try to create two conflicting references with the same provenance. But we can tell the compiler to temporarily deactivate a reference by borrowing \emph{through this reference}. This is done in \autoref{lst:simple-reborrow}. - -\begin{lstlisting}[language=Rust, caption={Reborrow through an exclusive reference}, label={lst:simple-reborrow}] -let mut x = 1; -let ref1 = &mut x; -let ref2 = & *ref1; // Reborrow -println!("{ref2}"); -*ref1 = 2; -\end{lstlisting} - -Now we can create and use the \inline{ref2} as long as it is not live anymore when we use \inline{ref1} again. If the last two lines were swapped, the borrow checker would correctly find the interleaved use and throw an error. It is possible to nest reborrows, as long as they are never used interleaved. These reborrows can be imagined as a stack: when creating a reborrow, the new reference is pushed to the stack, and when using a reference, the stack is popped until it is on the topmost position. Trying to access a reference that has already been popped indicates an interleaved use and is a borrow error. The owner of the value can then be thought of as the bottom element of the stack, and when it goes out of scope, the whole stack is unwound so that the \inline{drop} function can access it. In fact, this is exactly how reborrows are modelled in Miri, which is an interpreter for Rusts \acfi{MIR}. \todo{Source, e. g. Rustonomicon} Miri even has borrow stacks for raw pointers. - -But why does the code in \autoref{lst:moved-borrow} work? The compiler can insert borrows and reborrows in function calls, so-called \emph{autorefs}. This is because it can be sure that the function returns before the reference is accessed again, meaning that the borrow stack is kept intact. In fact, if the compiler can’t proof that the borrow stack is valid, it will complain. This can for example happen when using \inline{std::thread::spawn} which creates a new thread. Because the child thread could outlive the main thread, Rust can’t verify that all references are used in the correct order, in particular that the owner is not dropped too early. - -\subsubsection{Two-Phase Borrowing} - -Reborrows won’t solve every problem, though. Consider the code in \autoref{lst:two-phase-borrow}. Here we create a vector and then push its length onto it. - -\begin{lstlisting}[language=Rust, caption={Borrow twice in one method call}, label={lst:two-phase-borrow}] -let mut v = Vec::new(); -v.push(v.len()); -\end{lstlisting} - -The problem lies in the second line. The first argument to \inline{push} is an \inline{&mut self}, and so the compiler implicitly borrows from \inline{v}. But then another (shared) reference is created for the call to \inline{len}. This is not allowed, and a reborrow doesn’t help either since the \inline{&mut self} is currently used. On the other hand, it is clear that the call to \inline{len} will definitely return before the exclusive reference is ever accessed. In fact, we can work around this problem if we save the result of \inline{v.len()} in a temporary variable, as shown in \autoref{lst:two-phase-borrow-workaround}. - -\begin{lstlisting}[language=Rust, caption={Workaround for \autoref{lst:two-phase-borrow}}, label={lst:two-phase-borrow-workaround}] -let mut v = Vec::new(); -let vlen = v.len(); -v.push(vlen); -\end{lstlisting} - -This pattern—calling a method with an exclusive reference but also using shared references in the arguments—is very common and the workaround is unwieldy. Therefore \acs{RFC} 2025 \cite{rfc2025nestedmethodcalls} introduces the concept of a \ac{TPB}. In it, the lifetime of an exclusive reference is split up into two phases. During the \emph{reservation phase}, it is treated like a shared reference, meaning more shared references can be created and reads are possible. The \emph{activation} happens as soon as the reference is first used to perform a mutation. From this point on, it is treated as a full exclusive reference. - -\subsubsection{Generalized \ac{TPB}} - -Right now, a \ac{TPB} may only happen in a few specific places, namely when calling a method with \inline{&mut self} as first argument and with assignment operators (\inline{+=}, \inline{*=}, …), and the other references can only be shared. There are several ways in which the requirements can be relaxed and which are discussed in the RFC and Issue \#49434\footnote{\url{https://github.com/rust-lang/rust/issues/49434}}. - -Currently, the only place an exclusive reference is allowed is in \inline{&mut self} position. This means that it is not possible to use an exclusive reference twice, and a reborrow is also not possible because the references have no names. One could relax this and allow exclusive references everywhere, and also mix shared and exclusive references. This poses the difficulty that a shared reference (or one during reservation) can observe change. Resolving this makes the model more complicated. - -In a similar vein, \ac{TPB} could be expanded to all function calls or even in inline code, as shown in \autoref{lst:inline-tpb}. Here, creating a shared reference is not allowed because an exclusive one already exists and is live. With generalized \ac{TPB} it would be okay, though, since \inline{xm} is only in reservation phase at that point. - -\begin{lstlisting}[language=Rust, caption={Example of inline \ac{TPB}}, label={lst:inline-tpb}] -let mut x = 1; -let xm = &mut x; -let xr = &x; // Error -*xm = 2; -\end{lstlisting} - -Even further, \emph{discontinuous borrows} would make it possible that a borrow is only active at exactly the points at which it is used and deactivated in between. Also, if the mutation capabilities of the reference is not used anymore, it could be \emph{downgraded} to a shared reference: For example, a function could take an exclusive reference to a value but then returns only a shared one. - -All of these changes would entail a big step up in complexity and surface area of the language without actually having much practical use since relevant situations are rare. They are discussed in detail in \cite{rfc2025nestedmethodcalls}. - -\subsection{Loans and Regions} - -Previously, we looked at the advantages of \acl{NLL} over standard lexical ones. But lifetimes have a general defect and are supposed to be replaced by an approach using \emph{regions} employed by the \emph{Polonius borrow checker}, which is currently in an experimental stage. Here, we will explore how \ac{NLL} and regions work under the hood, and what the advantage of using regions is. - -% \begin{lstlisting}[language=Rust, caption={Simple \ac{NLL} example}, label={lst:non-lexical-lifetimes}] -% let mut x = 0; -% let xref = &x; -% println!("{xref}"); -% x += 1; -% \end{lstlisting} - -\subsubsection{Borrow Errors} - -We need to define some vocabulary first: A \emph{path} is an identifier like \inline{x}, or is built from a path by a field access \inline{x.f}, a dereference \inline{*x}, or an index \inline{x[i]}. Those can be freely combined, so that for example \inline{(*x)[5].f} is a valid path.\footnote{Paths have a rough equivalent in C and C++ lvalues.} - -A \emph{loan} is the result of a borrow expression. It consists of the path which is borrowed from and a \emph{mode}, that is shared or exclusive. - -A loan $L$ is \emph{violated} if its path $P$ is accessed in an incompatible way, that is, $P$ is mutated when the $L$ is shared, or $P$ is accessed at all when $L$ is exclusive. Note that an access can also be \emph{indirect} if $P$ is shows up somewhere in the expression. For example, \inline{(*x)[5].f} accesses \inline{(*x)[5]}, \inline{*x}, and \inline{x} indirectly. Note also that a loan to an index ignores the index variable, that is \inline{x[5]} and \inline{x[4]} produce the same loan to \inline{x[_]}. This is because Rust can generally not know at compile time if two index operations alias. It means that it is impossible to have two exclusive references to different parts of a data structure like in \autoref{lst:two-indices}. - -\begin{lstlisting}[language=Rust, caption={Indexing twice into a vector is illlegal}, label={lst:two-indices}] -let mut v = vec![1, 2]; -two_refs(&mut v[0], &mut v[1]); -\end{lstlisting} - -Now we can define when a borrow error should occur. There are three conditions which all have to be met: - -\begin{enumerate} - \item A path $P$ is accessed at some node $N$ of the \ac{CFG}, - \item accessing $P$ at $N$ would violate some loan $L$, and - \item $L$ is live at $N$. -\end{enumerate} - -Different approaches to borrow checking only differ in determining when $L$ is live. For example, with lexical lifetimes a loan is simply live from its creation until the end of the lexical scope. We are now prepared to dive into liveness analysis in \ac{NLL} and Polonius. - -\subsubsection{Classic \ac{NLL}} - -Under \ac{NLL} the liveness of a loan is derived from the lifetime of a reference. As discussed in \autoref{subsec:non-lexical-lifetimes}, a reference is live in a node of the \ac{CFG} if it may be used later. This means that we walk forward along the \ac{CFG} to determine the liveness of the reference and the corresponding loan is live exactly when the reference is. Crucially, if a function returns a reference, it is live for the whole body of the function. \todo{lifetime subtyping and inferred lifetimes/constraints, “outlive†relationship, \inline{'a : 'b}} - -\begin{lstlisting}[language=Rust, caption={\ac{NLL} reject correct program}, label={lst:nll-reject-correct}] -fn first_or_insert(v: &mut Vec<i32>) -> &i32 { - let fst = v.first(); - match fst { - Some(x) => x, - None => {v.push(1); &v[0]}, - } -} -\end{lstlisting} - -\autoref{lst:nll-reject-correct} shows an example. In the \inline{Some} branch, \inline{x} is returned from the function, which in turn depends on the reference produced by \inline{fst}. Because \inline{x} is returned from the function, it needs to be live at least until the end of the body of \inline{first_or_insert}. But since \inline{x} is derived from \inline{fst}, that reference must outlive \inline{x}, hence being live for the whole function body as well. In the \inline{None} branch, an exclusive reference to \inline{v} is created for the call to \inline{push}. This produces an error because that node lies on a path between the creation of \inline{fst} and the return point.\todo{Do the listing again with lifetime annotations?} - -This should not happen. We can see that \inline{fst} is not actually used when we go through the \inline{None} arm because a different reference is returned in that case. However, \ac{NLL} can’t accommodate this situation because \inline{fst} \emph{may} be used later, see \autoref{fig:cfg-nll-reject-correct}. A similar problem for map data structures has led to the \inline{Entry} \ac{API} which uses unsafe Rust to work around this limitation. - -\begin{figure}[!ht] - \centering - \begin{tikzpicture} - \begin{scope}[every node/.style={draw, rectangle}] - \node[ultra thick] (Z) at (0,4) {\inline{let fst = v.first();}}; - \node[ultra thick] (A) at (0,3) {\inline{match fst}}; - \node[ultra thick] (B) at (-1,1.5) {\inline{x}}; - \node[ultra thick, draw=red] (C) at (1,2) {\inline{v.push(1);}}; - \node[ultra thick] (D) at (1,1) {\inline{&v[0]}}; - \node[ultra thick] (E) at (0,0) {\inline{return}}; - \end{scope} - \path [->, ultra thick] (Z) edge node[left] {} (A); - \path [->, ultra thick] (A) edge node[left] {\inline{Some}} (B); - \path [->, ultra thick] (A) edge node[right] {\inline{None}} (C); - \path [->, ultra thick] (C) edge node[left] {} (D); - \path [->, ultra thick] (B) edge node[left] {} (E); - \path [->, ultra thick] (D) edge node[left] {} (E); - \end{tikzpicture} - \caption{Control-flow diagram for \autoref{lst:nll-reject-correct}.} - \label{fig:cfg-nll-reject-correct} -\end{figure} - -\subsubsection{Polonius} - -With lifetimes, we looked \emph{forwards} from the borrow expression to see how long a reference (and therefore the loan) is live. Polonius goes \emph{backwards} from a point of use to see if there is still a live reference. Polonius doesn’t use lifetimes but \emph{regions} to determine the liveness of a loan. - -A \emph{region}\footnote{Polonius calls regions \emph{origins}, which is a more telling name, but regions is the more standard term.} is a set of loans. Each reference has a region consisting of all loans it may depend on. A fresh reference created from an owned value has a region consisting of just one loan, but a reference, e. g. if returned by a function, could depend on several inputs and therefore have a region of several loans. Note that in this step we don’t care about how long a reference is valid, we don’t go forward in the \ac{CFG}. Instead, we only consider previous nodes to determine regions. - -Now, a loan $L$ is live at some node $N$, if there is some variable which is live at $N$ and contains $L$ in its region. This difference means that different paths through the \ac{CFG} are independent from each other, because a node in one path can’t see a node in the other one by walking back the \ac{CFG}. - -Let’s look at the example from \autoref{lst:nll-reject-correct} with all regions made explicit. \lstinline~x'{L1, L2}~ denotes that expression \inline{x} has a region consisting of the two loans \inline{L1} and \inline{L2}. - -\begin{lstlisting}[language=Rust, caption={Example with region annotations}, label={lst:regions}] -fn first_or_insert(v'{L0}: &mut Vec<i32>) - -> &'{L0, L1, L3} i32 -{ - let fst'{L0, L1} = v.first()'{L0, L1}; - match fst { - Some(x'{L0, L1}) => x, - None => {v.push(1)'{L0, L2}; &v[0]'{L0, L3}}, - } -} -\end{lstlisting} - -In \autoref{lst:regions} we can see that there are four relevant loans: \inline{L0} is the loan of the reference we got passed in. All references depend on it. \inline{first} creates a reference with loan \inline{L1} that is returned in the \inline{Some} branch, \inline{push} implicitly reborrows to push a value onto \inline{*v}. The final reference is created by the index operation and it may also be returned. Therefore, the return value has a region \lstinline~'{L0, L1, L2}~, because those three loans are what it may depend on. - -Under \ac{NLL}, the \inline{push} was not possible because \inline{x} being live and depending on \inline{fst} meant that \inline{fst} was live. With Polonius, we must check if there is any live variable that has a nonempty intersection with \lstinline~'{L0, L2}~. \inline{fst} and \inline{x} are not live, so they don’t pose a problem, even if the regions overlap. \inline{v} is live and there is a region overlap, but since the compiler inserts a reborrow, it is not a problem. There could still be an error if the borrow stack were invalidated at a later point, but since Polonius is only looking backwards, this is not something we have to consider here. There are no more live variables, so the node passes the borrow check. - -\subsubsection{Self-referential Structs} - -Sometimes we want to have structures that contain references to parts of itself.\footnote{A real-world example of this are futures which must store the point of execution they are in. Currently they are a special case in the compiler and can’t be expressed in user-code.} Consider the struct in \autoref{lst:self-referential}, in which we store some data and additionally provide a view into a part of the data. The \inline{window} field contains a reference and it must be tied to soe value on the stack, but it is impossible to assign a lifetime to it. Putting a \inline{'data} is not valid Rust at the moment. - -With regions, however, this could work. Polonius could check that when creating a \inline{View}, the reference supplied to the constructor is pointing to the \inline{data} argument. - -\begin{lstlisting}[language=Rust, caption={Self-referential struct}, label={lst:self-referential}] -struct View<T> { - data: Vec<T>, - window: &'data [T] -} -\end{lstlisting} - -\subsubsection{Polonius the Crab} - -Since Polonius is still unstable and will be so for the forseeable future, there is a library\footnote{\url{https://docs.rs/polonius-the-crab/0.3.0/polonius_the_crab/}} defining a macro which implements the Polonius logic in stable Rust, using some unsafe code. - -\subsubsection{Cyclone} - -\emph{Cyclone} \cite{grossman2002region} is a dialect of C that introduces additional safe pointer types which are checked using regions. Instead of Polonius’ abstract regions, the regions in Cyclone represent concrete parts of memory. +\subfile{1-1-6-loans-and-regions} \section{Contracts and Refinement Types}\label{sec:contracts-refinements} @@ -624,126 +38,9 @@ Strong typing rejects some correct programs but also prevents many potential bug In this section, we will focus on two related approaches of specifying contracts, that are very useful without being overly complex: Contract programming inserts explicit run-time checks into a program to ensure that it is never in an illegal state. Refinement types introduce compile-time checks. Finally, we will discuss hybrid typing, an approach to combine contract programming and refinement types. -\subsection{Contract Programming} - -\todo[inline]{Example: Assertions} - -Programming languages like Eiffel and D have native support for contract programming. There is also a library\footnote{Found under \url{https://gitlab.com/karroffel/contracts}} that brings design by contract to Rust. Contract programming enables three tools: - -\begin{enumerate} - \item \emph{Preconditions} are checked at the beginning of some piece of code, - \item \emph{Postconditions} are checked after some piece of code has run, and - \item \emph{Invariants} must always hold for some piece of data. -\end{enumerate} - -\todo[inline]{Example bank transfer: A bank account may never have a balance below lino of credit, a transfer precondition is that there is enough money in the first account, a postcondition is that the balancs are updated. Use Eiffel, D, or Rust/contracts?} - -\todo[inline]{loop invariants/conditions} - -\todo[inline]{Higher-order contracts: \href{https://www2.ccs.neu.edu/racket/pubs/icfp2002-ff.pdf}{Findler, Felleisen}} - -\subsection{Refinement Types} - -Since the inception of set theory, mathematicians used \emph{set comprehension} to build sets. For example, - -\[ - \Set{ n | n \text{ is a prime number}} -\] - -\noindent denotes the set of all prime numbers. On the left-hand side of the $|$ is a comprehension variable and on the right-hand side is a predicate. The above is an example of \emph{unrestricted comprehension} because $n$ can stand for any mathematical object. Allowing unrestricted comprehension leads to Russel’s paradox, though, which is why it is not used in modern mathematics anymore. There is a weaker version of comprehension which works well, \emph{restricted comprehension}. In it, the variable has to be restricted to a set, like the following: - -\[ - \Set{ n ∈ ℕ | n \text{ is a prime number}} -\] - -\noindent This means that for every set $X$ and every predicate $P$, - -\[ - \Set{x ∈ X | P(x)} ⊆ X -\] - -\noindent which circumvents the paradox. \emph{Refinement types} are the type theoretic analogue to restricted comprehension. Some programming languages allow for refinement types. For example, in the Lean theorem prover, the introductory example would translate to - -\begin{listing}[H] - \begin{minted}[mathescape]{lean} - { n : $\BbbN$ // prime n } - -- Syntactic sugar for: - subtype nat prime - \end{minted} - \caption{Refinement types in Lean} - \label{lst:lean-subtype} -\end{listing} - -\inline{subtype} is a type constructor, that is a generic type, with two arguments. The first argument \inline{nat} is a type, but the second argument \inline{prime} is a value of function type. Rust has limited support for types depending on values with \emph{const generics}, but it does not support functions as const generics, which means that refinement types can’t be implemented in Rust this way.\todo{More on full dependent types, Pi and Sigma types, lambda cube?} There have been proposals for introducing more const generic capabilities to Rust\footnote{\url{https://github.com/ticki/rfcs/blob/pi-types/text/0000-pi-types.md}}, but they are put into cold storage indefinitely. We can still imagine what the equivalent Rust code would be if it could be written: - -\begin{listing} - \begin{minted}{rust} - struct Prop<const B: Bool>; - - impl<const B: bool> Prop<B> { - const fn new() -> Self { - if B { - Self - } else { - panic!() - } - } - } - - struct Subtype<T, const pred: fn(T) -> bool> { - val: T, - proof: Prop<pred(val)> - } - - impl<T, const pred: fn(T) -> bool> Subtype<T, pred> { - const fn new(val: T) -> Self { - Self { - val, - proof: Prop::new::<pred(val)>() - } - } - } +\subfile{1-2-1-contract-programming} - // Also implement `Deref` and `DerefMut` so that a - // `Subtype<T, pred>` can be used in any place where a - // `T` is expected. - \end{minted} - \caption{The Rusty equivalent.} -\end{listing} - -\noindent\inline{Prop} is a helper type which can only be created from a true value. We need this so that we can proof that \inline{val} makes \inline{prop} true. This would now allow us to create constants that have certain proven properties like in \autoref{lst:proven-const}. - -\begin{lstlisting}[language=Rust, caption={Constants with proven properties}, label={lst:proven-const}] -const PRIME: Subtype<i32, is_prime> = Subtype::new(17); -const SMALL: Subtype<u32, |x| x <= 255> = Subtype::new(9); -\end{lstlisting} - -\noindent Because this is cumbersome, it would also be nice to extend the syntax for easier definition of a refinement type and let the compiler automatically insert the \inline{Subtype::new}. - -\begin{lstlisting}[language=Rust, caption={Constants with proven properties}, label={lst:proven-const-pretty}] -const PRIME: Subtype where is_prime = 17; -const SMALL: Subtype where |x| x <= 255 = 9; -\end{lstlisting} - -\noindent Refinements could of course only be checked at compiletime for constants. If a user wants to create a runtime value, the checks are also at runtime. - -\subsubsection{Refinement Types in Functions} - -In the previous example, the predicate only took one input, namely the comprehension variable itself. And as long as only simple types are refined, this is the only possibility. But refinement types really shine when they depend on multiple inputs. The prime example is a function concatenating two arrays: - -\begin{lstlisting}[language=Rust, caption={Concatenate two arrays, with a deprendent output type}, label={lst:concat-refined}] -fn concat<T, const N: usize, const M: usize>( - first: &[T; N], - second: &[T; M], -) -> [T; N + M] { - // ... -} -\end{lstlisting} - -\noindent Here, the type of the return value depends on the generic values of the inputs. This, again, only works if \inline{N} and \inline{M} are known at compile time. - -\todo[inline]{Runtime dependent function (e. g. reverse?)} -\todo[inline]{Intersection types} +\subfile{1-2-2-refinement-types} \subsection{Liquid Types and SMT Solving} \begin{enumerate} @@ -810,4 +107,6 @@ fn do_stuff_if_some<T : Debug>(x : Option<T>) { } println!("{:?}", x?); // Infallible at this point } -\end{lstlisting} \ No newline at end of file +\end{lstlisting} + +\end{document} \ No newline at end of file diff --git a/thesis/thesis.pdf b/thesis/thesis.pdf index 8b680559ab9a5f9dea6c27c2adcf396581b5aecf..6b7b1ea6bc0137311e1665d04997372f5bec5f40 100644 --- a/thesis/thesis.pdf +++ b/thesis/thesis.pdf @@ -1,3 +1,3 @@ version https://git-lfs.github.com/spec/v1 -oid sha256:de4a21c7a52e4dbaa0695fd49499a49cd9f4dd23a4d8fcefc296b218d1cd2a7e -size 274 +oid sha256:0a28168ad5751b468c4a7f47b678d3729290f472766c33abdd7c07a38d1af964 +size 178188 diff --git a/thesis/thesis.tex b/thesis/thesis.tex index 23243965bf78bc047420d62f5e278b5a4d6a9e9d..c2bb4d089ed1ec1c1f22491e40f0a2fcaa871b0d 100644 --- a/thesis/thesis.tex +++ b/thesis/thesis.tex @@ -14,7 +14,7 @@ \tableofcontents -\include{1-introduction} +\subfile{1-introduction} \section*{Abbreviations}