It's approximately the second occurrence of 9:26 in my timezone, so I'm making a thread, mostly so I can rant about misunderstandings about appearances of finite sequences in the digits of pi. http://en.wikipedia.org/wiki/Pi

So approximately every 2pi radians around the sun or so, we come across the meme that pi contains every single finite sequence of decimal digits somewhere in its expansion. This may be true! But it's a mathematical statement, and thus is subject to a test that is, if not stronger, perhaps perpendicular to truth: provability. Mathematicians aren't supposed to accept statements unless they're proved or are specifically stated to be unfounded assumptions for the purpose of hypotheticals or counterfactuals. So let's examine what we can say about the digits of pi.

Pi is an irrational number, which means that there is no finite string of digits such that pi is eventually just that string repeated over and over again. Some people take this to mean that pi is "random", but that's not the case. Pi is a constant. We can compute what its digits are, so it's certainly not unpredictable in any way once we know that we're dealing with pi. Furthermore, merely being irrational doesn't mean it's patternless. Champernowne's number, 0.1234567891011121314..., is irrational, but I'm pretty sure we can all guess what comes next. http://en.wikipedia.org/wiki/Champernowne_constant

Champernowne's number also has the property that, given any finite string of decimal digits, you're going to find it eventually somewhere in the expansion. This property is called being a disjunctive number. http://en.wikipedia.org/wiki/Disjunctive_sequence

Puzzle 1: suppose the string starts with a bunch of 0s. Champernowne's number is built out of integers that don't start with a bunch of 0s, but the string appears anyway. Where?

It is also not the case that pi, by virtue of being irrational, is necessarily disjunctive.
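For anyone who wants to poke at Puzzle 1 empirically, here's a small Python sketch (mine, not part of the puzzle; the comments contain a spoiler) that builds a prefix of Champernowne's number and searches it:

```python
# Sketch: build a finite prefix of Champernowne's number by concatenating
# the decimal expansions of 1, 2, 3, ... and search it for digit strings.
def champernowne_prefix(last_integer):
    return "".join(str(k) for k in range(1, last_integer + 1))

digits = champernowne_prefix(8000)

# No integer starts with 0, but "0007" still shows up, straddling a
# boundary: ...7000 7001... concatenates to "70007001", which contains it.
print(digits.find("0007") >= 0)
```

That boundary-straddling trick is the answer to the puzzle in general: pad enough digits onto the front of the string to make it the tail of one integer followed by the head of the next.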
The Liouville constants, which have some nonzero digit in the (k!)-th decimal place and 0 everywhere else, are irrational, but if all of those nonzero digits are 1 then we certainly don't get every finite string. http://en.wikipedia.org/wiki/Liouville_number

Before anyone asks, pi is also transcendental, but that also doesn't tell us anything. Champernowne's number and the Liouville constants are all transcendental.

Actually, we expect pi to have a property that's stronger than being disjunctive: we expect pi to be normal. A normal number is one where, for any fixed k, all strings of length k occur equally often. This is impossible for finite expansions, because you can pick k to be the exact length of the expansion, and then that's the only string of that length that can occur, so a normal number must have an infinitely long expansion. http://en.wikipedia.org/wiki/Normal_number

So what do we know about pi? Very little! We have methods for computing the digits of pi out to any finite accuracy we want, but we don't know much about the infinitely long tail that remains uncomputed. For instance, we don't even know that, say, the digit 7 appears an infinite number of times. http://www.maa.org/sites/default/files/pdf/pubs/BaileyBorweinPiDay.pdf

So what, one may ask, if there are only, say, 10^{106} 7s or whatever? Well, it doesn't matter much to most people, but to a mathematician, the rebuttal is that almost all* finite strings of decimal digits contain at least 10^{106}+1 7s, so if there are only 10^{106} 7s, then pi is missing almost all finite strings! That's certainly not disjunctive at all!

Speaking of almost all, almost all** real numbers are normal. This tells us absolutely nothing about pi, because pi is not a randomly chosen real number, any more than 1 is, and 1 is certainly not normal. Contrast with the fact that almost all real numbers are noncomputable, but we have several dozen methods for computing pi.
So any property of "almost all real numbers" could hit or miss pi. There have been statistical tests performed on the known digits of pi, and they are consistent with being disjunctive, even consistent with the stronger statement of normality. But these are statistical tests, and while that level of evidence might be good enough for scientists, science is not math, and the tests that work for scientists don't work for mathematicians.

Other numbers expected to be but not known to be normal: e, log(2), sqrt(2), Euler's constant, or indeed almost any real number not specifically constructed to be normal.

See also: http://mathoverflow.net/questions/23547/does-pi-contain-1000-consecutive-zeroes-in-base-10

*Almost all in the following sense: consider the proportion of strings of length at most k that contain at least 10^{106}+1 7s out of all of the strings of length at most k. As k goes to infinity, the proportion goes to 1.

**Almost all in the Lebesgue sense, i.e. the complement has measure 0. A perhaps more intuitive picture: if you had a function f that is 0 at normal numbers and 1 at nonnormal numbers, and you integrated f from -infinity to +infinity, you'd get 0.
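For flavor, here's the crudest possible version of such a statistical test, run on just the first 50 decimal digits of pi (hardcoded below; the real tests use billions of digits):

```python
from collections import Counter

# The first 50 decimal digits of pi, hardcoded. This tiny sample is only
# meant to show the shape of a digit-frequency test, nothing more.
PI_50 = "14159265358979323846264338327950288419716939937510"

freq = Counter(PI_50)
# For a normal number we'd expect each digit to show up about 5 times in
# 50; the actual counts wobble, as any finite sample will.
for d in "0123456789":
    print(d, freq[d])
```

Of course, no finite amount of this ever proves anything about the infinite tail, which is the whole point of the post.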
I note that a lot of human evaluations of "probability" are much influenced by the distinction between "looks random" and "looks non-random". It is intuitively obvious to nearly everyone that you're more likely to encounter "2791473" in a given region of something like pi or e than you are to encounter "0000000", but they are in fact equally probable, in general. The distinction is that one of them is more "notable", but notability is not a thing we have coherent definitions for. The famous joke is that there are no uninteresting positive integers; consider that if there were, there would be a smallest uninteresting positive integer, and that would be interesting.
re: probability When Baby first learned to play poker she got a hand that was A K Q J 10 all of the same suit. She thought it was special, until it was pointed out that it had the same odds as any other hand. She became addicted to math.
We can actually distinguish between more or less notable strings in at least one way, although it's not entirely natural. Kolmogorov complexity with respect to a given programming language* is the minimum length of a program in that language needed to output the given string**. The uncomputable numbers mentioned in the first post are numbers with infinite Kolmogorov complexity in any Turing-complete language.

Our understanding of randomness tends to believe that strings with higher Kolmogorov complexity are more expected in random sequences, at least partially because Kolmogorov complexity aligns somewhat with predictability; an easily predictable sequence often has a short program to generate it, and that short program thus puts an upper bound on the Kolmogorov complexity.

There's also the fact that we have a naive, learned distribution on strings and thus a sense of entropy on strings via Shannon information. Strings with high Shannon information tend to be viewed as unpredictable and thus "more likely to occur" in a sequence we're told is random, just as you're supposedly more likely to end up in a high-entropy physical state than in a low-entropy one, even though the probability of landing in any particular physical state has nothing to do with the entropy.

So we believe that 000000 is unlikely, and we believe that a royal flush is unlikely, and they are; it's just that "not 000000" and "not a royal flush" are large sets of things that are, individually, equally unlikely, but are harder to distinguish because we don't care.

*Assuming that the language has a finite number of distinct characters.

**Technically not output, because the string is allowed to be infinite, but rather: given any number n, the program will in finite runtime output the first n characters of the string, but not necessarily halt after doing so.
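True Kolmogorov complexity is uncomputable, but compressed length is a crude, computable stand-in that shows the same qualitative gap. A sketch (my code; zlib is obviously not an optimal describer):

```python
import random
import zlib

# Crude, computable stand-in for Kolmogorov complexity: length in bytes
# after zlib compression. (Real K-complexity is uncomputable.)
def compressed_len(s):
    return len(zlib.compress(s.encode()))

low = "0" * 1000                      # very predictable
random.seed(0)                        # fixed seed, so this is reproducible
high = "".join(random.choice("0123456789") for _ in range(1000))

# The predictable string compresses far better, matching the intuition
# that it has a much shorter description.
print(compressed_len(low), compressed_len(high))
```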
***My posts are going to be full of footnotes like this regarding technical definitions and caveats, because the caveats are often the best parts of pure mathematics.
I am pretty sure a royal flush is "special" even if it's not statistically less likely than any other specific set of five cards. There is a fascinating result I saw once, which is that the most even distribution of suits in a bridge hand is 4-3-3-3, but the most common is 4-4-3-2. Double-checked that. (I originally picked it up as a sample of "interesting things in HAKMEM" from the Jargon File, I think.)
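Since I double-checked it by hand anyway, here's a quick Python sketch (mine, not from HAKMEM) that computes the exact probabilities:

```python
from collections import Counter
from math import comb, factorial

# Exact probability that a 13-card bridge hand has a given multiset of
# suit lengths, e.g. (4, 4, 3, 2).
def suit_pattern_prob(pattern):
    ways = 1
    for length in pattern:               # choose the cards within each suit
        ways *= comb(13, length)
    perms = factorial(4)                 # assign the lengths to the 4 suits,
    for c in Counter(pattern).values():  # dividing out repeated lengths
        perms //= factorial(c)
    return ways * perms / comb(52, 13)

# 4-4-3-2 comes out near 21.6%; the flattest pattern 4-3-3-3 near 10.5%.
print(suit_pattern_prob((4, 4, 3, 2)), suit_pattern_prob((4, 3, 3, 3)))
```

The intuition for the gap: there are 12 ways to hand the lengths 4-4-3-2 to the four suits, but only 4 ways for 4-3-3-3, and that multiplicity outweighs the flatter pattern's slightly larger per-assignment count.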
So definitions. Terminology and vocabulary are what we use to communicate ideas. Those of us who like to be understood use definitions to explain what a given piece of terminology means, and those of us who like to understand will ask for definitions if we don't know a given piece of terminology, or suspect we know it in a different way than the speaker does. A lot of arguments in academia or academic settings are about definitions of things, about usage of terms and such. "Words have meanings" blah blah blah.

Mathematicians also use a lot of definitions. But I think, and many of my colleagues agree, that a "definition" in mathematics is somewhat different from a "definition" in the colloquial sense. A definition in the colloquial sense tries to attach a meaning to a term. Colloquially, we have experiences, and we have what we consider to be the real world, and these provide external references that we can use to attach meanings to the words that we use to communicate. We can say "rabbit" and point to a rabbit, and some fairly common assumptions will tell the listener that we mean "rabbit" and not, for instance, "bundle of rabbit parts", and the listener will thus get an idea of what "rabbit" means, and from these kinds of concrete, demonstrable things we build up more complicated, more abstract definitions. And then there's a large amount of leeway, because the example rabbit is not all rabbits, real and possible, dead and living and unborn, and because the building up is fairly sloppy, relying on people having roughly the same set of concrete definitions. But the hope is that the definitions match pretty well, for some unspoken, undefined notion of "pretty well".

In mathematics, we also build more complicated, abstract definitions from simpler definitions, but there's nothing at the bottom. No rabbits for us. We tried.
At the end of the 19th century and the beginning of the 20th century, there was hope that we could firmly ground mathematics without necessarily tying it to physical objects. But results in mathematical logic and philosophy proved that the endeavour was doomed, and later results* showed that it was a bad idea anyway. But without a firm ground, without those concrete definitions, we can't rely on meaning.** But we still like the notion of "truth". In the absence of meaning, truth is hard to get at, so instead what we do is proof. And indeed, that is one way of looking at proof; as one of my colleagues describes it, a proof is "truth without meaning, without content". Most proofs are for mathematicians, each of whom has an idea of what any given term in their area of interest means. But there is no need for them to have the same idea, or at least that is the hope. Rather, a proof of a theorem is intended to convince a given mathematician, with a given understanding of the terms involved, that the statement of the theorem is true when interpreted via their understanding of those terms. This includes computers, which, as far as we know, assign no meaning to terms; they just trace through chains of definitions. So if we want to prove things to computers, we need to have all those definitions in place, even if at the end they don't have any meaning to them.*** Humans don't like this most of the time. A given mathematician, armed with their given idea of what the terms mean, doesn't usually want most of the definitions, because that idea does most of the work that the definitions are supposed to do. When you say "sphere", most mathematicians summon up an idea of what a sphere is pretty immediately, and will understand when you talk about properties of the sphere in terms of that idea. 
But those ideas of what a sphere is can be quite different, not incompatible but with very different associations and connotations, and the theorem in question might not be true for both ideas. So you need to tighten the definition to make sure that everyone is thinking about the right kind of sphere, including the computer which isn't thinking about any ideas and is just looking at the words. The precision that mathematicians are famous for is a necessity, not a joy. So the end result is that mathematics often relies on a lot of definitions and mathematicians are often unhappy with this. In the words of seeb's mathematician father, "a definition is the obituary of an idea", but that's the price we pay for trying to get truth that's independent of meaning. Note: The above might seem to imply that I believe that colloquial, meaning-based definitions are less prone to two people working from the same definition coming to different understandings. I don't believe that at all; I am convinced that given two people working from the same definition, they might come to perfectly equivalent understandings, indistinguishable just by talking to them, but that this is rare and that the best we can hope for is that the mismatch is with regards to edge cases while the usual case is that the mismatch is hard to convey in words. Mathematicians are just more open about the issue, trying to use it as a feature rather than regarding it as a bug. *E.g. category theory, a lot of which is looking for provably valid analogies. **Note that this is with regards only to the externally visible parts of mathematics, the communications and proofs and such. What a given mathematician believes regarding what happens when they do mathematics can be quite different. 
***Think of Searle's Chinese room: an English-speaking man who understands no Chinese receives messages in Chinese, matches the symbols to a list of rules he has in English, and then produces a response, again in Chinese, according to this set of rules, in such a way that an observer who understands Chinese would think that there is a person who understands Chinese inside the room, rather than a man who understands no Chinese and a list of rules. We need to write that set of rules. Usually this thought experiment is meant to ask: since the man understands no Chinese, where is the understanding in this situation?
Reminder to myself to make a post about the hole at the bottom of mathematics. Because it's hilarious that not only do we not know what we're talking about, we can prove that we can't know what we're talking about, and we can prove that we can't know to what extent we don't know what we're talking about. Also should probably make a post about network theory, since that's a good example of applications of mathematics and mathematical thinking to real-world problems in a way that people don't see at all given the odd emphasis on calculus as the end goal of math education.
@Exohedron: incompleteness theory? I like that we can prove that we can't reach every truth by working from base principles.
Yeah, I'm going to talk about the incompleteness theorems and the rather bizarre set of things that do have complete theories. This has interesting consequences in things like computer science and ethics and their intersection: how to keep artificial intelligences from destroying/enslaving humanity.
The hole at the bottom of mathematics.

In my previous post I mentioned that there's a hole at the bottom of mathematics, in that ultimately we don't know exactly what we're talking about; we can't distinguish between things that act the same way or that are exactly translated into each other. For the most part we don't care; that's the power of mathematics, to strip away the details of "what is" and tell you "what does". Who cares if you're looking at water or money or electrons or racecars; the underlying behavior is the same, and the mathematics only looks at the behavior.

But there's a bigger hole, and this one might be important. Back in the late 19th and early 20th centuries, mathematicians dreamed that we could put mathematics on a solid foundation. We could say "all mathematics is describable using this small collection of things, and so we only need to know about this small collection of things." We settled on sets, which are collections of things, and it won't help to try to go any further as to what they are. Instead, we tried to nail down what they do.

So for instance, we think: given a set, we can think about all of the elements of the set that have a given property, and make a set of all and only the things that have that property. Given the set of horses, we can look at the subset of white horses. That should be fine, right? And if we have two sets, we should be able to consider the set containing everything in both sets. So what about all sets? Is there a set of all sets? There should be, if everything is going to be a set. But alas.

Namely, suppose that we had a set of all sets. Then for any property we can think of, we can look at the set of all sets that have that property. How about the property of not containing themselves? Let's consider the set S of exactly those sets that do not contain themselves. And you can say: sets can't contain themselves. Which is great, because that means that every set is in S.
So since S is a set, S is in S, and therefore S contains itself and therefore cannot be in S. So S doesn't contain itself and therefore belongs in the set S. And therefore...

This is Russell's paradox, telling us that we can't make a set of all sets. The thing of all sets must be something other than a set. "All sets" is not an object we can manipulate mathematically! So we can't just make everything a set. We run into the problem of, now that we know that some things aren't sets, where do we put the boundary? Where do we say "these things are not sets"? It turns out that there isn't a good way to find a boundary; anything we try is either too inclusive, admitting some things that lead to paradoxes, or too exclusive, leaving mathematics too weak to be useful or interesting; often both. There are branches of mathematical logic devoted to figuring out what goes wrong when we assume that a given thing is a set.

The current compromise is called Zermelo-Fraenkel set theory*. The thing of all sets is called a proper class, and we can't manipulate proper classes in ZF. But we do have some rules for how sets work, and these are fine for the most part, and most importantly we can't do ridiculous things like Russell's paradox.

So we don't know what sets are, and we don't know what sets aren't. We do have a decent idea of how sets behave. We can translate things into set theory, and we can prove them in set theory. It takes forever to do even basic things in set theory, but that's because set theory is the machine language of mathematics; you don't actually want to do mathematics in set theory, you just want to know that it's there and that it is, in theory, possible to translate into and out of it.

This post is kind of long already, so the next part, about Godel, will appear later.
*Often we include the Axiom of Choice, but that leads to fun things like the Banach-Tarski paradox, where you take a ball, split it into five parts, move the parts around using just rotations and sliding, moves that don't change the sizes of things, and reassemble the parts into two distinct balls, each the size of the original. So some people aren't happy with Choice and don't include it.
Before we go into the terrible things that happened with Godel, I want to take a short detour into counting. What does the "size" of a set mean? Well, we usually start with some bunch of objects and say "1, 2, 3,..." until we run out of objects. Or until we get bored and just guess. Mathematicians aren't allowed to get bored or guess, but we also aren't always going to run out of objects. If someone were to ask "how many numbers are there?", we need to find an answer. The eventual answer that we use these days came from Georg Cantor. His idea was that we talk about the sizes of sets by comparing them to other sets. If we had the set {a, b, c, d} and the set {P, Q, R, S}, then we want to say that they have the same number of things in them. And we can do that by matching them up, say, matching a to P, b to Q, c to R and d to S. Or alternatively we could match a to S, b to R, c to Q and d to P. In either case we have what is called a bijection, a one-to-one correspondence of things in the first set with things in the second set. Everything in {P, Q, R, S} is matched to something in {a, b, c, d}, and no two things in {a, b, c, d} are matched to the same thing in {P, Q, R, S}. We say that two sets have the same size, called "cardinality", if there is a bijection from one to the other. You can check that if there's a bijection from one set to another set, then there's automatically a bijection from the second set to the first set. This works fine for finite sets, but what about the set of natural numbers? {0, 1, 2, 3,...}? Well, if we had any finite set then we'd run out of things before we could match every natural number. So {0, 1, 2, 3,...} is infinite. We call its cardinality aleph_{0}. We also call this countable infinity, because it's the set of counting numbers. What about the set of even natural numbers? {0, 2, 4, 6, 8,...}? That's contained in the natural numbers, and is missing a bunch of numbers. So surely it must be smaller? 
It turns out that nope, it's the same size, because we have a bijection. We match each natural number n with the even number 2n. So 1 matches to 2, 2 matches to 4, 3 matches to 6, and so on. So for each natural number there is an even number, and vice versa. So there are just as many even numbers as there are natural numbers! And similarly, there are just as many odd numbers as natural numbers. So we have that aleph_{0} + aleph_{0} = aleph_{0}. Similarly, we can show that the integers, positive and negative, again have cardinality aleph_{0}, since the integers are basically just two copies of the natural numbers.

The tricky ones are whether there are the same number of rational numbers and the same number of real numbers. For the even numbers, between consecutive even numbers we're only skipping one natural number, so it's not so strange to think that we can make them up later. But for the rational numbers, well, between any pair of natural numbers there are infinitely many rational numbers, and even more real numbers.

But the cardinality of the set of rational numbers is just the same as the cardinality of the natural numbers! The construction works best with a picture, so I'm not going to try to reproduce it here. It boils down to writing each rational number as a pair of integers p/q, and then drawing a path on the plane that hits all the points with integer coordinates; the number p/q is matched with how far along the path you have to go to get to the point (p, q).

What about the real numbers? How many real numbers are there? Here's where Cantor's first big result comes in. It ends up looking a lot like Russell's paradox, which is why I wanted to mention this whole thing about cardinality. Specifically, we look at all the numbers between 0 and 1, written in binary. So we get points followed by infinitely long strings of 0s and 1s.* If a string isn't infinite, just tack on an infinite string of 0s.
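The path I didn't draw can be sketched in code, though: walk the diagonals p + q = 2, 3, 4, ... and skip any fraction you've already seen. (My sketch, using Python's Fraction to handle lowest terms; it only walks the positive rationals, but doubling up to get the negatives works the same way as for the integers.)

```python
from fractions import Fraction
from itertools import islice

# Sketch: enumerate the positive rationals by walking the diagonals
# p + q = 2, 3, 4, ... of the integer grid, skipping duplicates like 2/4.
# Each rational gets matched with its position in the walk: a bijection
# with the natural numbers.
def positive_rationals():
    seen = set()
    total = 2
    while True:
        for p in range(1, total):
            q = total - p
            f = Fraction(p, q)
            if f not in seen:
                seen.add(f)
                yield f
        total += 1

# The walk starts 1, 1/2, 2, 1/3, 3, ...
print(list(islice(positive_rationals(), 5)))
```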
Suppose that we had only countably infinitely many real numbers between 0 and 1. Then we could match each real number with a natural number, saying maybe that 0.10000... matches to 0, and 0.11010101... matches to 1, and 0.10010001000... matches to 2, and so on. Write them in a list.

Now we construct a new number not on our list. We say "for the nth digit of our new number, look at the nth digit of the nth number on the list, and make the nth digit of our new number different". So, based on our listing above, our new number would start with .0, since the first digit of the first number is 1, and then we'd say 0, since the second digit of the second number is 1, and then we'd say 1, since the third digit of the third number is 0, and so on, getting .001...

So now we have a number where, for any number n, the nth digit does not match the nth digit of the nth number, so it can't be equal to the nth number. So it can't be on our list! So no list we make can have all of the real numbers between 0 and 1 on it; since making the list is the same as assigning a natural number to each number, that means we can't make a bijection between the reals and the natural numbers. The set of reals is strictly bigger than the set of natural numbers! So we say that the set of real numbers is uncountable, since we can't even put them on a list.

This gives a general construction. We can match strings of 0s and 1s to subsets of the natural numbers: the subset for a given string contains the natural number n if the nth digit of the string is 1. For a given set S, we call the set of subsets of S the powerset of S, written P(S), and we've just shown that P(natural numbers) has the same cardinality as the reals, which is bigger than the cardinality of the natural numbers.

But with subsets we can keep doing the same thing. Suppose that we have a set S, and we have the subsets of S. Assume that S and P(S) have the same cardinality. Then we have a bijection, call it f, from elements of S to subsets of S.
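The finite version of the diagonal construction fits in a few lines (my sketch, on a toy list of equal-length binary strings):

```python
# Sketch: Cantor's diagonal trick on a finite list of equal-length binary
# strings. The output differs from the nth row in the nth digit, so it
# can't equal any row on the list.
def diagonal_escape(rows):
    return "".join("0" if rows[n][n] == "1" else "1" for n in range(len(rows)))

rows = ["1000", "1101", "1001", "0000"]
print(diagonal_escape(rows))  # differs from every row on the list
```

For a finite list this is no surprise, since there are more strings than rows anyway; the force of the argument is that the same construction works even when the list is infinite.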
Now look for the subset R of S such that k is contained in R exactly when k is not contained in f(k). This is the equivalent of our new number. Since we assumed that f is a bijection, there must be an element r of S such that f(r) = R. Is r in R?

This is Russell's paradox in miniature, and indeed Russell was heavily inspired by Cantor's work. It gives us that there are always more elements in P(S) than there are in S, that P(S) has strictly greater cardinality. So we have that the real numbers, being matched with the powerset of the natural numbers, have greater cardinality. The powerset of the real numbers is even larger, and the powerset of the powerset of the real numbers is larger still. So we have a tower of infinities, each larger than the last. How many? Too many; the collection of sizes of sets turns out to not be a set, by a paradox similar to Russell's.

Cantor left one big question open after everything he did. He showed that there are infinitely many infinities going upwards. But what about between the infinities he'd already found? Are there any cardinalities between aleph_{0} and the cardinality of the real numbers, often called the Continuum? He thought that there weren't, but couldn't prove it, which added terribly to his anxiety. The Continuum Hypothesis, as this belief came to be known, was first on David Hilbert's famous list of open problems in 1900. The problem was "resolved" in 1963**. I'll talk about the resolution in another post, because it is one of the big holes in mathematics.

Cantor got a lot of flak for his work on transfinite mathematics. Notable mathematicians scorned him as playing at nonsense.
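The construction of R can be run on a small finite example (my sketch; for a finite S no f can be a bijection anyway, since P(S) is strictly bigger, but the escape set is built the same way):

```python
# Sketch: for a set S and any function f from S to subsets of S,
# Cantor's R = {k in S : k not in f(k)} is never equal to any f(k),
# because R and f(k) disagree about k itself.
def cantor_escape(S, f):
    return {k for k in S if k not in f(k)}

S = {0, 1, 2}
f = {0: {0, 1}, 1: set(), 2: {1, 2}}  # an arbitrary f, as a dict
R = cantor_escape(S, f.get)
print(R)  # differs from f(0) about 0, from f(1) about 1, from f(2) about 2
```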
He got in trouble with the church, since showing that there was this tower of infinities threatened blasphemy; God is infinite and absolute, but if there's always a larger infinity... Cantor was in and out of sanatoriums, suffering from bad health and depression, and while he managed to get a position at the University of Halle, he retired after 30 years and died in poverty in a sanatorium in 1918, his work still under fire from respected parts of the mathematical community. While he's most well known for his set theory work, he also made contributions to several other fields of mathematics, influenced by his work on infinite sets. Who knows how he would have reacted to the resolution of the Continuum Hypothesis.

The moral of the story is that Cantors Georg was an outlier and shouldn't have been countable.***

*If we have an infinite string of 1s starting at some digit, that's the same as a 1 at the previous digit followed by an infinite string of 0s. It's the 1 = 0.99999... thing, but in binary.

**The quotation marks are intended to be scare quotes.

***Yes, this is the entire reason this post exists.
That statement was actually my very first reaction to the Spiders Georg meme, but I've never had a chance to use it until now.
Okay, now I'm going to attempt to describe Godel's Incompleteness theorems. Attempt, because they're convoluted and subtle.

So I've discussed the hole at the bottom of mathematics, where we say everything is sets and then can't say what sets are. This made a lot of people sad, but we've since gotten over it, to some extent. At the very least, even if we can't say what we're building on top of, we can at least be assured that the building itself is sound and sufficient, right? This was Hilbert's second problem: to show that the theory of arithmetic is complete and consistent. His first problem, the Continuum Hypothesis, may have been somewhat esoteric, but this problem was of obvious importance, since showing that arithmetic was either incomplete or inconsistent would be a terrible blow to the trustworthiness of mathematics. So you can probably guess how this story ends.

Let's talk about arithmetic, then. We can add natural numbers, and we can multiply natural numbers, and there are (countably) infinitely many of them. We can also use them to encode mathematical symbols. For instance, we can write 0 as 100 and 1 as 101, and + as 200, and since we can write any nonzero natural number using only 1s and +s, we can write any natural number using the three encoding symbols that we have. 2 becomes 101200101. We can also use other encodings for other symbols, for instance, writing 300 for =, 600 for ( and 900 for ), and 500 for a dummy variable, call it n maybe. And we can now encode first-order* statements as numbers, and we can encode sequences of statements as numbers. We call this the Godel encoding, which sends the statement S to the number G(S).

Each of our axioms becomes a number, and we can say that if a statement (possibly a long compound of statements) S proves a statement T from the axioms, then G(S) numproves G(T). The claim that a statement S is provable can be translated into the claim that there exists a number t such that t numproves G(S).
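The toy symbol encoding above is easy to mechanize; here's a sketch using exactly the codes from this post:

```python
# Sketch of the toy Godel encoding from this post: each symbol becomes a
# three-digit code, and a statement becomes the concatenation of the codes.
CODES = {"0": "100", "1": "101", "+": "200", "=": "300",
         "(": "600", ")": "900", "n": "500"}

def godel_number(statement):
    return int("".join(CODES[ch] for ch in statement))

# Writing 2 as 1+1, as in the post:
print(godel_number("1+1"))  # 101200101
```

Because every code is exactly three digits and the codes are distinct, the encoding can be unambiguously decoded, which is what lets arithmetic statements talk about other statements.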
But this claim, being an arithmetical statement about numbers, has its own Godel encoding. So the statement "S is provable" has a Godel encoding, and now we can use numbers to talk about what we can prove. In particular, we can define a property B such that B(y) is true if y is the Godel numbering of a provable statement.

Now we do the Cantor/Russell diagonalization trick and feed B to itself. We're looking for a variant that will give us an S such that S is true if and only if B(G(S)) is false. Note that S is not literally the negation of B(G(S)). Rather, it states that if you do a certain computation, you get the Godel encoding of an unprovable statement; the resulting number just happens to be the Godel encoding of S itself. A similar English statement is: "is unprovable when preceded by its own quotation." is unprovable when preceded by its own quotation.

Anyway, we end up with the statement S that states that if S is true, then S is unprovable. Suppose that S is true. Then S is unprovable, because that's what S claims. Suppose S is false. The only way for it to be false is if the conclusion is false; since the conclusion of S is that S is unprovable, rejecting the conclusion tells us that S is provable, and yet false. So either we have that our system is incomplete, in that there are true statements that it can't prove, or we get that our system is inconsistent**, in that it can prove false things. We would really rather not have the second option, so we go with the first and say that not all true things are provable.

But we can always simply add S to our set of axioms, right? Since we're assuming consistency? Well, then we get a new relation, numproves_with_S, and a new property B_with_S, and then we do the diagonalization trick and get a statement T that states that if T is true, then T is unprovable even if we assume S is true. And we can keep going like this. That's Godel's First Incompleteness theorem, and it shows that our building will never be high enough.
For any system of axioms that can model the natural numbers, we get a Godel encoding, and with it we get a statement "If I am true then I am unprovable from the axioms being used". The Second Incompleteness theorem builds on the first. We assumed consistency, so that S was true, but unprovable. It would be nice to know that we can actually assume consistency. Can we prove it? Nope. Suppose that our system can prove its own consistency. Then it can prove that S cannot be false, since otherwise the system would be inconsistent. But then we have a proof of S, and thus S must be false, and so our system is inconsistent! So we have that no system that can model the natural numbers can prove its own consistency! Well, fine, we'll just add in consistency as a new axiom. But alas, we're now in a bigger, more powerful system, and we can't prove that this new system is consistent using what we have. We'd have to add in an axiom that the new system is also consistent, and find ourselves in a yet bigger system. This is even worse than someone who tells you that he's telling the truth; at least that would be consistent with being both a liar and a truthteller. Here the assertion "I am consistent" cannot be consistently asserted by the system even if it's true. So Godel's two theorems tell us that it is not true that all true things are provable, and it is not provable that all provable things are true! It might be true that all provable things are true (but we'd never know), and it might be provable that all true things are provable (but then we'd necessarily be inconsistent). A variant of this construction, using the feed-your-own-code setup, is the Halting problem in computer science, named after William Halting***. Given a program, or a Turing machine, or whatever, if you run it on a given set of inputs you can ask whether it will ever finish running (ignoring things like memory limitations or tech support wondering where all the CPU cycles are going).
Some programs eventually halt, some do not. Of course, being programmers the obvious thing to do is to make a program H that will read the code of another program and tell you whether that program halts. And now we use H to get a program S such that S halts if the input code is for a program that does not halt, and S goes into an infinite loop if the input code is for a program that does halt. And then we feed S its own code. What happens? And similarly to how we get a tower of incomplete, possibly inconsistent mathematical systems, we can add in H as an oracle, i.e. a function with no program behind it, and get a new programming language that allows us to ask if programs written in the old programming language halt. And now we can ask if programs in the new programming language halt, and on we go. Another variant is with ethics for artificial intelligences. Some people are worried about artificial intelligence and how to make sure that it's friendly, or at least not unfriendly. But can we prove that even if we create a friendly artificial intelligence, its offspring will also be friendly? Or is it sadly the case that we can prove that we can't prove that the offspring will be friendly? Yudkowsky (of Harry Potter and the Methods of Rationality fame, amongst other things) and colleagues at the Machine Intelligence Research Institute have put out some papers indicating that the sad case is the one we have to live with. I haven't read them so I don't know if the proof holds, but given Godel's theorems it doesn't seem implausible. There are a few things being swept under the rug here, namely the arithmetic nature of numproves. There's a bunch of ways to state what we mean by arithmetic nature, but it boils down to being Turing-computable. 
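To see why just running programs doesn't settle the question, here's a small sketch: simulation gives only a semi-decision procedure. I'm using the Collatz map as the program under test, since whether it halts for every starting value is itself a famous open problem (the function name and setup are mine, purely for illustration).

```python
def run_bounded(n, max_steps):
    """Iterate the Collatz map from n. Return the step count if we
    reach 1 within max_steps, or None if the budget runs out.
    None means "don't know", not "doesn't halt": simulation alone is
    only a semi-decision procedure for halting."""
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(run_bounded(6, 100))   # 8: halting gets confirmed just by running
print(run_bounded(27, 10))   # None: no finite budget ever certifies non-halting
```

Running a program that halts eventually tells you so; running one that doesn't halt tells you nothing, no matter how long you wait. H would have to do something smarter than simulation, and the diagonal argument says nothing smart enough exists.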
So we've just gotten that there's no automated theorem prover for arithmetic; while a Turing machine can wander around the set of possible statements applying axioms and laws of inference, it won't necessarily hit every true statement. Similarly provability is undecidable; no program can take statements about the natural numbers as input and always say whether they are provable or unprovable. A similar theorem, Tarski's undefinability theorem, states that there is no way to define, using just arithmetic, the notion of "truth" in arithmetic. There are a number of other axiomatic systems which are complete and decidable. For instance, the theory of algebraically closed fields is decidable, because you can't isolate the natural numbers using only the things available in the language of algebraically closed fields. Similarly, Euclidean geometry is decidable, because you can't build the natural numbers out of incidence relations between lines and points. Presburger arithmetic, which has defined addition but not multiplication, is also decidable. But Hilbert's hope, that we could find a decision algorithm for statements about the natural numbers with addition and multiplication, is dead. Our building is neither sound nor sufficient. This also opened the door to the notion of "independence", statements that could be either true or false for a given system of axioms. What do we mean by this? We mean that we can find a bunch of sets that obey the given system of axioms and for which the statement in question is true, and another bunch of sets that obey the given system of axioms and for which the given statement is false. This is the resolution of the Continuum Hypothesis: it's independent from the standard ZF axioms of set theory; there are versions of set theory for which there are no infinities between the cardinality of the natural numbers and the cardinality of the real numbers, and there are versions of set theory where there are however many you want. 
And just as with true and provable, there's no way to figure out ahead of time whether a statement will end up being independent or not. But despite this, despite the hole at the bottom and the hole in the supports and the hole at the top, mathematicians keep building anyway, because as far as we know the building hasn't collapsed yet. *First order, as in we can say "for each natural number" or "there exists a natural number such that" but not "for any subset of natural numbers" or "there exists a set of natural numbers such that". This also means we can't say "for any property that natural numbers can have" or "there exists a property that some natural numbers have such that...". The usual notion of induction is a second-order statement. We can make a first-order version by including an infinite number of first order statements, one for each possible statement about individual natural numbers. I'll talk a bit about this if I ever get around to talking about the nonstandard natural numbers. **Technically, not ω-consistent. Being consistent is not proving contradictions. Being ω-consistent is stronger than merely not proving contradictions. This is relevant to the last note, in that ω-consistency is like having the second-order version of the induction axiom, while regular consistency is like only having the infinitude of first-order statements. ***Speaking of things being false...
Two things that look like derivatives: Note: this post requires some familiarity with linear algebra and multivariable calculus. Not super technical stuff, but at least a familiarity with the concepts. In calculus, we usually define a derivative via some sort of limiting process. Take a function, evaluate at two points, take the difference of those values, and divide by the difference between the two points themselves, and then take the limit as the two points approach each other. And we end up with an operation that takes a function and spits out a function and obeys a bunch of rules. For instance, we have the addition rule: For functions f and g, (f + g)' = f' + g' We also have the scalar multiplication rule: For a function f and a constant number c, (cf)' = c(f') Together these tell us that differentiation is a linear transformation on the vector space of functions. So that's addition and scalar multiplication. But we can also multiply functions together, and so we come to the product rule: (fg)' = (f')g + f(g') If f and g are vector-valued functions then this rule holds for both scalar (dot) products and vector (cross) products. We say that any linear transformation that obeys the product rule is a "derivation". So differentiation is a derivation. But there are other things that are also derivations. Let's talk about linear algebra. Let's consider n-by-n matrices. You can add two n-by-n matrices to get another n-by-n matrix, and you can multiply an n-by-n matrix by a constant, so the set of n-by-n matrices forms a vector space. If A and B are matrices, then we can look at AB - BA, which is again an n-by-n matrix. I'm going to write that as [A, B], which is called the "bracket" of A and B, and also as ad_{A}(B). You can check that ad_{A} is a linear transformation on the vector space of n-by-n matrices. Now let's look at three matrices, A, B, and C. BC is an n-by-n matrix, and we can look at ad_{A}(BC) = [A, BC]. This is equal to ABC - BCA.
Subtracting and adding BAC gives ABC - BCA = ABC - BAC + BAC - BCA = [A, B]C + B[A, C] = ad_{A}(B)C + B ad_{A}(C). So ad_{A} is a derivation! What can we say about ad_{A}? Well, it's kind of weird, because it's not really a derivative with respect to anything obvious. For instance, it's not a derivative with respect to A, because we'd expect the derivative of a thing with respect to itself to be 1, but ad_{A}(A) = AA - AA = 0. In fact, if we take 1 to be the identity matrix, then there's no matrix B such that ad_{A}(B) = 1. We also get that ad_{A} obeys the product rule with respect to the bracket product! Check this yourself. We can write it out as [A, [B, C]] = [[A, B], C] + [B, [A, C]], which is called the Jacobi identity. Note that if we replace our matrices by 3-component vectors and the bracket with the cross product, we get that the cross product also obeys the Jacobi identity! So the cross product is also a derivation, and indeed can be written using matrices in the appropriate fashion. What if we have two derivations, ad_{A} and ad_{B}? What happens if we apply first one and then the other? Well, let's find out: ad_{A}(ad_{B}(C)) = ABC - ACB - BCA + CBA while ad_{B}(ad_{A}(C)) = BAC - BCA - ACB + CAB and we see that these are not the same in general. In fact, we have that ad_{A}(ad_{B}(C)) - ad_{B}(ad_{A}(C)) = ad_{[A, B]}(C). More on this later. Consider a vector space R^{n} where we label the variables x_{1}, x_{2},...,x_{n}. We have partial derivatives, d_{1}, d_{2}, etc, i.e. d_{1}(f) is the partial derivative of f in the direction of x_{1}. Define a set of n-by-n matrices T_{i} such that the (j, k) entry of T_{i} is the negative of the (k, j) entry. We define new partial derivatives, D_{1}, D_{2}, etc, such that for the vector valued function f: R^{n} -> R^{n}, D_{i}(f) = d_{i}(f) + T_{i} f. You can check that this obeys the product rule for both the dot product and, if n = 3, the cross product.
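If you'd rather not grind through the algebra by hand, here's a quick Python sketch that checks both the product rule and the composition identity on concrete matrices (plain nested lists as matrices; the helper names mul, add, sub, ad are mine):

```python
def mul(X, Y):
    # matrix product of two square matrices given as nested lists
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def ad(A):
    # ad_A(B) = [A, B] = AB - BA
    return lambda B: sub(mul(A, B), mul(B, A))

A = [[0, 1], [2, 3]]
B = [[1, 1], [0, 1]]
C = [[2, 0], [1, 1]]

# product rule: ad_A(BC) = ad_A(B) C + B ad_A(C)
assert ad(A)(mul(B, C)) == add(mul(ad(A)(B), C), mul(B, ad(A)(C)))

# composition: ad_A ad_B - ad_B ad_A = ad_[A, B]
assert sub(ad(A)(ad(B)(C)), ad(B)(ad(A)(C))) == ad(ad(A)(B))(C)
```

Both assertions hold for any choice of square matrices, not just these; swap in your own A, B, C to convince yourself.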
And we can of course extend this to directional derivatives in the usual fashion; write D_{u}(f) as the derivative of f in the direction u. We're used to the fact that d_{i}(d_{j}(f)) = d_{j}(d_{i}(f)), i.e. that it doesn't matter which order we take derivatives in. But this is not the case for D_{i}, because the matrices T_{i} and T_{j} don't necessarily commute. In fact, this order dependence is one way to mathematically define the notion of curvature. We can say that a space, like a sphere or such, is curved if the natural notion of partial derivative on the space has an order dependence. The exact connection takes a little work, but think of it this way: Take the Earth. Start at the North Pole. Take a vector that points along the Prime Meridian (0 degree longitude). Travel down that Meridian until we reach the equator, keeping the vector always pointing due south. Now travel eastward along the equator until we reach 90 degrees east, keeping the vector always pointing due south. Now travel up the 90 degrees east meridian until we reach the North Pole again. Which way is our vector pointing now? We started with it pointing down the Prime Meridian. Whenever we moved, we always kept the vector parallel to its previous directions. But when we get to the North Pole again, it's now pointing at a right angle to where it was previously pointing! This is mathematical curvature, this spinning of vectors as we drag them around loops despite doing our best to keep them parallel. The trick is that "parallel" isn't an obvious thing on a curved space. So the best we can do is say "doesn't change direction." But how do we measure change? With derivatives! So our funny new partial derivatives give us a different notion of "parallel" than the usual one, and with this notion of "parallel", we get curvature*. In fact, curvature is measured specifically by looking at D_{i}(D_{j} f) - D_{j}(D_{i} f)** and seeing what happens.
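The North Pole trip can actually be checked numerically. Here's a sketch, leaning on the fact that parallel transport along a great-circle arc on the round sphere is realized by the rigid rotation of the ambient space about that arc's axis (the rot and apply helpers are my own):

```python
import math

def rot(axis, theta):
    # rotation matrix about a unit-length axis, via Rodrigues' formula
    x, y, z = axis
    c, s = math.cos(theta), math.sin(theta)
    C = 1 - c
    return [[c + x*x*C,   x*y*C - z*s, x*z*C + y*s],
            [y*x*C + z*s, c + y*y*C,   y*z*C - x*s],
            [z*x*C - y*s, z*y*C + x*s, c + z*z*C]]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

# North Pole at (0, 0, 1); Prime Meridian in the x-z plane.
v = [1.0, 0.0, 0.0]  # at the pole, pointing down the Prime Meridian
v = apply(rot([0, 1, 0], math.pi / 2), v)  # down to the equator at (1, 0, 0)
v = apply(rot([0, 0, 1], math.pi / 2), v)  # east along the equator to (0, 1, 0)
v = apply(rot([1, 0, 0], math.pi / 2), v)  # up the 90 degrees east meridian
# v is now (0, 1, 0) up to rounding: a quarter turn from where it started
```

Each leg keeps the vector tangent and "unchanging" along the path, yet the loop still spins it by 90 degrees, which matches the fraction of the sphere (one eighth) that the loop encloses.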
For 2-dimensional things, like the surface of the Earth, there's only two independent partial derivatives at any point (we aren't allowed to go up) so there's only one object, D_{1}(D_{2} f) - D_{2}(D_{1} f), and so curvature boils down to just a number, but in higher dimensions there's a number for any pair of distinct directions.*** If we replace n by n^{2}, and only consider constant functions, and pick the T_{i} in the appropriate way, we can actually reproduce the matrix derivations using this, and then the ordering issues become a form of curvature!**** A modified version of the product rule gives you the three big theorems of multivariable calculus, Green's theorem, Stokes' theorem, and the Divergence theorem, along with the fundamental theorem of single-variable calculus, as a single statement, as well as a way to extend upward to any number of variables, but the modification is kind of weird so I'm not going to go into it here. *For things that aren't R^{n}, what we do is we put down coordinates on patches that look like R^{n}, and then glue the patches together in an appropriate manner. So for the surface of the Earth, which doesn't support a global rectangular coordinate system, we break the Earth into two patches, say, North hemisphere and South hemisphere, and in the case we looked at, ignore the Southern half. **plus some other terms so that the end result is a linear transformation that doesn't depend on f. ***This is one way the "rubber sheet" analogy of Einsteinian gravity falls apart, because there is no "down" for the sheet to curve into, rather there are six independent quantities needed to describe the curvature. ****This analogy is exact for spaces called "Lie groups", whose patches can naturally be matched to vector spaces of n-by-n matrices in such a way that the natural partial derivative is ad_{A}. The most familiar one is the space of rotation matrices for 3 dimensions, or rather its double cover, the 3-sphere of unit-norm quaternions.
Here the patches look like 3-dimensional space and the bracket can be written as just the cross product on 3-dimensions. The corresponding curvature computation tells us that the curvature for any pair of directions is uniformly 1, which is what we'd expect for a unit sphere.
More things that look like derivatives! Algebraic infinitesimals: Suppose we have a teeny, tiny quantity that we'll call d. Teeeeeny tiny. So small that d^{2} = 0. But wait, you say, doesn't that mean that d = 0? Nope. There are plenty of things that are not 0, but when squared yield 0. For instance, a 2-by-2 matrix with a 1 in the upper right corner and 0s everywhere else. We can't do this in the real numbers, however, and so we don't usually talk about this method. Oh, we'll sometimes say "this quantity squared is negligible" but not actually equal to 0, because we don't want the inexperienced to be wandering around carrying dangerous weaponry like infinitesimals. But anyway, suppose we have such a quantity. And suppose we have, say, the function f(x) = x^{n} for some positive integer n. Then the binomial theorem tells us how to find f(x + d) = (x + d)^{n}. The first term is x^{n}, and the second term is nx^{n-1}d and the third term is (n(n-1)/2) x^{n-2}d^{2} oh wait that's 0, because d^{2} is 0. And the fourth term, which ends in d^{3}, is also 0, because d^{3} = d * d^{2} = d * 0 = 0. And so on. So f(x + d) = x^{n} + nx^{n-1}d. So consider now the quantity (f(x + d) - f(x))/d. For our f, this yields nx^{n-1}, which we recognize as the derivative of f with respect to x! Hey, we've just taken a derivative without using any limits! Except the division by d is a little sketchy, because we can't divide, say, 1 by d. That yields terrible, terrible issues, mostly involving infinities. But if we only look at cases where we already have a factor of d, then we can divide out by d and get a meaningful result. Okay, now suppose that instead of f(x) = x^{n}, we have some polynomial p(x). Then we can again look at p(x + d), and again computing (p(x + d) - p(x))/d yields what we normally think of as the derivative of p with respect to x.
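This is, incidentally, exactly the trick behind forward-mode automatic differentiation, and it's easy to play with in code. A minimal sketch, with a Dual class of my own devising that carries the pair (a, b) standing for a + b*d with d^{2} = 0:

```python
class Dual:
    """a + b*d with d*d = 0; b rides along as the derivative."""
    def __init__(self, a, b=0):
        self.a, self.b = a, b
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._lift(o)
        return Dual(self.a - o.a, self.b - o.b)
    def __mul__(self, o):
        # (a1 + b1 d)(a2 + b2 d) = a1 a2 + (a1 b2 + b1 a2) d, since d^2 = 0
        o = self._lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def deriv(p, x):
    # (p(x + d) - p(x))/d, done safely: just read off the coefficient of d
    return p(Dual(x, 1)).b

# p(x) = x^3 + 3x + 1, so p'(x) = 3x^2 + 3 and p'(2) = 15
assert deriv(lambda x: x * x * x + 3 * x + 1, 2) == 15
```

Note that deriv never actually divides by d; it only reads off the coefficient that's already sitting on d, which is exactly the "only divide when you already have a factor of d" rule above.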
And if we have a series, s(x) then formally writing out s(x + d) and computing (s(x + d) - s(x))/d yields the derivative of s with respect to x. So for any function f that is analytic, i.e. equal to its Taylor series, we can find its derivative by evaluating (f(x + d) - f(x))/d, always keeping in mind that d^{2} = 0. Alternatively, given an analytic function f we can define its derivative by saying that f(x + d) = f(x) + f'(x)d We (by which I mean Newton) used to do calculus this way, but a lot of investigation into infinitesimals led to people disavowing them in general. They don't act like real numbers in a lot of ways; for instance we can't separate them neatly into positive and negative, because if d is positive, then d^{2} should also be positive, since positive times positive should be positive. Except d^{2} = 0, which is not positive. Also, for instance, x^{2} - 2x + 1 = 0 ends up with two more solutions: 1 + d and 1 - d. So there are some definite issues. They're used in what is called "nonstandard analysis" and are enjoying a bit of a resurgence in "synthetic differential geometry" but are still mostly considered too dangerous to use outside of purely algebraic manipulations like in the above. Note that none of what we did here really depended on the real numbers at all since we never took any limits. We can do this kind of thing looking only at the rational numbers, or the integers, or finite fields, or on general commutative rings. It's all purely algebraic. Fourier Analysis: This one takes some doing, so I'm just going to quickly handwave Fourier series and transformations. So if you have a piano and you play a bunch of keys at once, you get a sound, and that sound is usually written as a wave that goes up and down, representing air pressure as a function of time. What the Fourier transformation does is to take that wave, that function of time, and figure out which keys you pressed and how hard you pressed them.
It takes a (function that takes in points in time and spits out air pressure values), and returns a (function that takes in frequency values and returns amplitudes). There is also an inverse Fourier transform that does the opposite: it takes a (function that takes frequencies and returns amplitudes) and returns a (function that takes in points in time and spits out air pressure values). The physical piano implements the inverse Fourier transform, our ears implement the Fourier transform.* Okay, notation-y bit: Suppose we have a function f that is our wave, so f(t) is the air pressure at time t. We write F(f) for the Fourier transform of f, so that F(f)(w) is the amplitude of the frequency w in f. And then we write G for the inverse Fourier transform, so that G(F(f)) = f. So, fun fact about the Fourier transform: suppose we have a function f that has a well-defined derivative. Then we take its Fourier transform and get F(f). Let g(w) be the function w*F(f)(w), w times the Fourier transform of f. Then what is G(g)? It's the derivative of f!** One popular use of this is to solve differential equations. Since derivatives become multiplication by the variable, if we have a linear differential equation D^{n}(f) + aD^{n-1}f + ... + zf = h, where a through z are constants, we take the Fourier transform, which yields w^{n}F(f) + aw^{n-1}F(f) + ... + zF(f) = F(h), and then we get that F(f) = F(h)/(w^{n} + aw^{n-1} + ... + z), so we convert back to get f = G(F(h)/(w^{n}+aw^{n-1}+...+z)) as our solution. So this is another way to get a derivative-looking thing. The reason I say "derivative looking" is that it can be used even for functions that don't have well-defined derivatives. For example, for the function that is 1 on all the rational numbers and 0 on the irrationals, we get a derivative of 0 everywhere, despite the function being very discontinuous. Okay, now for something weird. If we look at g(w) = w^{2}F(f)(w), we get that G(g) is now the second derivative of f. 
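Here's a sketch of the derivative-as-multiplication fact using a toy discrete Fourier transform on sampled data (all the helper code is mine; with the sign conventions below, the multiplier comes out as i times the frequency rather than the bare frequency, which is part of the multiplicative-constant bookkeeping mentioned in the second footnote):

```python
import cmath
import math

def dft(xs):
    # discrete Fourier transform, the slow O(n^2) way, for clarity
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * m / n)
                for m, x in enumerate(xs)) for k in range(n)]

def idft(Xs):
    # inverse transform
    n = len(Xs)
    return [sum(X * cmath.exp(2j * math.pi * k * m / n)
                for k, X in enumerate(Xs)) / n for m in range(n)]

n = 32
ts = [2 * math.pi * m / n for m in range(n)]
f = [math.sin(t) for t in ts]          # f(t) = sin(t), sampled

F = dft(f)
# signed integer frequencies: 0, 1, ..., n/2, then -(n/2 - 1), ..., -1
w = [k if k <= n // 2 else k - n for k in range(n)]
g = [1j * w[k] * F[k] for k in range(n)]   # multiply by (i * frequency)
df = [z.real for z in idft(g)]             # approximates f'(t) = cos(t)
```

So: transform, multiply each amplitude by its frequency (times i), transform back, and out comes the derivative, with no difference quotients anywhere in sight.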
Defining g(w) = F(f)(w)/w gives that G(g) is, as one might expect, an antiderivative of f. So defining g(w) as w^{n}F(f)(w) gives us that G(g) is the nth derivative of f. What about if we look at g(w) = w^{1/2}F(f)(w)? What about g(w) = w^{ln 2}F(f)(w)? Are these, respectively, the (1/2)th derivative and the (ln 2)th derivative of f? Apparently so! These are called "fractional derivatives". They also play a role in solving differential equations, but in a much more subtle way than integer derivatives. So now we have four different generalizations of derivative. We have the matrix one, which I would call the "representation theoretic derivative", we have the partial+linear transformation one, which I would call the "differential geometric derivative", we have the algebraic infinitesimal one, which I would call the "commutative algebraic derivative", and we have the Fourier Analysis one, which I would call the "harmonic analytic derivative". Unlike the previous two, these latter two are actual derivatives, in that on certain types of real-valued functions of one variable they give the actual derivative that you would get in single-variable calculus. But they also can do other stuff. *Each of these implementations is a little lossy, because piano notes don't go on forever, and we have a bounded range of frequencies that we can hear. **Well, there's a multiplicative constant floating around that depends on the exact version of the Fourier transform in use. It's not really all that important.
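As a sanity check that fractional derivatives compose the way they should, here's a sketch using the Riemann-Liouville power-rule formula D^a x^k = Gamma(k+1)/Gamma(k-a+1) x^{k-a}, a different but closely related formulation to the Fourier-multiplier one above, verifying that two half-derivatives of x give the ordinary derivative, namely 1 (the function and variable names are mine):

```python
import math

def frac_deriv_power(k, a):
    """Riemann-Liouville fractional derivative of x^k:
    D^a x^k = Gamma(k + 1) / Gamma(k - a + 1) * x^(k - a).
    Returns the (coefficient, exponent) of the result."""
    return math.gamma(k + 1) / math.gamma(k - a + 1), k - a

# half-derivative of x: works out to (2 / sqrt(pi)) * x^(1/2)
c1, e1 = frac_deriv_power(1, 0.5)
# take the half-derivative again
c2, e2 = frac_deriv_power(e1, 0.5)
# the coefficients multiply to 1 and the exponent drops to 0:
# two half-derivatives of x give 1, i.e. the ordinary derivative of x
```

The Gamma function here is doing for the power rule what it always does: playing the role of a factorial that's willing to accept non-integer arguments.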
psst there's [sup][/sup] and [sub][/sub] bbcode tags for ^{superscript} and _{subscript} respectively. #in case you're sick of the caretspam