Today I learned that if you don't know about primes but you do know about even and odd, then you can prove that the square root of 5 is irrational. Similarly for all non-square integers below 17; at 17 you need something more. See this Twitter thread by Andrés E. Caicedo for details.
I want to start telling people that there exists a thing that might be called "non-deterministic linear algebra" and then point them at Sheldon Axler.
Slightly off of pure mathematics, here is a paper on the longest possible chess game. It's kind of great.
Homomorphic Encryption

Suppose, purely hypothetically, you and a bunch of other people need to jointly make a decision between some number of choices that will affect all of you. Obviously, you should be the one making the decision, but the rest of them don't trust your judgment for some reason. Ludicrous, of course, but bear with me. Fortunately, you're pretty sure that most of the group would agree with your choice, so you suggest taking a vote. But the other guys are untrustworthy and might try to cause trouble, for instance by threatening to punish those who vote the "wrong way". So we want to hide how people voted.

The modern answer to that problem is encryption. Unfortunately, most forms of encryption deliberately mangle structure, so that Enc(1) + Enc(1) is not equal to Enc(2). A lot of the time we kind of want this, because if you send someone a message saying "Account 1 pays X to account 2", you don't want someone to be able to add two of those messages together to get "Account 1 pays 2X to account 2", or "Account 1 pays 1000X to account 2". But that's not the only way to do things. Sometimes you want Enc(1) + Enc(1) = Enc(2). Like if you want to be able to tally votes without actually seeing what the individual votes are.

How do we do that? Well, let's look at the El-Gamal method as an example. Suppose we only want to know whether the voters approve a particular proposition. We start with a big cyclic group G and a particular element g. The person running the election picks a number s and computes K = g^s, and they tell everyone K but don't tell anyone s. Now each voter decides whether they like the proposition, picks a random number r, and computes

Enc_K(m, r) = (g^r, g^m K^r)

where m is 1 if they like the proposition and 0 if they don't. And they submit that to the person running the election.

Now there are two useful properties.

First, if you have Enc_K(a, r1) = (g^{r1}, g^a K^{r1}) and Enc_K(b, r2) = (g^{r2}, g^b K^{r2}), then multiplying them together component-wise gives

(g^{r1+r2}, g^{a+b} K^{r1+r2}) = Enc_K(a+b, r1+r2)

So you can combine two votes and get the sum, i.e. we have a homomorphism, sort-of. The random numbers r provide a bit of interference, but that's taken care of by property 2.

Second, the person running the election can take a ciphertext (A, B) and compute

Dec_s(A, B) = B / A^s

which, if A = g^r and B = g^m K^r, gives g^m K^r / g^{rs} = g^m g^{sr} / g^{rs} = g^m, which gets rid of the randomness from r. Now if m is constrained to a small enough range, i.e. if there aren't too many voters, then the person running the election can just test to see which value of m gives g^m = Dec_s(A, B).

We call this homomorphic encryption: we have an encryption scheme (Enc, Dec) such that the decryption operation Dec is a homomorphism. Note that Enc isn't really the homomorphism we care about, because of the randomness.

Why the randomness? Well, suppose we didn't have it. Then each submission of Enc_K(1) would look exactly the same. Then once the votes were tallied, people would know the total votes for the proposition, and could say "well, everyone either submitted a vote encrypted as A or a vote encrypted as B, and there were m votes that were encrypted as A, so those must correspond to votes for the proposition." Which defeats the purpose of encrypting things in the first place. So instead we have the randomness, which means that even if two votes are both for or both against the proposition, their encryptions look different.
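If you want to play with this, here's a tiny Python sketch of the additively homomorphic ("exponential") El-Gamal tally described above. The parameters, variable names, and the brute-force search for the tally at the end are all my own toy choices, deliberately small and completely insecure; a real system would use a large prime-order group and properly generated randomness.

    # Toy sketch of additively homomorphic ("exponential") El-Gamal vote tallying.
    # Tiny, insecure parameters chosen purely for illustration.
    import random

    p = 1_000_003        # a prime; we work in the nonzero residues mod p
    g = 5                # a fixed group element

    def keygen():
        s = random.randrange(2, p - 1)     # secret key
        K = pow(g, s, p)                   # public key K = g^s
        return s, K

    def encrypt(K, m):
        """Encrypt a vote m (0 or 1) as (g^r, g^m * K^r)."""
        r = random.randrange(2, p - 1)
        return (pow(g, r, p), (pow(g, m, p) * pow(K, r, p)) % p)

    def combine(c1, c2):
        """Componentwise product: Enc(a) times Enc(b) 'is' Enc(a + b)."""
        return ((c1[0] * c2[0]) % p, (c1[1] * c2[1]) % p)

    def decrypt_tally(s, c, max_votes):
        """Recover m from (A, B) by computing B / A^s = g^m and searching small m."""
        A, B = c
        g_m = (B * pow(pow(A, s, p), -1, p)) % p   # B times the inverse of A^s
        for m in range(max_votes + 1):
            if pow(g, m, p) == g_m:
                return m
        raise ValueError("tally out of expected range")

    s, K = keygen()
    votes = [1, 0, 1, 1, 0, 1]                    # the individual (secret) votes
    tally = encrypt(K, votes[0])
    for v in votes[1:]:
        tally = combine(tally, encrypt(K, v))     # tally under the encryption
    print(decrypt_tally(s, tally, len(votes)))    # prints 4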
This property, that encryptions of identical votes still look unrelated, is called semantic security: people can't tell how you voted by comparing your vote to other people's votes; they have to actually be able to decrypt in order to find out.

There are a few directions to go from here. One of them is how to prevent people from cheating, either by voters submitting fraudulent votes (for instance, if you send Enc_K(10000000, r), then you violate the "one man, one vote" principle), or by the system changing the votes during transmission, or by the person running the election choosing to decrypt individual votes before tallying. There are a number of proposals for dealing with these issues, mostly involving zero-knowledge proofs, basically all Schnorr protocols or variants thereof.

Another direction is asking, "okay, we can add underneath encryption. Can we multiply as well? Can we do general computation?" And the answer is yes, but in a ridiculous fashion.

Let's change the scenario a bit. Suppose you have a bunch of secret data that you want to do a computation on, but you don't have the compute power. You could dump it all on a server hosted by a book distribution company, in some hypothetical world where a book distribution company has significant compute power for sale, but you don't trust said book distribution company with the data. So instead you want to encrypt the data, send the encryptions to the company, have them do the computations underneath the encryption, and then send the encrypted results back to you, which you can then decrypt. Other than preventing a book distribution company from harvesting your data, this also has applications to, say, medical studies with privacy issues, where the researchers don't want their subjects' medical information to be given directly to those perfidious statisticians, but they would like to find out what the statistical models say.

Unfortunately, the El-Gamal scheme I showed you doesn't really like encrypted multiplication as much as it likes encrypted addition, and that makes sense because it involves exponentials. In general, trying to make an encryption scheme that allows for both addition and multiplication underneath the encryption is pretty hard. Instead, you can do some fun stuff with lattice encryption, which gives you homomorphisms, sort-of. Lattice encryption involves noise terms that can make decryption fail if the noise gets too large. A fresh encryption is fine: if you have a message m, then Dec(Enc(m)) will always yield m. The problem is that as you do encrypted operations, the noise builds up until it gets to the point where you can't reliably decrypt anymore. We say that we have somewhat homomorphic encryption, since we can only do a small number of operations between decryptions. So what you do when you think you might be reaching that point is you decrypt what you have, getting an exact answer, and then encrypt again, getting a new encryption with only a small amount of noise.

But wait, you say, you don't want to decrypt in the middle; that lets the book distribution company see what you're doing. That's a good point. So instead we're going to reverse the order of operations. Imagine encryption like wrapping something in an envelope, and decryption like removing the envelope to get to the message inside. If we have a message in an envelope, we don't want to remove the envelope, because then people will see the message. So what we do is we wrap the enveloped message inside a second envelope, and then remove the inner envelope. Just, think about that image for a moment.
One of the reasons I like cryptography is that it does some absolutely ludicrous things, and this is one of them. So suppose that, after some computation, the book distribution company has C1, which is encrypted with a public key K1 that has a corresponding secret key s1, and if C1 were decrypted with key s1 it would yield m, but it's got a lot of noise. Now what we do is take a second secret key s2 and public key K2, encrypt s1 using K2 to get S, and send that to the book distribution company. The book distribution company takes C1 and encrypts that using K2 to get C'. Then they homomorphically run the decryption operation on C' using the "key" S to undo the encryption by K1, all of this underneath the encryption by K2, which yields C2, which decrypts to m if you use key s2. And now we've got our message in a nice, shiny new envelope, without ever revealing it to the book distribution company.

Note that the book distribution company never sees s1 or s2; all they see is the encryption of s1, so they can't actually decrypt C1 to get m, they can only homomorphically decrypt an encryption of C1 to get an encryption of m. I'm omitting a lot of details necessary to ensure that C2 actually has a smaller amount of noise than C1, but whatever. This process is called, appropriately, bootstrapping. This is some Cantor-style nonsense, encrypting the encryption scheme inside itself in a way that successfully runs.
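To make the "noise builds up until you can't decrypt" story from a couple of paragraphs back concrete, here's a toy symmetric somewhat-homomorphic scheme over the integers, loosely in the spirit of the van Dijk-Gentry-Halevi-Vaikuntanathan "FHE over the integers" construction rather than actual lattice encryption. All parameters and names are mine, the sizes are laughably insecure, and there's no bootstrapping here; it just shows addition and multiplication working under the encryption until the noise passes the threshold.

    # Toy somewhat-homomorphic scheme over the integers, to illustrate noise growth.
    # Insecure demo parameters only.
    import random

    P = 10_000_019          # the secret key: a large odd integer (toy-sized here)

    def encrypt(bit, noise_bound=50):
        q = random.randrange(1, 1_000)
        r = random.randrange(-noise_bound, noise_bound)
        return P * q + 2 * r + bit      # a multiple of P, plus small noise, plus the bit

    def centered_noise(c):
        n = c % P
        return n if n < P // 2 else n - P   # recovers the "2r + bit" part

    def decrypt(c):
        return centered_noise(c) % 2        # correct while |2r + bit| stays below P/2

    a, b = encrypt(1), encrypt(1)
    print(decrypt(a + b))   # 0: addition under encryption, (1 + 1) mod 2
    print(decrypt(a * b))   # 1: multiplication under encryption, 1 * 1

    c, depth = encrypt(1), 0
    while decrypt(c) == 1:                  # keep multiplying by fresh encryptions of 1...
        c = c * encrypt(1)                  # ...which roughly multiplies the noise each time
        depth += 1
    print("decryption broke after", depth, "multiplications")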
On which large cardinal axioms imply which other large cardinal axioms. Note the guy at the top. Technically not a declaration that a particularly large number exists, but rather just noting that as you include more and more axioms for the existence of large cardinal numbers, the amount of stuff you can prove increases, and at some point you end up being able to prove a contradiction.
Checking Two Birds with One Stone

Suppose you have two functions f(x) and g(x), and you're looking at the points where f(x) is 0, and the points where g(x) is 0. It's not difficult to make a function h(x) that is 0 at the points where f(x) is 0 OR where g(x) is 0. Specifically, we can set

h(x) = f(x)g(x)

and you can see that if either f(x) is 0 or g(x) is 0 then h(x) is 0. Finding a function that is 0 on exactly the points where both f(x) and g(x) are simultaneously 0 can be a bit trickier. If we're working over the real numbers then we can set

h(x) = f(x)^2 + g(x)^2

since f(x)^2 and g(x)^2 are both nonnegative, so if either is positive then the sum has to be positive. If we're over the complex numbers, however, then f(x)^2 might be positive, or negative, or imaginary, or indeed anywhere on the complex plane. So we have an interesting problem here. There are a number of ways to investigate this issue, but I'm going to go probabilistic, partially because the other method I know of involves algebraic geometry and that's an arbitrarily deep rabbit hole.

Let's consider f(x) and g(x) again. Suppose that for some x, f(x) + g(x) is 0, and that f(x) + 2g(x) is also 0. Then we get that f(x) and g(x) must both be 0. Indeed, if we have that f(x) + a g(x) = 0 and f(x) + b g(x) = 0 for distinct values of a and b, then we get that both f(x) and g(x) must be 0. But we still want one function h(x). So let's introduce a point y, picked at random. Then we set

h(x) = f(x) + y g(x)

Suppose that g(x) is nonzero. Then we get that h(x) is 0 if and only if y = -f(x)/g(x). So there's a chance that h(x) ends up being 0 without either of f(x) or g(x) being 0, which is kind of bad for what we want. But! What is the chance that we picked that specific point for y? Well, that depends on what set we're picking y from. But we can make the probability small by picking from a large set.

Okay, let's generalize a bit. Suppose that we have a bunch of values p1, p2, ..., and we want to check if they're all simultaneously equal to 0 by checking a single number. Again, if we know that the values are all real, we could simply square each one, add the squares together, and check that the sum of squares is 0. But if the values are complex then this doesn't work. Instead, we're going to build a polynomial. We take some points y1, ..., yn, each randomly chosen, and we multiply each of our p_i by a different monomial y1^{d1,i} y2^{d2,i} ... yn^{dn,i}, and we add these together to get some value

P(y1, ..., yn) = Σ_i p_i y1^{d1,i} y2^{d2,i} ... yn^{dn,i}

Then P is 0 with low probability unless all of the p_i are 0. What is the probability? Well, let d be the largest total degree of the monomials in P, i.e. the largest value that d1,i + d2,i + ... + dn,i takes. And suppose that when we're picking our values for the y_i, we pick them from some fixed, finite set S. Then the Schwartz-Zippel lemma states that

Pr[P(y1, ..., yn) = 0] ≤ d/|S|

i.e. the probability is bounded by the total degree divided by the size of the set we're picking points from. Note that this doesn't depend on us working over the real or complex numbers; we could be doing this in any field, for instance over finite fields. Also note that this probability doesn't depend on the number of variables. Let's look at two specific cases: If we set n = 1, then we get that the probability that a polynomial takes the value 0 at a randomly chosen point y is at most the degree of the polynomial divided by the size of the set S we're picking points from.
We can check this on its own: any polynomial of degree d has at most d distinct roots, and so if S contains all d of those roots then the chance of picking one of those roots randomly from S is exactly d/|S|, and if S contains fewer of those roots then picking randomly from S has an even smaller probability of landing on a root. If we instead set d = 1, we get that the probability that a linear combination of random points from S takes the value 0 is at most 1/|S|, unless the coefficients of the linear combination are all 0. To see this, pick a term with a nonzero coefficient and condition on the values of all the other points: as the remaining point ranges over S, that term takes |S| distinct values, at most one of which cancels out the rest of the sum, so the probability of cancellation is at most 1/|S|. So if we can have S be large compared to d, then we can say that our values p_i are probably all 0 if we get that P(y1, ..., yn) is 0 for the y_i randomly chosen from S. That allows us to check a bunch of conditions simultaneously by checking a single value, at the expense of the check being probabilistic.
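Here's a minimal sketch of the d = 1 case in Python, over the field of integers mod a prime; the prime, the function name, and the test values are all just my own choices for the demo.

    # Checking that a list of field elements are all zero with one random
    # linear combination: the d = 1 case of Schwartz-Zippel, mod a prime q.
    import random

    q = 2**31 - 1        # a prime, so the integers mod q form a field

    def all_zero_probabilistic(values, trials=1):
        # If some value is nonzero, a single trial returns False except with
        # probability at most 1/q; True is only probabilistic evidence.
        for _ in range(trials):
            coeffs = [random.randrange(q) for _ in values]
            if sum(v * y for v, y in zip(values, coeffs)) % q != 0:
                return False
        return True

    print(all_zero_probabilistic([0, 0, 0, 0]))    # True, always
    print(all_zero_probabilistic([0, 0, 7, 0]))    # False, except with probability ~1/q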
Eigenfunctions of the Fourier Transform

On my blog I like to complain that about once a year or so I'm forced to learn something new about the Fourier transform due to my research. Here's something that I learned recently.

The Fourier transform is often described as a way to take something that is a function of time, f(t), and replace it with something that is a function of frequency, F(ω). The easiest way to illustrate this, in my opinion, is to talk about playing a piano: if you have the sound produced by someone playing the piano, the Fourier transform tells you what keys they were playing. But! There are actually a bunch of different ways to interpret what the Fourier transform is doing. For instance, if you have a differential equation, you can use the Fourier transform to try to solve it, based on the fact that given a function f(t), the Fourier transform of f'(t) is (a constant times) ωF(ω), and thus you can sometimes turn a differential equation into a polynomial equation.

For concreteness, I'm going to define the Fourier transform as

F(ω) = ∫ f(t) exp(2πiωt) dt

where the integral is over the entire real line. Why this version? Because this version has the nice property that applying the transform twice gets you a function that looks like f(-t). Also I don't have to divide by any square roots.

But what if you don't distinguish between time and frequency? What if you just let the Fourier transform be some linear transformation that sends functions to functions? It turns out that if you apply this version of the Fourier transform four times, you get back to the original function, so its eigenvalues are all among 1, -1, i and -i. One of its eigenvectors is somewhat recognizable: the Gaussian distribution G(x) = exp(-πx^2) is an eigenvector with eigenvalue 1! Moreover, you can build a whole basis of eigenfunctions that look like P(x)G(x) for polynomials P, with the eigenvalues cycling through the four possibilities as the degree of P goes up. We can furthermore pick particular polynomials that are nice in the sense of being orthogonal.

What do we mean by "orthogonal" here? Well, for vectors we usually mean that

Σ_i v_i w_i = 0

but here we don't have a finite number of coordinates; we need to compare two polynomials at all points on the real line. So instead we put in a weight function and write that two polynomials P and Q are orthogonal if

∫ P(x)Q(x)G(x) dx = 0

So the statement is that if we take derivatives of G(x) and divide by G(x), we get a bunch of orthogonal polynomials! This gives us the Hermite polynomials*, one of the so-called "classical orthogonal polynomial sequences".

More recently I wondered how this looks for the Discrete Fourier transform. Since the Discrete Fourier transform only requires a finite number of values, it's often written as a matrix transformation on a finite-dimensional vector space, implemented via a matrix

[ 1   1    1    1    ... ]
[ 1   ζ    ζ^2  ζ^3  ... ]
[ 1   ζ^2  ζ^4  ζ^6  ... ]
[ 1   ζ^3  ζ^6  ζ^9  ... ]
[ ... ...  ...  ...  ... ]

where ζ is a complex n-th root of unity, although to get the proper normalization we have to multiply the above matrix by 1/√n. Now we can ask about eigenvalues and eigenvectors of this matrix. Again we get that the eigenvalues are all 1, -1, i and -i. Moreover, the number of times each appears is well understood:

If n = 4m, then 1 appears m+1 times, -1 and i appear m times, and -i appears m-1 times.**
If n = 4m+1, then 1 appears m+1 times, and -1, i and -i all appear m times.
If n = 4m+2, then 1 and -1 appear m+1 times, and i and -i appear m times.
If n = 4m+3, then 1, -1 and i appear m+1 times, and -i appears m times.
However, it turns out that the eigenvectors aren't as nice as in the continuous Fourier case. There are various eigenvectors for eigenvalue 1, but we can't take derivatives anymore because we're in the discrete case, and so there's no longer a "canonical" set of basis eigenvectors. There are, however, a lot of people putting forth proposals for what to use as the basis eigenvectors, using various criteria for what makes one set best, or at least good.

* Note: there are at least two standard conventions for how to define the so-called Hermite polynomials, giving two different sets of polynomials, and I'm using yet another convention and getting yet another set of polynomials. Too bad, physicists and probabilists; I do algebraic representation theory.

** Technically you might need to swap i and -i here depending on what ζ is, but the point is that the two imaginary eigenvalues don't always align in terms of their multiplicities.
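If you want to see the multiplicity pattern for yourself, here's a quick numerical sanity check in Python with numpy; the matrix convention and the names are mine, and as per the footnote you may see i and -i swapped depending on which root of unity you use.

    # Numerical check of the eigenvalue multiplicities of the normalized DFT matrix.
    import numpy as np
    from collections import Counter

    def dft_eigenvalue_counts(n):
        zeta = np.exp(2j * np.pi / n)
        j, k = np.meshgrid(np.arange(n), np.arange(n))
        W = zeta ** (j * k) / np.sqrt(n)        # the matrix [zeta^(jk)] times 1/sqrt(n)
        eigs = np.linalg.eigvals(W)
        # Each eigenvalue should be numerically close to one of 1, -1, i, -i.
        roots = np.array([1, -1, 1j, -1j])
        nearest = [roots[np.argmin(np.abs(e - roots))] for e in eigs]
        return Counter(nearest)

    for n in range(4, 12):
        print(n, dft_eigenvalue_counts(n))      # compare with the list above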
Does that mean that the Gaussian is the simplest nontrivial function that is its own Fourier transform? Is that what makes the normal distribution normal?
It's one of the many nice properties of the Gaussian. I hesitate to say that it is the reason that the normal distribution is normal; for instance, a probabilist would point to the Central Limit Theorem as the reason to care about Gaussians, and I have no idea if there is a connection. For a quantum physicist, however, looking at a quadratic potential, we have

-(d/dx)^2 φ(x) + x^2 φ(x) = E φ(x)

and taking the Fourier transform turns the (d/dx) into x and the x into (d/dx), so we get basically the same equation for the Fourier transform of φ, and so we would ask for φ(x) to be its own Fourier transform, hence why we end up with Gaussian wavefunctions.
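As a small sanity check of the claim above that exp(-πx^2) is its own Fourier transform under the convention F(ω) = ∫ f(t) exp(2πiωt) dt, here's a crude numerical integration in Python; the grid and the sample frequencies are arbitrary choices of mine.

    # Numerically check that G(x) = exp(-pi x^2) is (approximately) its own
    # Fourier transform under F(w) = integral of f(t) exp(2*pi*i*w*t) dt.
    import numpy as np

    t = np.linspace(-10, 10, 20001)
    dt = t[1] - t[0]
    f = np.exp(-np.pi * t**2)

    for w in [0.0, 0.5, 1.0, 1.7]:
        F_w = np.sum(f * np.exp(2j * np.pi * w * t)) * dt   # crude Riemann sum
        print(w, F_w.real, np.exp(-np.pi * w**2))           # the two columns should match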
Two computationally-bounded programs that can read each other's source code play the one-off Prisoner's Dilemma. What happens?
I'm just here to complain a moment, all in good humor, because category theory is one of the most entertainingly exasperating topics I've ever tried to read about. I just wanted to know what it was, and then I was in too deep and could not be saved. I've never been so puzzled about diagram notation when the diagram is one fucking square. The relationship between these elements seems to make complete sense until they go on the square, then it's mayhem. There's something very funny about wailing over the meaning of an exclamation point with an arrow that is not dashed. You should have seen my face when I navigated down a comically long series of Wolfram Alpha links to one that finally explains wtf a fiber is, only to discover it's every x that gives the same y value? Why didn't you just say so!!! It was the magnanimous smile of someone who has decided to assume the set of all damn good reasons for this runaround is not empty, because otherwise it simply does not track that there aren't a lot more fist fights in higher math. I've never been in a fist fight, but in light of everything I've learned so far, it seems like a proportionate response. I'm guessing it has to do with how it would be most natural or useful to conceptualize a thing within a particular context, but it's pretty sus is all I'm saying. At least "fiber space" is an amazing name. I don't need to know any of this. I'm not planning to use it for anything. I don't even know what it is or what it can do, so how *could* I want to use it for anything? I'm just so used to being able to coast well enough by looking up any unfamiliar notation and playing it by ear that I'm having an absolute blast getting mad about something where that really isn't gonna cut it. I got so fed up I decided the obvious thing to do was learn about Gödel numbering in the wee hours of Saturday night? I dunno, at least those made the kind of sense I could understand.
So the problem with category theory is that they don't want to talk about points, they want to talk about maps. Or rather, they want to talk about morphisms, which are like maps in that they have a domain and a codomain and can be composed, but otherwise don't necessarily have anything in common. A fiber is a thing whose morphisms satisfy some properties. What is a fiber itself? Who cares, its morphisms have these properties. If we only care about the properties of the morphisms, we can greatly expand the notion of "fiber" to other contexts. We want to be able to talk about "fibers" in contexts of morphisms whose preimages don't live in our category because they need more points, or fewer points, or different points. We want to be able to talk about "fibers" in the contexts of morphisms between things that don't have any points, or don't have enough points, or have too many points. We want to be able to talk about "fibers" in the contexts of morphisms where "preimage" doesn't compose well, i.e. where taking the preimage under one morphism and then under another isn't exactly the same as taking the preimage under the composite. We want to be able to talk about "fibers" in the contexts of morphisms that definitely aren't "maps" in any sense, or at least be able to ask about fibers, even if the fibers don't exist.

Like, consider a category where the objects are natural numbers and we have a morphism from x to y if x divides y and nothing otherwise. This is a category! Those aren't maps! Numbers don't have points! But because the notion of "fiber" doesn't need maps, or points, we can ask if the morphism from 4 to 8 has fibers, and if so, what those fibers are. The answer is no (I think); this category doesn't have fibers in general, but we can at least ask the question!

Category theory is meta-programming. It's generics, macros, traits, what-have-you. "Fiber" isn't a type of thing, it's a collection of attributes and methods looking for a user-supplied type parameter that fits a number of requirements that in turn are only given in terms of attributes and methods. Implementation details are for set theorists. I don't actually know if you need any of it for functional programming, but all the stuff about monads and such comes from category theory. This is the mathematical language that underlies higher-order programming, and the theory behind (strong) type systems.
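For the programmers: here's a loose sketch of that analogy in Python, using a Protocol as the "trait". It's just my own illustration of the generics/traits comparison, not actual category theory; the names Composable, Shift, and Affine are made up for the example.

    # "Defined only by what its morphisms do": a Protocol names required
    # operations and says nothing about what the thing actually is.
    from typing import Protocol

    class Composable(Protocol):
        def compose(self, other: "Composable") -> "Composable": ...

    def square(f: Composable) -> Composable:
        # Generic code: works for anything satisfying the protocol, and it can
        # only use the declared operation; implementation details are invisible.
        return f.compose(f)

    class Shift:
        """Morphism-like thing #1: shifting integers by a fixed amount."""
        def __init__(self, amount: int) -> None:
            self.amount = amount
        def compose(self, other: "Shift") -> "Shift":
            return Shift(self.amount + other.amount)

    class Affine:
        """Morphism-like thing #2: an affine map x -> a*x + b."""
        def __init__(self, a: float, b: float) -> None:
            self.a, self.b = a, b
        def compose(self, other: "Affine") -> "Affine":
            # (self . other)(x) = self(other(x))
            return Affine(self.a * other.a, self.a * other.b + self.b)

    print(square(Shift(3)).amount)      # 6
    sq = square(Affine(2.0, 1.0))
    print(sq.a, sq.b)                   # 4.0 3.0, i.e. x -> 4x + 3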
Speaking of, one of my friends in grad school was a probabilist, and didn't get along with this algebraic geometer. One time during a colloquium the probabilist was saying that he didn't need to know category theory, and the algebraic geometer just pipes up from across the room "a random variable is just a pushforward!" and my probabilist friend got hilariously upset. Another fun moment was when I was taking real analysis in undergrad and the professor, who had no business teaching undergrads but whatever, was asked about the chain rule. He just stared at us for a moment as if we were the dumbest group of babies he had ever seen, replied "it's a functor" and moved on without another word on the topic.