yeah, so far we've really only done super basic stuff, we've done up to... Riccati, i think? or however you spell that, my prof says it like "ricotti" but that's a cheese. i'm interested to see where it all goes because as of now it all seems pretty intuitive
The islands of solvability appear large at first, able to sustain enough problems that you turn your back on the vast, uncaring sea around you. ODEs aren't too bad, actually; a frankly surprising amount of stuff works, or at least fails quickly and visibly. It's when you get to PDEs that things start going bonkers.
Expensive Donuts

Okay, so let's talk about elliptic curves. Consider the following equation:

E: y^{2} = x^{3} + ax + b

for some numbers a and b. For the moment we'll be nice and pick a and b such that the three solutions to x^{3} + ax + b = 0 are all distinct, which is equivalent to saying that 4a^{3} + 27b^{2} is nonzero. We'll be further nice and say that we're only working over the real numbers*, although we won't insist that x^{3} + ax + b = 0 have all real solutions (it must have at least one real solution since a and b are real).

The equation E defines a curve, in that there is roughly one free parameter to the set of solutions, which I will call E(R)**. Sometimes E(R) is a single curve on the x-y plane; sometimes it is a curve and a closed loop. In either case, it is symmetric across the x-axis, in that (x, y) is in E(R) if and only if (x, -y) is. Also, for any given value of x there are at most two values of y such that (x, y) is in E(R).

We can look at how straight lines interact with E(R). Generically (a term that I will explain in a moment), if you have a straight line that passes through two distinct points on E(R), then the line must also pass through a third distinct point on E(R) and no more. There are a few cases where this isn't quite correct as stated:

a) If the line is tangent to E(R) at one of the two points, then there is no third distinct point. In this case we'll say that the point where the line is tangent counts for double, so the line has three points total on E(R); it's just that two of them look the same.

b) If the two points have the same x-value, then the two points are (x, y) and (x, -y) for some x and some y. Here the two points need to be treated identically, so instead of saying that one point counts for double, we instead do the usual algebraic geometry trick of adding a point at infinity that we'll call O.

So what about lines that are tangent to E(R)? Here it gets a little tricky.
For most points on E(R), a tangent line to E(R) at that point also passes through E(R) at a distinct point. This now includes the cases where the tangent line is vertical, in which case the distinct point is O. However, there is the possibility that the tangent line does not pass through any more points on E(R), including O. This happens at the inflection points, i.e. the points where, near the point of tangency, E(R) passes from one side of the tangent line to the other, like a driver going from turning left to turning right, or vice-versa. Those points count triple for the tangent line, so again the line has three points on E(R); it's just that all three look the same. We also consider a "tangent line" at O where O counts triple for that line, just for completeness.

So this is the general result: every line either passes through E(R) at zero points, at one point, or at three points, although in the three-point case some of the points might look the same.

Well, that's all fine and dandy, but why should we care? Because we can add! We take two points A and B on E(R), not necessarily distinct, draw the line through them (the tangent line, if A and B are the same point), and say: there is a third point C where that line meets E(R). We then say that A + B = -C, where -C has the same x-coordinate as C but the opposite y-coordinate. We should also say that O = -O, because that's how points at infinity work.

So now we have a notion of addition. It's pretty clearly commutative: A + B = B + A because both sums use the same line. Also pretty easy to show is that A + -A = O and that A + O = A. What isn't so obvious is that if you have three points, then (A + B) + C = A + (B + C). But it's true! So what you end up with is that E(R) with this addition operation is a group!

Okay, but so far that's all been just playing around with geometry. Now let's have some fun. Take a field F where you can divide by 2 and you can divide by 3. Most fields allow for these, unless you're a computer scientist.
The point is, if we pick a and b to lie in F then we can ask for solutions to y^{2} = x^{3} + ax + b where x and y lie in F, and again we get a "curve" E(F). And again any "line" passes through E(F) at zero points, one point, or three points, although again in the three-point case some of the points might look the same. Depending on what F is, we might need to be careful about what we mean by "tangent". But we still get the same interactions between lines and E(F), and that means that we can still define the addition operation and get a group! And if we're really worried we can ditch the geometry altogether and define the addition algebraically via coordinates, making a few extra rules for the special cases.

These are quite different from the groups we usually get from looking at symmetries of things. For instance, they don't really turn into sets of matrices that easily, because despite being commutative they're still kind of complicated.

I promised two things: expensive and donuts.

Expensive: If F is a finite field, then E(F) contains finitely many points, and hence is a finite abelian group. Finite abelian groups are well-understood, in that they all can be written very simply in terms of modular arithmetic, which is so easy that clocks can do it. However, what is not easy is taking the description of E(F) as an elliptic curve and figuring out how to describe it in terms of modular arithmetic. In particular, you want to be able to take a point on the curve and ask "what time is it here?" As far as anyone can tell, this is computationally hard: nobody knows an efficient algorithm for it. And since it seems hard, but is also algebraic, people use it in cryptography, and it allows for similar levels of security using smaller keys than schemes using, say, just finite fields. That "as far as anyone can tell" is important, because that means that someone might be able to do it easily, and if they can, that means that all of that nice cryptographic security goes poof.
So of course crypto people are interested in whether it is actually hard or whether we're just not being clever enough about it.

Also expensive: If you take F to be Q and take a and b to be integers, then we get that E(Q) decomposes as a finite abelian group plus some number of copies of Z. We call that number the rank. We can also look at the same a and b and look at E(F) for various finite fields F. The data of the number of elements of E(F) for all finite fields F can be put together into a complex analytic function L(E, s), which looks like (s-1)^{k}f(s) for some integer k and some nonvanishing function f when s is close to 1. If you can prove that k is equal to the rank then you get a million dollars. Yes, this is the Birch and Swinnerton-Dyer conjecture, one of the Millennium Problems of the Clay Math Institute. I'm sure @Emu can explain where L(E, s) comes from and why arithmetic geometers might care.

Finally, donuts: If you take F to be the complex numbers C, then any complex a and b where 4a^{3} + 27b^{2} is nonzero yield an elliptic curve E(C) that, if you include the point at infinity, is homeomorphic to a torus. It's a torus that sits in 4-(real)-dimensional space and one point is off at infinity, but other than those minor details it's a torus.

*This is the point where algebraic geometers remark that working over the real numbers rather than the complex ones is actually kind of mean.

**By curve we really mean "you start with two free variables and add one nontrivial constraint". We use this definition because we want to be able to define curves without worrying about geometry.
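To make the "addition algebraically via coordinates" idea concrete, here's a minimal sketch of the group law over a small finite field. The curve y^2 = x^3 + 2x + 3 over F_97 is an arbitrary choice of mine for illustration; the slope formulas are the standard secant/tangent formulas, and any p > 3 with 4a^3 + 27b^2 nonzero mod p would do.

```python
# A sketch of elliptic-curve addition over F_p, with O as the point at infinity.
import itertools

p = 97
a, b = 2, 3
assert (4 * a**3 + 27 * b**2) % p != 0  # all three roots distinct

O = None  # the point at infinity

def add(P, Q):
    """Add two points on y^2 = x^3 + ax + b over F_p (O = None)."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O  # vertical line: the third point is the point at infinity
    if P == Q:    # tangent line; pow(..., -1, p) is the mod-p inverse (Python 3.8+)
        m = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:         # secant line through two distinct points
        m = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    y3 = (m * (x1 - x3) - y1) % p  # flip the third intersection across the x-axis
    return (x3, y3)

# Collect every point on the curve by brute force, plus O.
points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - x**3 - a * x - b) % p == 0] + [O]

# Spot-check the not-so-obvious axiom, associativity, on some triples.
for P, Q, R in itertools.product(points[:5], repeat=3):
    assert add(add(P, Q), R) == add(P, add(Q, R))
```

The "what time is it here?" problem is then: given a point P and a multiple Q = P + P + ... + P, recover how many times P was added, which is easy to state and (apparently) hard to do.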
One of the best things about math is that while we've solved a lot of simple problems, and we've solved a lot of complicated problems, there are still lots of simple problems that we haven't solved.

For instance, consider Pascal's triangle, which, like many things, was not originally discovered by Pascal, and was known throughout a lot of the ancient world. The triangle starts out as:

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1

etc. There are a lot of different ways to describe the triangle and to figure out what a given entry in the triangle should be. For instance, you can say that each number in a row is the sum of the entry directly above and the entry diagonally up and to the left of it, except for the 1s on the ends, which we just assume are always there. We can also say that the kth entry in the nth row is the number of ways to choose k-1 objects out of n-1 objects (indexing from 1).

But despite knowing a lot about individual entries, the overall structure still contains some mysteries. David Singmaster made the following conjecture in the 70s:

Singmaster's conjecture: No number bigger than 1 appears more than 12 times in Pascal's triangle.

Simple to state, the conjecture to this day remains unproven. As yet there are no known numbers bigger than 1 that appear more than 8 times, and only one known that appears 8 times: 3003.
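The eight appearances of 3003 can be found by brute force, a sketch of which is below. The key observation making the search finite: every interior entry satisfies C(n, k) >= n for 1 <= k <= n-1, so 3003 can't appear past row 3003.

```python
# Count the appearances of 3003 in Pascal's triangle by brute force.
# Since C(n, k) >= n for 1 <= k <= n - 1, no row past n = 3003 can contain
# 3003, so searching n <= 3003 is enough.  Entries grow in k up to n/2,
# so each row's scan stops as soon as the entries pass the target.
from math import comb

target = 3003
hits = []
for n in range(2, target + 1):
    k = 1
    while k <= n // 2:
        c = comb(n, k)
        if c > target:
            break
        if c == target:
            hits.append((n, k))
            if 2 * k != n:
                hits.append((n, n - k))  # the mirror entry on the right side
        k += 1
```

Running this turns up C(14,6), C(14,8), C(15,5), C(15,10), C(78,2), C(78,76), C(3003,1) and C(3003,3002): eight appearances total.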
I'm just gonna put this here: "Medical researcher discovers integration, gets 75 citations" It's not satire.
A bunch of us found that in grad school. It's the saddest thing, because she didn't even figure out actual integration; she just figured out the trapezoid rule. Also, totally didn't do anything to alleviate the common stereotypes about biologists that tend to float around math departments.
It gets better. Yes, she compared her method to the "true value" she found by using the counting-the-squares-underneath-the-curve method.
Reminder to self to attempt explaining the Riemann hypothesis and why anyone actually cares about it. Also tell the story about puzzles written on blackboards in the tea room.
The Riemann Hypothesis

Probably the most famous currently unsolved problem in mathematics, the Riemann Hypothesis is one of several equivalent statements that, if proven true, will tell us quite a bit all at once. In fact, it's so useful that there are a number of papers that simply start with "Assuming the Riemann Hypothesis" and go from there.

So what is the Riemann Hypothesis? Let's start with a series. Consider the series that starts as

1 + 1/2 + 1/3 + 1/4 + ... = Σ_{n positive integer} 1/n

If you've taken calculus, then you'll know that this series, called the Harmonic Series, goes off to infinity, but does so pretty slowly. We can now define a function h(s) given by

h(s) = 1/1^{s} + 1/2^{s} + 1/3^{s} + 1/4^{s} + ... = Σ_{n positive integer} 1/n^{s}

So for instance, h(1) is just the Harmonic Series, and h(2) is

1 + 1/4 + 1/9 + 1/16 + ... = Σ_{n positive integer} 1/n^{2}

We can show that if we take s to be a real number greater than 1, then h(s) has a finite value, and if we take s equal to 1 or below, then h(s) diverges. Note that even if s is a complex number, as long as the real part of s is greater than 1, h(s) has a finite value.

Unfortunately, h(s) in this form doesn't cover the case that we care about. So instead, we define h(s) by the equation:

(1 - 2/2^{s})h(s) = 1 - 1/2^{s} + 1/3^{s} - 1/4^{s} + ...

Unlike the original expression for h, the right-hand side makes sense when the real part of s is between 0 and 1, which is the important region of the complex plane.

The Riemann Hypothesis (or at least what it's been reduced to at this point) states that if the real part of s is greater than 0 and h(s) = 0, then the real part of s is 1/2.

It's kind of a weird statement as written. But it's important to people who care about prime numbers, because it talks about the growth of primes in the following sense. The Prime Number Counting Function π(n) is the number of integers between 1 and n that are prime.
The Prime Number Theorem tells us that π(n) is approximately n/ln(n). If the Riemann Hypothesis were true, it would tell us a lot about how approximate that approximation is. We have a bound for how bad the approximation can be, but the Riemann Hypothesis would give us a much tighter bound, and because integers are discrete, tight bounds can pin something down exactly: if you know that your integer is between 2 and 4 exclusive, you know that you're looking at 3.

So that's a statement of the Riemann Hypothesis. But that's not a lot, so here are some more ideas: we extended h(s) a little bit, by rewriting it from just being a series to being a series and a multiplying factor. Why is this the correct extension of h(s), and can we extend it further?

Suppose that we have a function f that takes points on the plane and spits out points on a 2-sphere, and suppose that this function is continuous. Then for a line L on the plane, f(L), the set of points to which f maps the points of L, is a curve of some sort. Now suppose we had two lines, L and M, on the plane, and they intersected at a point p. Then the curves f(L) and f(M) intersect at the point f(p) on the sphere. We can compute an angle between L and M at p, and we can compute an angle between f(L) and f(M) at f(p). If for all choices of L, M and intersection points p, the angle between L and M at p equals the angle between f(L) and f(M) at f(p), then we say that f is conformal, which is just a fancy way of saying angle-preserving.

Now let's consider the plane to be the complex plane, and the sphere to be the complex plane plus a point at infinity, i.e. CP^{1}. Then we would say that our function f is meromorphic. The fun thing about conformal/meromorphic functions is that they're quite rigid. If you have a region of the plane with nonzero area and you know what a meromorphic function does on that region, then you know what it does on the entire plane.
This is the notion of analytic continuation: you specify a meromorphic function on a region of the complex plane, and then the analytic continuation of that function is the unique meromorphic function defined on the entire plane that agrees with the original function on the region. The claim is now that h(s) as originally defined is meromorphic on the region Re(s) > 1, and that the extended version of h(s) is still meromorphic. If we take the analytic continuation of h(s) to the entire complex plane, we get the proper Riemann Zeta function, ζ(s).

With some work, we can show that if s is a negative even integer, ζ(s) = 0. But we don't really care about those 0s because they don't actually tell us much, so we call those the trivial 0s of the Riemann Zeta function. So the Riemann Hypothesis only talks about the nontrivial 0s, and contends that the nontrivial 0s all lie on the line Re(s) = 1/2.

So why does the Riemann Zeta function tell us about primes? Well, let's go back to the original form of h(s). A little bit of algebra shows that (assuming that Re(s) > 1)

1/h(s) = (1 - 1/2^{s}) * (1 - 1/3^{s}) * (1 - 1/5^{s}) * (1 - 1/7^{s}) * ... = Π_{p prime} (1 - 1/p^{s})

So we have a product of factors, one for each prime number. As you might expect, this doesn't quite work so well outside the specific region of convergence, but it shows a connection. The other connections are rather deep and I don't claim to understand them.

To be honest, the Riemann Hypothesis is probably the Millennium Problem farthest from my area of mathematical expertise; I'm not all that good at analysis or number theory, and the Riemann Hypothesis lives in the realm of analytic number theory. But at least it isn't the Hodge Conjecture or the Yang-Mills Mass Gap problem, neither of which I think I could properly explain the statements of.
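The identities in this post can at least be sanity-checked numerically in the safe region. A sketch at s = 2, where h(2) = π²/6: the alternating-series relation and the Euler product should both agree with the plain series. The truncation points (200,000 terms, primes up to 10,000) are arbitrary choices of mine.

```python
# Numerical sanity checks of (1 - 2/2^s) h(s) = 1 - 1/2^s + 1/3^s - ...
# and of the Euler product 1/h(s) = prod over primes (1 - 1/p^s), at s = 2.
from math import pi

N = 200_000
h2 = sum(1 / n**2 for n in range(1, N + 1))                 # h(2), truncated
eta2 = sum((-1) ** (n + 1) / n**2 for n in range(1, N + 1)) # alternating series

# The multiplying-factor relation: (1 - 2/2^2) h(2) should equal eta(2).
assert abs((1 - 2 / 2**2) * h2 - eta2) < 1e-4

def primes_up_to(limit):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit**0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i in range(limit + 1) if sieve[i]]

prod = 1.0
for q in primes_up_to(10_000):
    prod *= 1 - 1 / q**2
assert abs(prod - 1 / h2) < 1e-3   # one factor per prime, as in the post

# And h(2) really is pi^2 / 6.
assert abs(h2 - pi**2 / 6) < 1e-4
```

Of course none of this touches the interesting strip 0 < Re(s) < 1; it only confirms that the two rewrites agree with the original series where everything converges comfortably.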
Let's go back to ordinals. We can talk about ω, which we can view as the set {0, 1, 2, 3, ...} with the usual ordering. Note that in terms of cardinality, ω is countable.

We can then define ω + 1 to be the set {0, 1, 2, 3, ..., ω}, where we say that for all natural numbers n, n < ω. Note that this is distinct from 1 + ω, which we would write as {-1, 0, 1, 2, 3, ...}, because in ω + 1 there is an element that's bigger than an infinite number of things, but in 1 + ω, -1 is not bigger than anything, and everything else is bigger than only a finite number of things. In fact, 1 + ω = ω. Also note that in terms of cardinality, ω + 1 is the same size as ω, i.e. countable.

We can then define ω + 2 to be the set {0, 1, 2, 3, ..., ω, ω+1}, and I think you get the idea. Eventually we get to ω + ω, which we define to be the set {0, 1, 2, 3, ..., ω, ω+1, ω+2, ω+3, ...} with the ordering that you'd expect, and we can write that as ω*2. Again, this is distinct from 2*ω, which instead would be {0_{0}, 1_{0}, 0_{1}, 1_{1}, 0_{2}, 1_{2}, 0_{3}, 1_{3}, ...}, and this ends up being the same as ω. But regardless, ω*2 is still countable.

But moving on, we can imagine ω*3, and ω*4, and so on. Eventually we get to ω*ω, which we write as ω^{2}. It's ω copies of ω, with everything in each copy being bigger than everything in the copy before it. And we can think about ω copies of ω^{2}, giving us ω^{3}, and so on. And all of these are countable. And eventually we reach the set ω^{ω}, which is ω copies of ω copies of ω copies of ..., but is still countable. And there's a bunch of ways to think about that.

But even better is a concrete realization of ω^{ω}. So let's go back to the natural numbers. Here we're going to only look at the positive integers, and we're going to screw with the ordering. We start with 1, and then say that the next number is 2, and then next is 4, then 8, then 16, etc. This gives us a copy of ω. Next is 3, then 6, then 12, then 24, etc. This gives us another copy of ω.
We say that if a = 3*2^{j} and b = 2^{k}, then regardless of what j and k are, a is considered greater than b. Then 9, then 18, then 36, then 72, etc. This gives us another copy of ω, and we say that 9*2^{j} > 3*2^{k} > 2^{l}. Continuing this pattern through the powers of 3 gives us ω^{2}.

Now we introduce 5s. We start with 5, 10, 20, 40, ..., 15, 30, 60, 120, ..., 45, 90, 180, 360, ..., and say that 5*3^{j}*2^{k} > 3^{l}*2^{m}. This gives us another ω^{2}. Then adding in 25, 50, 100, 200, ..., 75, 150, 300, 600, ... adds in another ω^{2}, and so on. Continuing this pattern through the powers of 5 gives us ω^{3}.

And so we continue this pattern, adding in larger and larger prime factors. Adding in powers of 7 gives us ω^{4}, powers of 11 gives us ω^{5}, powers of 13 gives us ω^{6}, etc. So once we've gotten in all of the primes, we end up with ω^{ω}. But each element of our setup is just a finite positive integer written in terms of its prime factors. Sure, the ordering is really weird, but as a set the elements just correspond to natural numbers. So here is an explicit realization of ω^{ω} where you can tell exactly which numbers any given natural number is "greater than".

Of course, why stop here? We can make more copies of ω^{ω}, getting ω^{ω}*2, ω^{ω}*3, ..., ω^{ω}*ω = ω^{ω+1}, ..., ω^{ω+2}, ..., ω^{ω*2}, ..., ω^{ω^{2}}, ..., ω^{ω^{ω}}, ..., until we get an infinitely tall stack of ωs (the ordinal called ε_{0}), at which point we should probably change our notation.
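The ordering described above can be written down directly: compare two positive integers by their prime-exponent vectors, with larger primes more significant. Here's a sketch; the helper names are mine.

```python
# A sketch of the omega^omega ordering on positive integers: compare prime
# factorizations with the largest prime deciding first, so every power of 2
# comes before 3, everything 3-smooth comes before 5, and so on.

def factor(n):
    """Return {prime: exponent} for n >= 1, by trial division."""
    f, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            f[d] = f.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

def before(a, b):
    """True if a comes before b in the omega^omega ordering."""
    fa, fb = factor(a), factor(b)
    for q in sorted(set(fa) | set(fb), reverse=True):  # largest prime decides first
        ea, eb = fa.get(q, 0), fb.get(q, 0)
        if ea != eb:
            return ea < eb
    return False  # same factorization, i.e. a == b

# The chains from the post: 1, 2, 4, 8, ... all precede 3, 6, 12, ...,
# which precede 9, 18, ..., and everything 3-smooth precedes 5.
assert before(8, 3) and before(1024, 3)
assert before(24, 9)            # 3 * 2^3 comes before 3^2
assert before(9 * 2**10, 5)     # any 3-smooth number comes before 5
```

So `before(a, b)` answers "which numbers is any given natural number greater than" for this weird ordering, which is exactly the concrete realization described above.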
A quick note on the Euler Polyhedral Formula:

In the post on the Hairy Ball theorem, I noted that you can tell whether you can comb the hair on an n-dimensional sphere by computing the Euler characteristic, and that you can compute the Euler characteristic by building the sphere out of pieces and counting the pieces of each type. That gives us a statement about polyhedra.

Consider a cube. One could consider it a badly-modeled 2-sphere, or one could consider inflating the cube like a balloon until it's spherical. Either way, we can count its 0-, 1-, and 2-dimensional pieces. It has 8 vertices, 12 edges, and 6 faces. So the Euler characteristic is 8 - 12 + 6 = 2. If we do the same counts for the tetrahedron, we get 4 - 6 + 4 = 2, and similarly for the other regular polyhedra: 6 - 12 + 8 = 2 for the octahedron, 20 - 30 + 12 = 2 for the dodecahedron, and 12 - 30 + 20 = 2 for the icosahedron. In fact, any convex polyhedron, regular or not, obeys the same formula. This is because the Euler characteristic is a property of topological spheres, so if you can inflate your polyhedron so that it looks like it was drawn on a sphere, then it obeys the relation V - E + F = 2.

So in particular, let's look at that mainstay of sci-fi personal defense, the glowing hexagon-tiled force-field bubble. The question arises: can we actually approximate a sphere by hexagons, or at least by six-sided shapes? Suppose we have F faces in our tiling. Each face has 6 edges, but each edge is shared by 2 faces, so we end up with E = 6F/2 = 3F edges. We have 6 vertices per face, but each vertex is shared by some number of faces. Let's call D the average number of faces that each vertex is shared between. Then we have that the number of vertices is V = 6F/D. So Euler's formula becomes (6F/D) - 3F + F = 2. Grouping the left side a little better gives 2F(3/D - 1) = 2, or F(3/D - 1) = 1, or in other words 3/D - 1 = 1/F. If we want F to be finite, then we need 3/D - 1 to be positive, i.e. we need 3/D > 1.
But that means that 3 > D, so the average number of faces that each vertex is on has to be less than 3. So at least one vertex is contained in only two hexagons. That should sound weird: the only way you can do that is if the angle of the two edges coming into the vertex is basically 180 degrees, i.e. if the vertex might as well be an edge. But then we have a pair of pentagons, not hexagons. So we can't actually make a full hexagon-tiled force-field bubble: either we can't get a complete bubble, or we need some pentagons or squares or something. If we insist on just hexagons, we could approximate a torus, but I'm not sure why we'd want to.
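Both halves of the argument are easy to check mechanically: the V - E + F counts for the five Platonic solids, and the fact that no vertex-sharing average D of 3 or more leaves room for a positive, finite number of hexagonal faces. A sketch:

```python
# Check V - E + F = 2 for the Platonic solids, and the all-hexagon dead end.

platonic = {                        # name: (V, E, F)
    "tetrahedron":  (4, 6, 4),
    "cube":         (8, 12, 6),
    "octahedron":   (6, 12, 8),
    "dodecahedron": (20, 30, 12),
    "icosahedron":  (12, 30, 20),
}
for name, (V, E, F) in platonic.items():
    assert V - E + F == 2, name

# For an all-hexagon tiling, E = 3F and V = 6F/D, so Euler's formula
# reduces to F * (3/D - 1) = 1.  If every vertex sits on at least 3 faces
# (D >= 3), the left side can never be positive, so no finite F works.
for D in [3, 3.5, 4, 5, 6]:
    assert 3 / D - 1 <= 0
```

Which is why real "hexagon" balls, like soccer balls, smuggle in twelve pentagons.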
Picking Random Numbers

One of my new friends who, at the time, hadn't realized what kind of person I am would ask me for a random number when he wanted to get out of making choices, usually involving food. Each time, I responded with "3".

"3" is a perfectly good random number, at least as good as any other number if we're being uniform. On the face of it, it might seem that by giving "3" all the time, I wasn't being random. Suppose we're looking at numbers between 1 and 5 inclusive. Then the chance of picking "3" is 1/5. This is the chance of picking "3" a single time in response to a single prompt. So if I am prompted twice, the chance of me randomly picking "3" for both prompts is 1/5 * 1/5 = 1/25. For three prompts, the chance of picking "3" for all three prompts is 1/125, and so on. As you can see, for a long sequence of "3"s, the chance of that sequence being the response gets pretty small. So does that mean that I wasn't being random?

Well, consider the two-prompt case. What if instead of replying "3" and "3", I replied with "3" and then "4"? The chance of the first response being a "3" is 1/5, and the chance of the second response being a "4" is again 1/5, so the chance of replying "3" and then "4" is 1/25, the same as if I had responded with a "3" both times. And indeed, the chance of any particular sequence of responses depends only on the length of the sequence, rather than the actual contents. So from that perspective, "3, 3, 3, 3, ..." isn't any less likely from a random source than "3, 4, 1, 5, ...".

But we, as people, would be less surprised by "3, 4, 1, 5, ..." coming from a "random" source. Why is that? I mentioned this in a previous post fairly early on in the thread. But one possible means of characterizing "3, 4, 1, 5, ..." as different from "3, 3, 3, 3, ..." is the entropy of the sequence. In particular, the sample entropy.
Suppose we have a bag containing a bunch of "1"s, a bunch of "2"s, a bunch of "3"s, a bunch of "4"s and a bunch of "5"s, where the bunches are various sizes. In particular, suppose that p_{1} is the proportion of "1"s in the bag, p_{2} is the proportion of "2"s, and so on. We can assign a measure of complexity to the bag via these proportions, called the Shannon entropy. Mathematically, it looks like

H(bag) = -(p_{1}log(p_{1}) + p_{2}log(p_{2}) + p_{3}log(p_{3}) + p_{4}log(p_{4}) + p_{5}log(p_{5}))

It measures how much information one expects to gain from drawing an item out of the bag; if you have the logs be base 2, then this is the expected number of bits needed to describe the draw. A bag containing nothing but "3"s has 0 entropy because there's no variation. A bag containing an even mix of "1"s, "2"s, "3"s, "4"s and "5"s has high entropy.

So suppose that we have a sequence of numbers. We can consider that as a bag of numbers, and then compute the entropy of that bag. A sequence with high entropy, like "3, 4, 1, 5, ...", looks more like a "random" sequence than a sequence with low entropy, like "3, 3, 3, 3, ...". It's not that "3, 4, 1, 5, ..." is any more probable than "3, 3, 3, 3, ...", but we expect "random" sequences to have higher entropy, and indeed, most sequences have high entropy rather than low entropy, so picking a sequence at random out of all sequences will probably give you one with high entropy. But that's quite different from how likely any given sequence is if you're picking the terms one at a time.

Entropy is also used to predict how a sequence will continue based on its previous terms, where it can give a measure of how surprising the actual next term is compared to what the rest of the sequence indicates the next term would be. If your sequence is just "3"s so far, another "3" won't surprise you much, but a "4" would.
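The bag-of-numbers entropy is a few lines of code. A sketch, with base-2 logs so the answer comes out in bits:

```python
# Shannon entropy of the empirical distribution of a sequence, in bits.
from math import log2
from collections import Counter

def sample_entropy(seq):
    """Treat seq as a bag and compute -sum(p * log2(p)) over its symbols."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# All "3"s: zero entropy, no variation.
assert sample_entropy([3, 3, 3, 3]) == 0.0

# An even mix of five symbols: log2(5) ~ 2.32 bits, the maximum possible
# for a five-symbol alphabet.
assert abs(sample_entropy([1, 2, 3, 4, 5]) - log2(5)) < 1e-12
```

So "3, 3, 3, 3" and "3, 4, 1, 5" are equally probable as draws, but the second scores about 2 bits per symbol while the first scores 0, which is the sense in which it "looks more random".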
Shannon entropy and variants thereof have cropped up in a number of places, and are used for things like biodiversity and machine learning and financial modeling and many contexts where there are notions of "information" and "surprise". Anyway, as my friend eventually guessed, yes, I am giving him "3"s all the time because troll, rather than out of actual randomness. He figured that if he asked for a number "between 1 and 5 inclusive but not including 3" then I would answer something other than "3", but I responded with "5, then, since it's the most 3-like of the choices", and that confused him enough that I was still entertained.
A description of what math grad students do all day: https://www.quora.com/What-do-grad-students-in-math-do-all-day/answer/Yasha-Berchenko-Kogan
I don't like the right-hand-rule

The cross product is one of those oddities that you learn in intro linear algebra and that never really makes all that much sense to people. You might have heard of it in terms of the "right-hand rule" if you're doing electromagnetism in a physics class. The cross product takes two 3-dimensional vectors and spits out a third vector, sort of like multiplication but for vectors instead of just numbers. So for instance, 1 unit up cross 1 unit forward gives you 1 unit to the left.

The immediate question, beyond "????", is "why left?" Why not right? Why not backwards or down? Why left? I mean, 1 unit times 1 unit ought to be 1 unit squared, and we'll get to that issue in a moment, but why left? The short answer is some variation of "this is the way it's defined", which is always fairly unsatisfying as an answer. So is there a better reason? Well, a bad reason is that if you take your right hand and point your thumb upward and your index finger forward and then curl the rest of your fingers a bit, those other fingers will point to the left. But that should also be fairly unsatisfying, because why your right hand?

Well, there's at least a decent reason that 1 unit up cross 1 unit forward ought to be either 1 unit left or 1 unit right. Consider three 3-dimensional vectors, u, v and w. We can stick them in a matrix, so that for u = (u_{1}, u_{2}, u_{3}), v = (v_{1}, v_{2}, v_{3}) and w = (w_{1}, w_{2}, w_{3}), we get

[u, v, w] =
| u_{1} v_{1} w_{1} |
| u_{2} v_{2} w_{2} |
| u_{3} v_{3} w_{3} |

Now we can take the determinant of this matrix, to get det([u, v, w]). If we fix u and v to be particular vectors, then we can define a function, f_{uv}, such that f_{uv}(w) = det([u, v, w]). So we have a function that takes in a vector and spits out a number. If we have two vectors, w and w', then we get that f_{uv}(w+w') = f_{uv}(w) + f_{uv}(w'), and if we have a real number c, then f_{uv}(cw) = c f_{uv}(w). So f_{uv} is linear.
Also, using the rules about determinants, we get that f_{uv}(w+u) = f_{uv}(w+v) = f_{uv}(w). So that's our cross product: u cross v = f_{uv}.

But wait, you might object, the cross product is a vector, not a function. To which I shake my head in dismay at the education system, but there is an answer. We have a way to take a linear function and turn it into a vector: the dot product. In particular, for a given linear function f, there is exactly one vector s such that f(w) = s · w for all vectors w. So we define the cross product of u and v to be the vector s_{uv} such that s_{uv} · w = f_{uv}(w).

Now what can we figure out about s_{uv}? Well, one thing is that since f_{uv}(w+u) = f_{uv}(w), we get s_{uv} · (w + u) = s_{uv} · w, so s_{uv} · u = 0. Geometrically, this means that s_{uv} must be perpendicular to u. Similarly, s_{uv} must be perpendicular to v. So that's where the direction comes from: the only directions that are perpendicular to both up and forward are left and right, so the cross product of 1 unit up and 1 unit forward must point either left or right. Which one is a matter of convention, in the sense that picking a side of the road to drive on is a matter of convention: it doesn't really mean much which convention you pick, as long as everyone else uses the same convention.

But wait, you might object: the determinant tells us which direction. Again, head shaking. The determinant tells us that the cross product of (1, 0, 0) and (0, 1, 0) is (0, 0, 1). But which directions are those? Is (1, 0, 0) up? If so, why? Is (0, 1, 0) forward? That doesn't look right. And once you've picked (1, 0, 0) and (0, 1, 0) to be up and forward respectively, should (0, 0, 1) point left or right? Again, why? There's no good reason, there's only convention.

Okay, so now some discussion of cross-producty things: there are two fundamental ingredients to define the cross product: the determinant, and the dot product.
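The construction above is short enough to run: build f_{uv}(w) = det([u, v, w]), then recover the vector s_{uv} by evaluating f_{uv} on the basis vectors, since s_{uv} · e_{i} is just the i-th coordinate of s_{uv}. A sketch:

```python
# The cross product as "determinant, then dot product": s_uv is the unique
# vector with s_uv . w = det([u, v, w]) for all w.

def det3(u, v, w):
    """Determinant of the 3x3 matrix with columns u, v, w."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
          - u[1] * (v[0] * w[2] - v[2] * w[0])
          + u[2] * (v[0] * w[1] - v[1] * w[0]))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(u, v):
    """Recover s_uv coordinate by coordinate: s_uv . e_i = det([u, v, e_i])."""
    basis = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    return tuple(det3(u, v, e) for e in basis)

u, v = (1, 0, 0), (0, 1, 0)
s = cross(u, v)
assert s == (0, 0, 1)                       # the usual convention falls out of det's sign
assert dot(s, u) == 0 and dot(s, v) == 0    # perpendicular to both inputs

# And s_uv . w really is det([u, v, w]) for an arbitrary w:
w = (2, 5, -1)
assert dot(s, w) == det3(u, v, w)
```

Note that the code happily reports (0, 0, 1), but it has no opinion on whether that direction is "left" or "right"; that mapping from coordinates to physical directions is exactly the convention being complained about.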
Geometrically, the determinant corresponds to volume: if you define a parallelepiped with one corner at the origin and u, v and w as the edges coming out of that corner, the determinant gives the volume of that parallelepiped. But it's a signed volume: det([u, v, w]) = -det([v, u, w]); parallelepipeds don't care which way you order their edges, but the determinant does. Mathematicians call this an orientation: this ordering of the vectors gives a positively-oriented parallelepiped, that ordering gives a negatively-oriented parallelepiped. (Note: orientation is related to how a Mobius strip only has one side; more on this later, maybe.) Anyway, this orientation can be imposed on an abstract R^{3} via the determinant, fine. But it doesn't exist in the real world. That is to say, it may be possible to pick an ordering that can apply to the entire universe, but again that's convention.

The other ingredient is the dot product, which in more geometric terms is what allows us to measure lengths and angles. Mathematicians call this a metric. It takes two vectors and spits out a number, and we use this number to assign lengths to vectors: the length of u is the square root of u · u.

Now suppose we went up to four dimensions. We still have a dot product here, and we still have a determinant, but the determinant now needs four 4-dimensional vectors, not three. So if you feed two vectors into the determinant, you still have two slots left, i.e. f_{uv} now needs to be fed two vectors in order to spit out a number. So we can't define a cross product! Or at least, not in the naive fashion. What we can do is define a triple product, f_{uvw}, such that f_{uvw}(x) is a number, and now we can find a vector s_{uvw} such that s_{uvw} · x = f_{uvw}(x).

But wait, you might object, what about the quaternions? They're 4-dimensional and have a product. And indeed they do, but it doesn't have all that much to do with the determinant or the dot product in 4 dimensions.
In fact, it's more closely related to the cross product in 3 dimensions. Namely, if we write our quaternions as (p, u) and (q, v), where p and q are real numbers and u and v are 3-dimensional vectors, we can define a product (p, u)(q, v) = (pq - u · v, pv + qu + s_{uv}). The problem is really just that the real part of the quaternions doesn't act vector-y; only the imaginary parts do, and since there are 3 imaginary dimensions in the quaternions, the relevant vector space is 3-dimensional, not 4.

Now if you're like me, you're more interested in other things that can be said about the cross product. Namely, that the cross product is a Lie bracket. In fact, it's one of the nicer Lie brackets: it's the Lie bracket for so(3) = su(2), which means that it's the Lie bracket for a simple Lie algebra. Also, in that sense, the cross product generates rotations in R^{3}. So the cross product is actually kind of a cool object, more so than the dot product in my opinion. But I still don't like the right-hand rule.
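That product formula is easy to check against the usual quaternion relations; here's a small numpy sketch (quat_mul is just an illustrative name for the formula above, not any library's API):

```python
import numpy as np

def quat_mul(p, u, q, v):
    """(p, u)(q, v) = (pq - u.v, pv + qu + u x v), with u, v in R^3."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return p * q - u @ v, p * v + q * u + np.cross(u, v)

# Write i, j, k as (0, e1), (0, e2), (0, e3):
i = (0.0, np.array([1.0, 0.0, 0.0]))
j = (0.0, np.array([0.0, 1.0, 0.0]))
k = (0.0, np.array([0.0, 0.0, 1.0]))

r, vec = quat_mul(*i, *j)
assert r == 0.0 and np.allclose(vec, k[1])   # ij = k, from the cross product term

r, vec = quat_mul(*i, *i)
assert r == -1.0 and np.allclose(vec, 0.0)   # i^2 = -1, from the -u.v term
```

Note how the real part only ever shows up through the dot product term and ordinary scaling; all the genuinely 3-dimensional structure lives in the imaginary parts, which is the point made above.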
Hey, @Exohedron have you seen 3Blue1Brown's Essence of Linear Algebra series? He uses this explanation in Chapter 8 Part 2. (He uses some concepts that he explains in other videos, so if anyone else wants to see it, you should first watch the videos he points you to.)
I don't like the right-hand rule because most of the time it's used to cover up the fact that it is a convention and not something deeper. The mapping of vectors in R^{3} to directions in physical space is arbitrary anyway, but by imposing the right-hand rule, you're imposing a particular convention and then brushing the fact that it is a convention under the rug. As a mnemonic the right-hand rule is fine, and I do use it because I would never remember the convention otherwise. But it's no more than a mnemonic for an arbitrary convention, and I really wish this were made clearer. Of course, if it weren't a convention then we could deduce it from principles and thus wouldn't need a mnemonic.
Also I guess I have a bias against the insistence that everything has to be a vector. Cf. my ranting in the math advice thread about tensor indices.
A quote about Stokes' theorem, from Michael Spivak's Calculus on Manifolds: "Stokes' theorem shares three important attributes with many fully evolved major theorems: (1) It is trivial. (2) It is trivial because the terms appearing in it have been properly defined. (3) It has significant consequences."

I feel like this, especially the second point, summarizes a good bit of what it means for a piece of mathematics to be understood. It is trivial because the terms appearing in it have been properly defined. In other words, we have figured out all of the surrounding bits and all of the pieces that make up the theorem well enough that, from the viewpoint we use to define those bits and pieces, we realize that the theorem is just putting them all together and slapping a name on it; the theorem is nothing that we didn't already know.

Once you've gotten your head around the parts of Stokes' theorem, the notions of differential forms and boundary maps, then of course Stokes' theorem is true; the terms are defined in such a way that Stokes' theorem both has to be true and is obviously true. Stokes' theorem is trivial, in the right way to be trivial: not meaningless or valueless as a statement, but obvious and inevitable as a proof. This is not to say that Stokes' theorem is obvious or inevitable when you first see it; if you don't understand the parts, then Stokes' theorem might even seem false. But the parts are properly defined so that once you understand them, once you unravel those definitions, Stokes' theorem is an immediate consequence. And yet unlike many trivial things, Stokes' theorem also provides further understanding of other statements that may be harder to understand, that may be less trivial except in the light of Stokes' theorem.

I suppose the point of this is to say that when a mathematician says that something is trivial, that is not a bad thing. It's not even really a value judgment. A lot of the best things in mathematics are trivial, and are good because they are trivial, but trivial in the sense that they're inevitable and immediate, which is itself a form of elegance.
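For reference, the statement under discussion, in the differential-forms language that makes it trivial, is the one-line equation:

```latex
% Generalized Stokes' theorem: M a compact oriented smooth n-manifold with
% boundary \partial M, and \omega a smooth (n-1)-form on M.
\int_{\partial M} \omega = \int_{M} d\omega
```

All of the content is hidden in the definitions of the exterior derivative d, the boundary operator, and integration of forms; once those are set up, the equation follows almost immediately, which is exactly the sense of "trivial" being discussed.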
It's a sign that we've found the right framework, the right paradigm, the right viewpoint, because mathematics is a quest for understanding, and being able to say that something is trivial really just means that we understand it so well that we can say that.