Tag Archives: data science

538 – best charts of 2022

There is nothing in 538’s best charts of 2022 that truly bowled me over. I mean, there are some graphics and maps that are effective at telling a story about their underlying data. There just aren’t any types of charts or applications of old types of charts that were a big surprise to me and that I thought I would want to copy if I could. Just purely for personal interest in the subject matter, the one I found most interesting was the map showing how college football conferences are losing all geographic meaning. I find myself slowly being less interested in college football with each passing year, and this is one reason why. My team’s losing campaign, loss to the NFL or “transfer portal” of many of their best players, blowout of the junior varsity squad in the mid-December bowl game they were lucky to even be selected for, and lackluster recruiting class are other reasons.

measuring inflation is hard

Measuring inflation is hard for a variety of reasons, and it gets even harder when you try to compare across countries and regions. Some of the reasons include methodological choices in averaging, weighting, how housing and transportation are accounted for, how urban and rural consumers are included, and many others. There is a measure called the Harmonized Index of Consumer Prices (HICP) that is used to try to compare across countries and regions. This differs from the U.S. CPI in a variety of ways.
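To make the weighting point concrete, here is a minimal sketch of how the same price changes can produce different headline inflation numbers depending on the basket weights. The categories, price changes, and weights are all invented for illustration and are not taken from HICP or the U.S. CPI.

```python
# Hypothetical price relatives (new price / old price) for three
# spending categories. All numbers are invented for illustration.
price_relatives = {"housing": 1.06, "transportation": 1.10, "food": 1.03}

# Two hypothetical sets of expenditure weights (each sums to 1).
weights_a = {"housing": 0.40, "transportation": 0.15, "food": 0.45}
weights_b = {"housing": 0.25, "transportation": 0.30, "food": 0.45}


def weighted_inflation(relatives, weights):
    """Return the percent change in a weighted arithmetic-mean price index."""
    total_weight = sum(weights.values())
    index = sum(relatives[k] * weights[k] for k in relatives) / total_weight
    return (index - 1.0) * 100.0


inflation_a = weighted_inflation(price_relatives, weights_a)
inflation_b = weighted_inflation(price_relatives, weights_b)
print(f"Basket A: {inflation_a:.2f}%")
print(f"Basket B: {inflation_b:.2f}%")
```

Nothing about the prices differs between the two baskets. Only the weights do, and that alone moves the result by more than half a percentage point.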

College Football? There’s an API for that

I’ve always wondered if there is a public source of college football stats to play with, and there is (at least one) called the College Football Database. There’s also an R package that taps it.

Of course, don’t think for a second that you can crunch these numbers and make money through gambling. Only large “professional gamblers” can consistently make money through gambling, by (legally, as I understand it, at least in certain states) cornering the market by manipulating betting spreads. The idea there is that you can bet a large amount of money on the underdog in a contest that is not getting a lot of attention, which will move the spread in favor of the underdog. You can then bet an even larger amount of money on the favorite. If you are able to manipulate the odds in your favor, you will lose this bet less than half the time, and over time you will make money off the backs of us poor schmucks who take bets with expected values less than what we put in. Don’t try this – there are smarter, richer people than you doing it and you can’t beat them. Also, don’t take my word for it that it would be legal. Finally, think of making small, occasional, close-to-even-money bets as a source of cheap entertainment and you’ll be okay, and then only if you do not have a tendency to become addicted.
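For the "expected values less than what we put in" part, a rough sketch of the arithmetic may help. The terms below (risk 110 to win 100, with a true 50% chance of winning) are an assumption chosen for illustration, not a quote from any particular sportsbook.

```python
def expected_profit(win_prob, stake, payout):
    """Expected profit of one bet: win `payout` with probability
    `win_prob`, otherwise lose `stake`."""
    return win_prob * payout - (1.0 - win_prob) * stake


def break_even_probability(stake, payout):
    """Win probability at which the expected profit is exactly zero."""
    return stake / (stake + payout)


# Assumed terms: risk 110 to win 100, and a true 50% chance of winning.
stake, payout = 110.0, 100.0
ev = expected_profit(0.5, stake, payout)
needed = break_even_probability(stake, payout)
print(f"Expected profit per bet: {ev:.2f}")
print(f"Win rate needed to break even: {needed:.1%}")
```

Under these assumed terms a coin-flip bettor loses a little on average with every bet, and has to win noticeably more than half the time just to break even.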

An API, by the way, is an Application Programming Interface.

In contrast to a user interface, which connects a computer to a person, an application programming interface connects computers or pieces of software to each other. It is not intended to be used directly by a person (the end user) other than a computer programmer who is incorporating it into software. An API is often made up of different parts which act as tools or services that are available to the programmer. A program or a programmer that uses one of these parts is said to call that portion of the API. The calls that make up the API are also known as subroutines, methods, requests, or endpoints. An API specification defines these calls, meaning that it explains how to use or implement them.

Wikipedia
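To illustrate what "calling" an endpoint looks like in practice, here is a minimal sketch using only Python's standard library. The base URL, endpoint name, parameter names, API key, and response fields are all hypothetical placeholders. They are not the real College Football Database API, whose documentation you would need to consult. The request is built but deliberately not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: the base URL, endpoint path, parameter names, and key
# are placeholders, not the real College Football Database API.
BASE_URL = "https://api.example.com"
API_KEY = "my-secret-key"


def build_request(endpoint, **params):
    """Build (but do not send) an HTTP GET request for one API call."""
    url = f"{BASE_URL}/{endpoint}?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {API_KEY}"})


def parse_games(body):
    """Turn a JSON response body into (home, away) team pairs."""
    return [(g["home"], g["away"]) for g in json.loads(body)]


request = build_request("games", year=2021, week=1)
print(request.full_url)

# A made-up response body standing in for what a server might return.
sample_body = '[{"home": "Team A", "away": "Team B"}]'
print(parse_games(sample_body))
```

Actually sending the request would be a call to `urllib.request.urlopen(request)`. The point of the sketch is only that the program, not a person, assembles the call and reads the structured answer.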

2021: Year in Review

As per usual, I’ll list out and link to the stories I chose as the most frightening, most hopeful, and most interesting each month in 2021. Then I’ll see if I have anything smart to say about how it all fits together.

Survey of the Year’s Stories and Themes

Most frightening and/or depressing stories:

  • JANUARY: A China-Taiwan military conflict is a potential start-of-World-War-III scenario. This could happen today, or this year, or never. Let’s hope for the latter. This is a near-term existential risk, but I have to break my own “rule of one” and give honorable mention to two longer-term scary things: crashing sperm counts and the climate change/fascism/genocide nexus.
  • FEBRUARY: For people who just don’t care that much about plants and animals, the elevator pitch on climate change is it is coming for our houses and it is coming for our food and water.
  • MARCH: In the U.S. upper Midwest (I don’t know if this region is better or worse than the country as a whole, or why they picked it), electric blackouts average 92 minutes per year, versus 4 minutes per year in Japan.
  • APRIL: One of the National Intelligence Council’s scenarios for 2040 involves “far-reaching changes designed to address climate change, resource depletion, and poverty following a global food catastrophe caused by climate events and environmental degradation”.
  • MAY: The Colorado River basin is drying out.
  • JUNE: For every 2 people who died of Covid-19 in the U.S. about 1 additional person died of indirect effects, such as our lack of a functioning health care system and safe streets compared to virtually all our peer countries.
  • JULY: The western-U.S. megadrought looks like it is settling in for the long haul.
  • AUGUST: The U.S. is not prepared for megadisasters. Pandemics, just to cite one example. War and climate change tipping points, just to cite two others. Solutions or at least risk mitigation measures exist, such as getting a health care system, joining the worldwide effort to deal with carbon emissions, and as for war, how about just try to avoid it?
  • SEPTEMBER: The most frightening climate change tipping points may not be the ones we hear the most about in the media (at least in my case, I was most aware of melting ice sheets in Greenland and Antarctica, collapse of ocean circulation patterns). The most damaging may be melting permafrost on land and methane hydrates underwater, both of which contain enormous amounts of methane which could set off a catastrophic and unstoppable feedback loop if released in large quantities.
  • OCTOBER: The technology (sometimes called “gain of function”) to make something like Covid-19 or something much worse in a laboratory clearly exists right now, and barriers to doing that are much lower than other types of weapons. Also, because I just couldn’t choose this month, asteroids can sneak up on us.
  • NOVEMBER: Freakonomics podcast explained that there is a strong connection between cars and violence in the United States. Because cars kill and injure people on a massive scale, they led to an expansion of police power. Police and ordinary citizens started coming into contact much more often than they had. We have no national ID system so the poor and disadvantaged often have no ID when they get stopped. Everyone has guns and everyone is jumpy. Known solutions (safe street design) and near term solutions (computer-controlled vehicles?) exist, but are we going to pursue them as a society? I guess I am feeling frightened and/or depressed today, hence my choice of category here.
  • DECEMBER: Mass migration driven by climate change-triggered disasters could be the emerging big issue for 2022 and beyond. Geopolitical instability is a likely result, not to mention enormous human suffering.

Most hopeful stories:

  • JANUARY: Computer modeling, done well, can inform decisions better than data analysis alone. An obvious statement? Well, maybe to some but it is disputed every day by others, especially staff at some government regulatory agencies I interact with.
  • FEBRUARY: It is possible that mRNA technology could cure or prevent herpes, malaria, flu, sickle cell anemia, cancer, HIV, Zika and Ebola (and obviously coronavirus). With flu and coronavirus, it may become possible to design a single shot that would protect against thousands of strains. It could also be used for nefarious purposes, and to protect against that are ideas about what a biological threat surveillance system could look like.
  • MARCH: I officially released my infrastructure plan for America, a few weeks before Joe Biden released his. None of the Sunday morning talk shows has called me to discuss so far. Unfortunately, I do not have the resources of the U.S. Treasury or Federal Reserve available to me. Of course, neither does he unless he can convince Congress to go along with at least some portion of his plans. Looking at his proposal, I think he is proposing to direct the fire hoses at the right fires (children, education, research, water, the electric grid and electric vehicles, maintenance of highways and roads, housing, and ecosystems). There is still no real planning involved, because planning needs to be done in between crises and it never is. Still, I think it is a good proposal that will pay off economically while helping real people, and I hope a substantial portion of it survives.
  • APRIL: Giant tortoises reach a state of “negligible senescence” where they simply don’t age for a long time. Humans are distant relatives of giant tortoises, so maybe we can aspire to this some day. They are not invulnerable to injury and disease.
  • MAY: An effective vaccine for malaria may be on the way. Malaria kills more children in Africa every year than Covid-19 killed people of all ages in Africa during the worst year of the pandemic. And malaria has been killing children every year for centuries and will continue long after Covid-19 is gone unless something is done.
  • JUNE: Masks, ventilation, and filtration work pretty well to prevent Covid transmission in schools. We should learn something from this and start designing much healthier schools and offices going forward. Design good ventilation and filtration into all buildings with lots of people in them. We will be healthier all the time and readier for the next pandemic. Then masks can be slapped on as a last layer of defense. Enough with the plexiglass, it’s just stupid and it’s time for it to go.
  • JULY: A new Lyme disease vaccine may be on the horizon (if you’re a human – if you are a dog, talk to your owner about getting the approved vaccine today.) I admit, I had to stretch a bit to find a positive story this month.
  • AUGUST: The Nordic welfare model works by providing excellent benefits to the middle class, which builds the public and political support to collect sufficient taxes to provide the benefits, and so on in a virtuous cycle. This is not a hopeful story for the U.S., where wealthy and powerful interests easily break the cycle with anti-tax propaganda, which ensures benefits are underfunded, inadequate, available only to the poor, and resented by middle class taxpayers.
  • SEPTEMBER: Space-based solar power could finally be in our realistic near-term future. I would probably put this in the “interesting” rather than “hopeful” category most months, but I really struggled to come up with a hopeful story this month. I am at least a tiny bit hopeful this could be the “killer app” that gets humanity over the “dirty and scarce” energy hump once and for all, and lets us move on to the next layer of problems.
  • OCTOBER: The situation with fish and overfishing is actually much better than I thought.
  • NOVEMBER: Urban areas may have some ecological value after all.
  • DECEMBER: Covid-19 seems to be “disappearing” in Japan, or at least was before the Omicron wave. Maybe lessons could be learned. It seems possible that East Asian people have at least some genetic defenses over what other ethnic groups have, but I would put my money on tight border screening and an excellent public health care system. Okay, now I’m starting to feel a bit depressed again, sitting here in the U.S. where we can’t have these nice things thanks to our ignorant politicians.

Most interesting stories, that were not particularly frightening or hopeful, or perhaps were a mixture of both:

  • JANUARY: There have been fabulous advances in note taking techniques! Well, not really, but there are some time honored techniques out there that could be new and beneficial for many people to learn, and I think this is an underappreciated productivity and innovation skill that could benefit people in a lot of areas, not just students.
  • FEBRUARY: At least one serious scientist is arguing that Oumuamua was only the tip of an iceberg of extraterrestrial objects we should expect to see going forward.
  • MARCH: One study says 1-2 days per week is a sweet spot for working from home in terms of a positive economic contribution at the national scale. I think it is about right psychologically for many people too. However, this was a very theoretical simulation, and other studies attempting to measure this at the individual or firm scale have come up with a 20-50% loss in productivity. I think the jury is still out on this one, but I know from personal experience that people need to interact and communicate regularly for teams to be productive, and some people require more supervision than others, and I don’t think technology is a perfect substitute for doing these things in person so far.
  • APRIL: Hydrogen fuel cells may finally be arriving. Not so much in the U.S., where we can’t have nice things.
  • MAY: I learned about Lawrence Kohlberg, who had some ideas on the use of moral dilemmas in education.
  • JUNE: The big U.S. government UFO report was a dud. But what’s interesting about it is that we all seem to have quietly accepted that something is going on, even if we have no idea what it is, and this is new.
  • JULY: “Cliodynamics” is an attempt at a structured, evidence-based way to test hypotheses about history.
  • AUGUST: Ectogenesis is an idea for colonizing other planets that involves freezing embryos and putting them on a spaceship along with robots to thaw them out and raise them. Fungi could also be very useful in space, providing food, medicine, and building materials.
  • SEPTEMBER: Philip K. Dick was not only a prolific science fiction author, he also developed a comprehensive theory of religion which could possibly even be the right one. Also, possibly related but not really, if there are aliens out there they might live in creepy colonies or super-organisms like ants or termites.
  • OCTOBER: I thought about how to accelerate scientific progress: “[F]irst a round of automated numerical/computational experiments on a huge number of permutations, then a round of automated physical experiments on a subset of promising alternatives, then rounds of human-guided and/or human-performed experiments on additional subsets until you hone in on a new solution… [C]ommit resources and brains to making additional passes through the dustbin of rejected results periodically…” and finally “educating the next generation of brains now so they are online 20 years from now when you need them to take over.” Easy, right?
  • NOVEMBER: Peter Turchin continues his project to empirically test history. In this article, he says the evidence points to innovation in military technologies being driven by “world population size, connectivity between geographical areas of innovation and adoption, and critical enabling technological advances, such as iron metallurgy and horse riding”. What does not drive innovation? “state-level factors such as polity population, territorial size, or governance sophistication”. As far as the technologies coming down the pike in 2022, one “horizon scan” has identified “satellite megaconstellations, deep sea mining, floating photovoltaics, long-distance wireless energy, and ammonia as a fuel source”.
  • DECEMBER: Time reminded us of all the industries Elon Musk has disrupted so far: human-controlled, internal-combustion-fueled automobiles; spaceflight; infrastructure construction (I don’t know that he has really achieved any paradigm shifts here, but not for lack of trying), “artificial intelligence, neurotechnology, payment systems and cryptocurrency.” I’m not sure I follow a couple of these, but I think they missed satellites.

Continuing Signs of U.S. Relative Decline

Signs of U.S. decline relative to our peer group of advanced nations are all around us. I don’t know that we are in absolute decline, but I think we are now below average among the most advanced countries in the world. We are not investing in the infrastructure needed in a modern economy just to reduce friction and let the economy function. The annual length of electric blackouts in the U.S. (hours) compared to leading peers like Japan (minutes) is just one telling indicator. In March, I looked at the Build Back Better proposal and concluded that it was more like directing a firehose of money at a range of problems than an actual plan, but I hoped at least some of it would happen. My rather low but not zero expectations were met, as some limited funding was provided for “hard infrastructure” and energy/emissions projects, but little or nothing (so far, as I write this) to address our systemic failures in health care, child care, or education. The crazy violence on our streets, both gun-related and motor vehicle-related, is another indicator. Known solutions to all these problems exist and are being implemented to various extents by peer countries. Meanwhile our toxic politics and general ignorance continue to hold us back. Biden really gave it his best shot – but if this is our “once in a generation” attempt, we are headed down a road where we will no longer qualify as a member of the pack of elite countries, let alone its leader.

The Climate Change, Drought, Food, Natural Disaster, Migration and Geopolitical Instability Nexus

2021 was a pretty bad year for storms, fires, floods, and droughts. All these things affect our homes, our infrastructure, our food supply, and our water supply. Drought in particular can trigger mass migration. Mass migration can be a disaster for human rights and human dignity in and of itself, and managing it effectively is difficult even for well-intentioned governments. But an insidious related problem is that migration pressure can tend to fuel right wing populist and racist political movements. We see this happening all over the world, and the situation seems likely to get worse.

Tipping Points and other Really Bad Things We Aren’t Prepared For

We can be thankful that nothing really big and new and bad happened in 2021. My apologies to anyone reading this who lost someone or had a tough year. Of course, plenty of bad things happened to good people, and plenty of bad things happened on a regional or local scale. But while Covid-19 ground on and plenty of local and regional-scale natural disasters and conflicts occurred, there were no new planetary-scale disasters. This is good because humanity has had enough trouble dealing with Covid-19, and another major disaster hitting at the same time could be the one that brings our civilization to the breaking point.

So we have a trend of food insecurity and migration pressure creeping up on us over time, and we are not handling it well even given time to do so. Maybe we can hope that some adjustments will be made there to get the world on a sustainable track. Even if we do that, there are some really bad things that could happen suddenly. Catastrophic war is an obvious one. A truly catastrophic pandemic is another (as opposed to the moderately disastrous pandemic we have just gone through.) Creeping loss of human fertility is one that is not getting much attention, but this seems like an existential risk if it were to cross some threshold where suddenly the global population starts to drop quickly and we can’t do anything about it. Asteroids were one thing I really thought we didn’t have to worry much about on the time scale of any human alive today, but I may have been wrong about that. And finally, the most horrifying risk to me in the list above is the idea of an accelerating, runaway feedback loop of methane release from thawing permafrost or underwater methane hydrates.

We are almost certainly not managing these risks. These risks are probably not 100% avoidable, but since they are existential we should be actively working to minimize the chance of them happening, preparing to respond in real time, and preparing to recover afterward if they happen. Covid-19 was a dress rehearsal for dealing with a big global risk event, and humanity mostly failed to prepare or respond effectively. We are lucky it was one we should be able to recover from as long as we get some time before the next body blow. We not only need to prepare for much, much worse events that could happen, we need to match our preparations to the likelihood of more than one of them happening at the same time or in quick succession.

Technological Progress

Enough doom and gloom. We humans are here, alive, and many of us are physically comfortable and have much more leisure time than our ancestors. Our social, economic, and technological systems seem to be muddling through from day to day for the time being. We have intelligence, science, creativity, and problem solving abilities available to us if we choose to make use of them. Let’s see what’s going on with technology.

Biotechnology: The new mRNA technology accelerated by the pandemic opens up potential cures for a range of diseases. We need an effective biological surveillance system akin to nuclear weapons inspections (which we also need) to make sure it is not misused (oops, doom and gloom trying to creep in, but there are some ideas for this.) We have vaccines on the horizon for diseases that have been plaguing us for decades or longer, like malaria and Lyme disease. Malaria kills more children in Africa, year in and year out, than Covid-19 killed people of all ages there during the worst year of the pandemic.

Promising energy technologies: Space based solar power may finally be getting closer to reality. Ditto for hydrogen fuel cells in vehicles, although not particularly in the U.S. (I’m not sure this is preferable to electric vehicles for everyday transportation, but it seems like a cleaner alternative to diesel and jet fuel when large amounts of power are needed in trucking, construction, and aviation, for example.)

Other technologies: We are actually using technology to catch fish in more sustainable ways, and to grow fish on farms in more sustainable ways. We are getting better at looking for extraterrestrial objects, and the more we look, the more of them we expect to see (this one is exciting and scary at the same time). We are putting satellites in orbit on an unprecedented scale. We have computers, robots, artificial intelligence of a sort, and approaches to use them to potentially accelerate scientific advancements going forward.

The State of Earth’s Ecosystems

The state and trends of the Earth’s ecosystems continue to be concerning. Climate change continues to churn through the public consciousness and our political systems, and painful as the process is I think our civilization is slowly coming to a consensus that something is happening and something needs to be done about it (decades after we should have been able to do this based on the evidence and knowledge available.) When it comes to our ecosystems, however, I think we are in the very early stages of this process. This is something I would like to focus on in this blog in the coming year. My work and family life are busy, and I have decided to take on an additional challenge of becoming a student again for the first time in the 21st century, but somehow I will persevere. If you are reading this shortly after I write it in January 2022, here’s to good luck and prosperity in the new year!

Pension funds should never rely on correlation

Pension funds should not rely on correlations between mean annual return and variance in annual return when deciding how much stocks and bonds to own, according to this article on which Nassim Nicholas Taleb (the Black Swan guy) is the second author. To paraphrase/oversimplify my understanding of the article greatly, the main arguments are that (1) data from the past is not a perfect predictor of the future, and (2) short term volatility is not a good measure of the risk of achieving a long term goal.

In engineering, I hear #1 all the time from people – why don’t we rely on data instead of “modeling” when trying to predict the future? Of course we do both – try to understand the underlying structure of the system we are dealing with, then use data from the past to try to confirm that we got it right, at least for the conditions that prevailed when the data were collected (and assuming the data themselves are reasonably accurate or at least any measurement error is not biased one way or the other), and then use the resulting model of the system to try to predict the future. Conditions in the future may be different than conditions in the past, and that is why we don’t “just rely on data”. If external conditions are different but the underlying structure of the system doesn’t change (much), we can come up with reasonable predictions of the future. The only true test of whether the prediction is right comes from data which will be collected in the future, but is not available today when a decision has to be made. A lot of decisions are really just playing the odds about what might work in the most likely future, or what might work across several different possible futures that collectively are very likely (a “robust” decision). The decision that is best for the single most likely condition and a group of very likely conditions may not be the same one – now you are a gambler trying to decide whether you go for the biggest possible payoff while accepting a larger chance of a loss, or whether you want to maximize your chances of a positive payoff while giving up your shot at a really big payoff. You would think the pension fund would go for the latter.
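The difference between the decision that is best for the single most likely future and the more "robust" decision can be shown with a toy example. The scenarios, probabilities, payoffs, and strategy names below are all invented, and this is only a sketch of the idea, not anything taken from the article.

```python
# Hypothetical futures and their assumed probabilities (sum to 1).
scenarios = {"boom": 0.40, "steady": 0.35, "slump": 0.25}

# Hypothetical payoffs of two strategies in each future.
payoffs = {
    "aggressive": {"boom": 30, "steady": -2, "slump": -20},
    "robust": {"boom": 8, "steady": 5, "slump": -1},
}


def best_for_most_likely(scenarios, payoffs):
    """Pick the strategy with the top payoff in the single likeliest future."""
    likeliest = max(scenarios, key=scenarios.get)
    return max(payoffs, key=lambda s: payoffs[s][likeliest])


def chance_of_gain(scenarios, payoffs, strategy):
    """Total probability of the futures in which the strategy gains."""
    return sum(p for name, p in scenarios.items() if payoffs[strategy][name] > 0)


def most_robust(scenarios, payoffs):
    """Pick the strategy with the highest chance of a positive payoff."""
    return max(payoffs, key=lambda s: chance_of_gain(scenarios, payoffs, s))


print("Best for most likely future:", best_for_most_likely(scenarios, payoffs))
print("Most robust:", most_robust(scenarios, payoffs))
```

With these made-up numbers the two rules disagree. The aggressive strategy wins in the likeliest future but gains only 40% of the time, while the robust one gains 75% of the time and gives up the big payoff.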

#2 makes sense to me. Variability in annual returns doesn’t matter much if you are 25 and investing money you plan to need at 65. A pension fund is a little different, because it is essentially immortal but has obligations it has to meet each year.

In the case of investment returns, the approach seems to be almost purely “data-driven” with no real understanding of the underlying system, and this leads to an existential crisis when people try to figure out what asset allocation advice to stake their future on. We understand the real economy to some extent, we think, but we don’t really seem to confidently understand how the real economy and the financial economy are related, especially over shorter time frames. So we are reduced to just describing the data, which might lead to some insights about the system but has limited predictive value. Still, examining the evidence before making a decision seems like a good idea to me. What is the alternative – guessing, wishing, praying?

those wild, wacky Covid-19 data points

I have noticed for a while that the CDC’s Covid-19 data doesn’t agree with other sources, which don’t agree with each other. Looking at my home city (and County) of Philadelphia, the CDC’s numbers have been consistently higher for many months. This matters because government agencies, employers (including mine), and individuals are basing decisions on these numbers, often the CDC numbers.

Let’s look at today’s numbers for Philadelphia. I’ll look just at “confirmed cases” because that seems to be the most readily available and frequently updated by all sources, although really I think we should be focused more on deaths at this point, because deaths (although morbid) give you some information on cases and vaccination/immunity combined. In other words, if cases are high but deaths are low, you would have an annoyance but not a major problem. Nonetheless, let’s look at those cases for Philadelphia today! I’m writing this on Sunday, November 21, 2021. I’m using the links from my coronavirus tracker post.

  • CDC: 111.55 / 100,000 population / 7 days (data from November 13-19)
  • Pennsylvania state health department: 86.4 / 100,000 population / 7 days (data from November 12-18)
  • Covid Act Now: 116.2 / 100,000 population / 7 days (data from November 20 which they describe as a 7 day average provided by the New York Times)

There are a number of things that could explain differences in the numbers. First, the time periods the data represent vary slightly by source. Second, the data may represent the date the test was done, the date the test was reported, or the estimated date of infection. Generally I think what is reported is the date the test was done. This is hard data of a sort, but it introduces a time lag as numerous and scattered labs report their data. The data you are looking at might not yet represent all the data available on a given day, and it might be corrected retroactively, meaning if you check what today’s number was a week from now, you might see a different number from today. Finally, when reporting data for a location like a county, it may be important whether they are reporting all tests done in that county or matching tests to the home addresses (or employer addresses?) of the individuals tested. Philadelphia, for example, has a huge health care industry with a lot of commuters not just from surrounding counties in Pennsylvania but parts of New Jersey and Delaware. (States were never the right entities to track this pandemic, it should obviously be done by entities covering metro areas.)

If all the sources were using similar data but using slightly different time periods or calculation methods, I would expect some differences but I would expect the differences to be random. The state health department numbers are consistently lower, however. I am hoping that might be because they are doing a better job of matching tests to home addresses.
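As a rough sketch of the arithmetic behind these figures, the snippet below computes a 7-day rate per 100,000 for two windows shifted by one day, then checks how often one source comes in below another. The daily counts, the round population figure, and the weekly rates are all made up for illustration. They are not the actual CDC, state, or Covid Act Now data.

```python
def rate_per_100k(daily_cases, population):
    """Cases per 100,000 residents summed over the days supplied."""
    return sum(daily_cases) / population * 100_000


def share_lower(source_a, source_b):
    """Fraction of periods in which source A reports less than source B."""
    return sum(a < b for a, b in zip(source_a, source_b)) / len(source_a)


# Hypothetical inputs: eight days of new-case counts and a round population.
population = 1_600_000
daily = [200, 210, 190, 220, 230, 250, 260, 280]

window_one = daily[0:7]  # a 7-day window
window_two = daily[1:8]  # the same length of window, shifted by one day

rate_one = rate_per_100k(window_one, population)
rate_two = rate_per_100k(window_two, population)
print(f"Window one: {rate_one:.1f} per 100,000 per 7 days")
print(f"Window two: {rate_two:.1f} per 100,000 per 7 days")

# Made-up weekly rates from two sources, to test for a one-sided gap.
state_rates = [80.0, 85.0, 90.0, 86.0]
cdc_rates = [100.0, 105.0, 112.0, 111.0]
lower_share = share_lower(state_rates, cdc_rates)
print(f"State below CDC in {lower_share:.0%} of weeks")
```

A one-day shift in the window moves the rate a little in either direction, which is the kind of random difference I would expect. A source that is lower in every single period points to something systematic instead.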

more Peter Turchin

Here’s a new journal article from Peter Turchin and his Seshat database to empirically test hypotheses about history.

Rise of the war machines: Charting the evolution of military technologies from the Neolithic to the Industrial Revolution

What have been the causes and consequences of technological evolution in world history? In particular, what propels innovation and diffusion of military technologies, details of which are comparatively well preserved and which are often seen as drivers of broad socio-cultural processes? Here we analyze the evolution of key military technologies in a sample of pre-industrial societies world-wide covering almost 10,000 years of history using Seshat: Global History Databank. We empirically test previously speculative theories that proposed world population size, connectivity between geographical areas of innovation and adoption, and critical enabling technological advances, such as iron metallurgy and horse riding, as central drivers of military technological evolution. We find that all of these factors are strong predictors of change in military technology, whereas state-level factors such as polity population, territorial size, or governance sophistication play no major role. We discuss how our approach can be extended to explore technological change more generally, and how our results carry important ramifications for understanding major drivers of evolution of social complexity.

PLOS One

Glancing through the methods confirms my suspicion that big data or machine learning analyses pretty much start from old-school correlation and regression, then branch out (sometimes literally in things called “trees”) from there.

July 2021 in Review

July 2021 is in the books. In current events (I’m writing on Sunday, August 1), the Delta variant of Covid is now ripping through the unvaccinated population in the U.S. and predictably leaking out into the vaccinated population. I wasn’t too focused on Covid in July though, looking at the posts I have chosen below.

Most frightening and/or depressing story: The western-U.S. megadrought looks like it is settling in for the long haul.

Most hopeful story: A new Lyme disease vaccine may be on the horizon (if you’re a human – if you are a dog, talk to your owner about getting the approved vaccine today.) I admit, I had to stretch a bit to find a positive story this month.

Most interesting story, that was not particularly frightening or hopeful, or perhaps was a mixture of both: “Cliodynamics” is an attempt at a structured, evidence-based way to test hypotheses about history.