Tag Archives: data science

eyes on the street

A group at the University of Pennsylvania looked for statistical evidence that “eyes on the street” are a deterrent to crime. The results are a bit puzzling, as real-world data often can be.


Statistical analyses of urban environments have been recently improved through publicly available high resolution data and mapping technologies that have been adopted across industries. These technologies allow us to create metrics to empirically investigate urban design principles of the past half-century. Philadelphia is an interesting case study for this work, with its rapid urban development and population increase in the last decade. We focus on features of what urban planners call vibrancy: measures of positive, healthy activity or energy in an area. Historically, vibrancy has been very challenging to measure empirically. We explore the association between safety (violent and non-violent crime) and features of local neighborhood vibrancy such as population, economic measures and land use zoning. Despite rhetoric about the negative effects of population density in the 1960s and 70s, we find very little association between crime and population density. Measures based on land use zoning are not an adequate description of local vibrancy and so we construct a database and set of measures of business activity in each neighborhood. We employ several matching analyses within census block groups to explore the relationship between neighborhood vibrancy and safety at a higher resolution. We find that neighborhoods with more vacancy have higher crime but within neighborhoods, crimes tend not to be located near vacant properties. We also find that more crimes occur near business locations but businesses that are active (open) for longer periods are associated with fewer crimes.

This is particularly fascinating to me because I live my life in the middle of this particular data set and am part of it. So it is very interesting to compare what the data seem to be saying with my own experiences and impressions.

The lack of correlation between population density and crime is not surprising. Two neighborhoods with identical density can be drastically different. The correlation between poverty and crime is not surprising – people who are not succeeding in the formal economy and who are not mobile turn to the informal economy, in other words drug dealing, loan sharking and other illegal ways of trying to earn an income. If they are successful at earning an income, they tend to have a lot of cash around, and other people who know about the cash will take advantage of them, knowing they will not go to the police. Other than going to the police, the remaining options are to be taken advantage of repeatedly, or to retaliate. This is how violence escalates, I believe, and it goes hand in hand with development of a culture that tolerates and even celebrates violence, in a never-ending feedback loop.

The puzzling part comes when they try to drill down and look at explanatory factors at a very fine spatial scale. They found a correlation between crime and mixed use zoning, which appears to contradict the idea that eyes on the street around the clock will help to deter crime. And they found more crime around businesses like cafes, restaurants, bars and retail shops. They found that longer open hours seemed to have some deterrent effect on crime relative to shorter open hours.

I think they have made an excellent effort, and I am not sure it can be done much better, but I will point out one idea. They discuss some limitations and nuances of their data, but one they do not mention is that they are looking at reported crimes, most likely police reports or 911 calls. It could be that business owners, staff and patrons are much more likely to call 911 and report a crime than residential neighbors are. Business staff and patrons may see reporting as being in their economic interest and as increasing the safety of their families, and the (alleged) criminals they are reporting are generally strangers. In quieter all-residential neighborhoods, people may not observe as many of the crimes that do occur (fewer “eyes on the street”), and they may prefer not to report the crimes they do see, whether out of loyalty to their neighbors, a wish to mind their own business, quid pro quo, or in some cases fear of retaliation. There is also the factor of some demographic groups trusting the police more than others, although the authors’ statistical attempts to control for demographics may tend to factor this out.


data-ink ratio

Here’s a wiki post about Edward Tufte’s data-ink ratio:

Tufte refers to data-ink as the non-erasable ink used for the presentation of data. If data-ink would be removed from the image, the graphic would lose the content. Non-Data-Ink is accordingly the ink that does not transport the information but it is used for scales, labels and edges. The data-ink ratio is the proportion of Ink that is used to present actual data compared to the total amount of ink (or pixels) used in the entire display. (Ratio of Data-Ink to non-Data-Ink).

Good graphics should include only data-Ink. Non-Data-Ink is to be deleted everywhere where possible. The reason for this is to avoid drawing the attention of viewers of the data presentation to irrelevant elements.

The goal is to design a display with the highest possible data-ink ratio (that is, as close to the total of 1.0), without eliminating something that is necessary for effective communication.
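For reference, the ratio described in the passage above can be written out explicitly. This is how Tufte's definition is usually stated, with data-ink compared to the total ink in the graphic (not to the non-data-ink alone). The pixel counts in the worked line are made up purely for illustration.

```latex
\[
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used in the graphic}}
  = 1 - \text{proportion of the graphic that can be erased without losing data}
\]
% Hypothetical example: 250 of the 1000 inked pixels in a chart encode data.
\[
\text{data-ink ratio} = \frac{250}{1000} = 0.25
\]
```

A ratio of 1.0 would mean every mark on the page encodes data, which is the extreme case the quoted text is aiming toward.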

Before I offer an opinion, I should state the disclaimer that you should definitely listen to Edward Tufte, not me! So here’s my opinion: this idea is clearly absurd when taken to extremes because it would just mean a bunch of dots on a page that you have no way of interpreting. I can’t think of a way of making graphs without axes, scales, and a legend. Labels, arrows, and text boxes are an alternative which I find myself using often when giving projected slide presentations in fairly large rooms.

A reasonable interpretation of Tufte, I think, is to ask yourself whether each new thing you are adding to a graph provides useful information to the reader/viewer, increases the chances that the reader/viewer will draw the right conclusions, and makes the reader/viewer’s job easier rather than harder. The holy grail is to help your audience imbibe the point of the graph with very little effort. Unnecessary 3D effects and clip art aren’t going to do that. A splash of color and some nice big labels that middle-aged people can read from the back of the room just might help.

R and differential equations

Here’s a new R package for solving differential equations. Sounds like something that might be of interest to only a few ivory tower mathematicians, right? But solving differential equations numerically is the critical core of almost any dynamic simulation model, whether it is simulating water, energy, money, ecology, social systems, or the intertwinings of all of these. So if we are going to understand our systems well enough to solve their problems, we have to have some people around who understand these things on a practical level.
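To make "solving differential equations numerically" a little more concrete, here is a minimal sketch in plain Python (not the R package mentioned above, and not its API). It steps a made-up linear reservoir model, dS/dt = inflow - k*S, forward in time with two standard fixed-step methods, forward Euler and classical fourth-order Runge-Kutta. The model, the parameter values, and all function names are assumptions chosen only for illustration.

```python
import math

def euler(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with fixed-step forward Euler."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y

def rk4(f, y0, t0, t1, n_steps):
    """Integrate dy/dt = f(t, y) with the classical 4th-order Runge-Kutta method."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y

# Hypothetical model: storage S fills at a constant rate and drains in proportion to S.
INFLOW = 2.0   # assumed inflow per unit time
K = 0.5        # assumed drainage coefficient

def reservoir(t, s):
    return INFLOW - K * s

def exact(t, s0=0.0):
    """Closed-form solution of the reservoir model, used to check the solvers."""
    return INFLOW / K + (s0 - INFLOW / K) * math.exp(-K * t)

euler_error = abs(euler(reservoir, 0.0, 0.0, 10.0, 100) - exact(10.0))
rk4_error = abs(rk4(reservoir, 0.0, 0.0, 10.0, 100) - exact(10.0))
print(f"Euler error: {euler_error:.2e}, RK4 error: {rk4_error:.2e}")
```

With the same step size, the Runge-Kutta result should land much closer to the exact answer than Euler does. Accuracy per unit of computation is the kind of trade-off a dedicated solver package handles for you.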

iris scans

Border counties in Texas are using mandatory iris scans to build a database of illegal immigrants. I imagine it will spread to big city police departments, and then to everywhere else. I imagine at some point it will become a form of identification people can use as an alternative to carrying a wallet and passport. I don’t know that the technology is concerning in and of itself – it’s essentially just a modern and accurate form of identification. What’s concerning is what some immoral governments, amoral corporations, and criminal elements might be able to do with large databases of this type of information.

synergy, uniqueness, and redundancy in interacting environmental variables

This is a bit over my head, but one thing I am interested in is analyzing and making sense of a large number of simultaneous time series, whether measured in the environment, the economy, or output of a computer model. This can easily be overwhelming, so one place people often start is trying to figure out which time series are telling essentially the same story, or directly opposite stories. Understanding this allows you to reduce the number of variables you need to analyze to a more manageable number. Time series make this more complicated though, because two variables could be telling the same or opposite stories, but if the signals are offset in time, simple ways of looking at correlation may not lead to the right conclusions. With simulations you have yet another complicating factor: the implicit links between your variables, intended or not, and whether or not they exist in the real world.
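The point about time offsets can be shown with a small sketch. Below, two synthetic series carry exactly the same signal, but one is delayed by five steps. The ordinary correlation at zero lag is essentially zero, while scanning a range of lags recovers the match. The series, the lag range, and the function names are all invented for illustration, and plain lagged Pearson correlation is only one simple way to look at this.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(x, y, max_lag):
    """Return {lag: r}, where r is the correlation of x[t] with y[t + lag]."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:len(y) + lag]
        result[lag] = pearson(xs, ys)
    return result

# Synthetic data: y is the same sine wave as x, delayed by 5 time steps.
N, PERIOD, DELAY = 200, 20, 5
x = [math.sin(2 * math.pi * t / PERIOD) for t in range(N)]
y = [math.sin(2 * math.pi * (t - DELAY) / PERIOD) for t in range(N)]

r_by_lag = lagged_correlation(x, y, max_lag=8)
best_lag = max(r_by_lag, key=r_by_lag.get)
print("correlation at lag 0:", round(r_by_lag[0], 3))
print("best lag:", best_lag, "correlation:", round(r_by_lag[best_lag], 3))
```

Because a sine wave repeats, the lag range here is kept shorter than one period so the best match is unambiguous. Real, noisy series would need more care than this.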

Temporal information partitioning: Characterizing synergy, uniqueness, and redundancy in interacting environmental variables

Information theoretic measures can be used to identify non-linear interactions between source and target variables through reductions in uncertainty. In information partitioning, multivariate mutual information is decomposed into synergistic, unique, and redundant components. Synergy is information shared only when sources influence a target together, uniqueness is information only provided by one source, and redundancy is overlapping shared information from multiple sources. While this partitioning has been applied to provide insights into complex dependencies, several proposed partitioning methods overestimate redundant information and omit a component of unique information because they do not account for source dependencies. Additionally, information partitioning has only been applied to time-series data in a limited context, using basic pdf estimation techniques or a Gaussian assumption. We develop a Rescaled Redundancy measure (Rs) to solve the source dependency issue, and present Gaussian, autoregressive, and chaotic test cases to demonstrate its advantages over existing techniques in the presence of noise, various source correlations, and different types of interactions. This study constitutes the first rigorous application of information partitioning to environmental time-series data, and addresses how noise, pdf estimation technique, or source dependencies can influence detected measures. We illustrate how our techniques can unravel the complex nature of forcing and feedback within an ecohydrologic system with an application to 1-minute environmental signals of air temperature, relative humidity, and windspeed. The methods presented here are applicable to the study of a broad range of complex systems composed of interacting variables.
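A toy example helps with the vocabulary in the abstract above. The sketch below only computes ordinary mutual information for two tiny hand-built cases. It is not the authors' Rescaled Redundancy measure or any full partitioning method. In the first case the target is the XOR of two independent binary sources, so neither source alone tells you anything but together they determine the target (synergy). In the second case both sources are copies of the target, so each one alone already tells you everything (redundancy). All names are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, treating each (a, b) pair in the list as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
        for (a, b), c in joint.items()
    )

def source_target_information(rows):
    """Return (I(X1;Y), I(X2;Y), I(X1,X2;Y)) for rows of (x1, x2, y)."""
    i_x1 = mutual_information([(x1, y) for x1, _, y in rows])
    i_x2 = mutual_information([(x2, y) for _, x2, y in rows])
    i_both = mutual_information([((x1, x2), y) for x1, x2, y in rows])
    return i_x1, i_x2, i_both

# Synergy: y = x1 XOR x2, with all four input combinations equally likely.
xor_rows = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]
synergy_case = source_target_information(xor_rows)

# Redundancy: both sources are identical copies of the target.
copy_rows = [(v, v, v) for v in (0, 1)]
redundancy_case = source_target_information(copy_rows)

print("XOR  (I1, I2, I12):", synergy_case)
print("copy (I1, I2, I12):", redundancy_case)
```

In the XOR case the individual informations are zero while the joint information is one bit. In the copy case all three are one bit, so the two sources overlap completely.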

May 2017 in Review

Most frightening stories:

  • The public today is more complacent about nuclear weapons than they were in the 1980s, even though the risk is arguably greater and leaders seem to be more ignorant and reckless.
  • The NSA is trying “to identify laboratories and/or individuals who may be involved in nefarious use of genetic research”.
  • We hit 410 ppm at Mauna Loa.

Most hopeful stories:

Most interesting stories, that were not particularly frightening or hopeful, or perhaps were a mixture of both:

  • Some experts think the idea of national sovereignty itself is now in doubt.
  • Taser wants to record everything the police do, everywhere, all the time, and use artificial intelligence to make sense of the data.
  • The sex robots are here.

April 2017 in Review

Most frightening stories:

Most hopeful stories:

Most interesting stories, that were not particularly frightening or hopeful, or perhaps were a mixture of both:

  • I first heard of David Fleming, who wrote a “dictionary” that provides “deft and original analysis of how our present market-based economy is destroying the very foundations―ecological, economic, and cultural― on which it depends, and his core focus: a compelling, grounded vision for a cohesive society that might weather the consequences.”
  • Judges are relying on algorithms to inform probation, parole, and sentencing decisions.
  • I finished reading Rainbows End, a fantastic Vernor Vinge novel about augmented reality in the near future, among other things.

Richard Berk

Here’s a Bloomberg article on Richard Berk, a statistician at the University of Pennsylvania whose algorithms are used for parole, probation, and sentencing decisions.

Risk scores, generated by algorithms, are an increasingly common factor in sentencing. Computers crunch data—arrests, type of crime committed, and demographic information—and a risk rating is generated. The idea is to create a guide that’s less likely to be subject to unconscious biases, the mood of a judge, or other human shortcomings. Similar tools are used to decide which blocks police officers should patrol, where to put inmates in prison, and who to let out on parole. Supporters of these tools claim they’ll help solve historical inequities, but their critics say they have the potential to aggravate them, by hiding old prejudices under the veneer of computerized precision. Some people see them as a sterilized version of what brought protesters into the streets at Black Lives Matter rallies…


reproducible research in hydrology

This October 2016 article in Water Resources Research on reproducible research got some attention.

Hutton, C., T. Wagener, J. Freer, D. Han, C. Duffy, and B. Arheimer (2016), Most computational hydrology is not reproducible, so is it really science?, Water Resour. Res., 52, 7548–7555, doi:10.1002/2016WR019285.

Reproducibility is a foundational principle in scientific research. Yet in computational hydrology the code and data that actually produces published results are not regularly made available, inhibiting the ability of the community to reproduce and verify previous findings. In order to overcome this problem we recommend that reuseable code and formal workflows, which unambiguously reproduce published scientific results, are made available for the community alongside data, so that we can verify previous findings, and build directly from previous work. In cases where reproducing large-scale hydrologic studies is computationally very expensive and time-consuming, new processes are required to ensure scientific rigor. Such changes will strongly improve the transparency of hydrological research, and thus provide a more credible foundation for scientific advancement and policy support.

slot machines and pseudo-random numbers

Russian hackers have been able to beat a certain brand of slot machine by buying old models, studying their code, and figuring out the pattern of pseudo-random numbers they generate. From Wired:

But as the “pseudo” in the name suggests, the numbers aren’t truly random. Because human beings create them using coded instructions, PRNGs can’t help but be a bit deterministic. (A true random number generator must be rooted in a phenomenon that is not manmade, such as radioactive decay.) PRNGs take an initial number, known as a seed, and then mash it together with various hidden and shifting inputs—the time from a machine’s internal clock, for example—in order to produce a result that appears impossible to forecast. But if hackers can identify the various ingredients in that mathematical stew, they can potentially predict a PRNG’s output. That process of reverse engineering becomes much easier, of course, when a hacker has physical access to a slot machine’s innards.

Knowing the secret arithmetic that a slot machine uses to create pseudorandom results isn’t enough to help hackers, though. That’s because the inputs for a PRNG vary depending on the temporal state of each machine. The seeds are different at different times, for example, as is the data culled from the internal clocks. So even if they understand how a machine’s PRNG functions, hackers would also have to analyze the machine’s gameplay to discern its pattern…

… the operatives use their phones to record about two dozen spins on a game they aim to cheat. They upload that footage to a technical staff in St. Petersburg, who analyze the video and calculate the machine’s pattern based on what they know about the model’s pseudorandom number generator. Finally, the St. Petersburg team transmits a list of timing markers to a custom app on the operative’s phone; those markers cause the handset to vibrate roughly 0.25 seconds before the operative should press the spin button.
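The core idea in the excerpt, that a pseudo-random sequence becomes predictable once you know the algorithm and the machine's current state, can be sketched with a toy generator. The code below is not the slot machine's actual algorithm, which the article does not describe. It is a simple linear congruential generator with textbook constants, chosen because each output is the generator's entire internal state. That makes prediction trivial here. The real attack reportedly needed about two dozen observed spins, presumably because the visible result reveals only part of the state. The class and function names are hypothetical.

```python
# Toy linear congruential generator: state -> (A * state + C) mod M.
# These are widely published textbook constants, used here only for illustration.
A, C, M = 1664525, 1013904223, 2**32

class ToyLCG:
    """A deliberately weak generator whose output is its full internal state."""

    def __init__(self, seed):
        self.state = seed % M

    def next(self):
        self.state = (A * self.state + C) % M
        return self.state

def predict_next(observed_output, count):
    """Predict the next `count` outputs from a single observed output.

    This works only because the observer knows the algorithm and constants,
    and because one output of this toy generator reveals the whole state.
    """
    predictions = []
    state = observed_output
    for _ in range(count):
        state = (A * state + C) % M
        predictions.append(state)
    return predictions

machine = ToyLCG(seed=123456789)   # the "secret" seed the observer never sees
observed = machine.next()          # one result watched from the outside
predicted = predict_next(observed, 5)
actual = [machine.next() for _ in range(5)]
print("predictions match:", predicted == actual)
```

Generators intended for security-sensitive use are designed so that observed outputs do not expose the internal state this directly.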