Tags: data

Friday, August 29th, 2025

Databasing

A few years back, Craig wrote a great piece called Fast Software, the Best Software:

Speed in software is probably the most valuable, least valued asset. To me, speedy software is the difference between an application smoothly integrating into your life, and one called upon with great reluctance.

Nelson Elhage said much the same thing in his reflections on software performance:

I’ve really come to appreciate that performance isn’t just some property of a tool independent from its functionality or its feature set. Performance — in particular, being notably fast — is a feature in and of its own right, which fundamentally alters how a tool is used and perceived.

Or, as Robin put it:

I don’t think a website can be good until it’s fast.

Those sentiments underpin The Session. Speed is as much a priority as usability, accessibility, privacy, and security.

I’m fortunate in that the site doesn’t have an underlying business model at odds with these priorities. I’m under no pressure to add third-party code that would track users and slow down the website.

When it comes to making fast websites, most of the obstacles are put in place by front-end development, mostly JavaScript. I’ve been pretty ruthless in my pursuit of speed on The Session, removing as much JavaScript as possible. On the bigger pages, the bottleneck now is DOM size rather than parsing and executing JavaScript. As bottlenecks go, it’s not the worst.

But even with all my core web vitals looking good, I still have an issue that can’t be solved with front-end optimisations. Time to first byte (or TTFB if you’d rather use an initialism that takes just as long to say as the words it’s replacing).
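(Measuring it is easy enough, at least. curl can report the time to first byte for any URL; time_starttransfer is the moment the first response byte arrives, which includes the server’s think-time.)

```bash
# Report the time to first byte for a page
curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s\n' https://thesession.org/
```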

When it comes to reducing the time to first byte, there are plenty of factors that are out of my control. But in the case of The Session, something I do have control over is the server set-up, specifically the database.

Now I could probably solve a lot of my speed issues by throwing money at the problem. If I got a bigger, better server with more RAM and CPUs, I’m pretty sure it would improve the time to first byte. But my wallet wouldn’t thank me.

(It’s still worth acknowledging that throwing money at the problem is a perfectly valid approach for back-end optimisation, in a way that isn’t available on the front end; you can’t buy all your users new devices.)

So I’ve been spending some time really getting to grips with the MySQL database that underpins The Session. It was already normalised and indexed to the hilt. But perhaps there were server settings that could be tweaked.
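For illustration, this is the kind of thing I mean. MySQL server settings live in my.cnf, and a handful of them have an outsized effect on read-heavy workloads. The values below are made up for the sake of example, not what The Session actually uses:

```ini
# Hypothetical my.cnf tweaks; the right values depend on your hardware and workload
[mysqld]
innodb_buffer_pool_size = 2G    # cache more data and indexes in RAM
tmp_table_size          = 64M   # allow bigger in-memory temporary tables
max_heap_table_size     = 64M   # the effective cap is the smaller of these two
table_open_cache        = 4000  # keep frequently-used tables open
```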

This is where I have to give a shout-out to Releem, a service that is exactly what I needed. It monitors your database and then over time suggests configuration tweaks, explaining each one along the way. It’s a seriously good service that feels as empowering as it is useful.

I wish I could afford to use Releem on an ongoing basis, but luckily there’s a free trial period that I could avail of.

Thanks to Releem, I was also able to see which specific queries were taking the longest. There was one in particular that had always bothered me…

If you’re a member of The Session, then you can see any activity related to something you submitted in the past. Say, for example, that you added a tune or an event to the site a while back. If someone else comments on that, or bookmarks it, then that shows up in your “notifications” feed.

That’s all well and good but under the hood it was relying on a fairly convoluted database query to a very large table (a table that’s effectively a log of all user actions). I tried all sorts of query optimisations but there always seemed to be some combination of circumstances where the request would take ages.

For a while I even removed the notifications functionality from the site, hoping it wouldn’t be missed. But a couple of people wrote to ask where it had gone so I figured I ought to reinstate it.

After exhausting all the technical improvements, I took a step back and thought about the purpose of this particular feature. That’s when I realised that I had been thinking about the database query too literally.

The results are sorted in reverse chronological order, which makes sense. They’re also chunked into groups of ten, which also makes sense. But I had allowed for the possibility that you could navigate through your notifications back to the very start of your time on the site.

But that’s not really how we think of notifications in other settings. What would happen if I were to limit your notifications only to activity in, say, the last month?

Boom! Instant performance improvement by orders of magnitude.
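In query terms, the change is tiny. Here’s a sketch of the idea, with made-up table and column names standing in for the real schema:

```sql
-- Before: paginating over the entire activity log means the database
-- can end up scanning deep into a huge table
SELECT * FROM activity_log
WHERE owner_id = ?
ORDER BY created_at DESC
LIMIT 10 OFFSET ?;

-- After: a date bound lets an index on (owner_id, created_at)
-- prune almost everything before sorting
SELECT * FROM activity_log
WHERE owner_id = ?
  AND created_at >= NOW() - INTERVAL 1 MONTH
ORDER BY created_at DESC
LIMIT 10;
```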

I guess there’s a lesson there about switching off the over-analytical side of my brain and focusing on actual user needs.

Anyway, thanks to the time I’ve spent honing the database settings and optimising the longest queries, I’ve reduced the latency by quite a bit. I’m hoping that will result in an improvement to the time to first byte.

Time—and monitoring tools—will tell.

Saturday, July 19th, 2025

Beautiful Public Data

A curated selection of visually interesting datasets collected by local, state and federal government agencies.

This site must’ve started as a way of showcasing really interesting collections, but now it’s turning into an archive of what’s being systematically destroyed by the current US regime.

Sunday, May 18th, 2025

Session spider

Here’s some code to show the distance to the nearest airports on a map.

Here’s a modified version that shows the distance to the nearest Greggs. The hub-and-spoke visualisation overlaid on the map changes as you pan around, making it look like a spider bestriding the landscape.

Jonty’s version shows the distance to the nearest Pret a Manger.

I got nerdsniped by someone saying:

@adactio This would be cool for sessions 😉

He’s right, dammit! So here you go:

Session spider.

Now you can see how far you are from the nearest traditional Irish music sessions.

It’s using data from the weekly data dumps from thesession.org—I added a GeoJSON file in there.
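If you haven’t come across GeoJSON before, it’s just JSON with agreed-upon keys; note that coordinates are longitude first, then latitude. A single entry might look something like this (the name and coordinates are invented for illustration, not taken from the actual file):

```json
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [-0.1374, 50.8215]
  },
  "properties": {
    "name": "A session in Brighton"
  }
}
```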

Pure silliness, but it does make me wonder what kind of actually good data visualisations could be made with all this scrumptious data.

Wednesday, April 30th, 2025

Codewashing

I have little understanding for people using large language models to generate slop: words and images that nobody asked for.

I have more understanding for people using large language models to generate code. Code isn’t the thing in the same way that words or images are; code is the thing that gets you to the thing.

And if a large language model hallucinates some code, you’ll find out soon enough:

With code you get a powerful form of fact checking for free. Run the code, see if it works.

But I want to push back on one justification I see repeatedly about using large language models to write code. Here’s Craig:

There are many moral and ethical issues with using LLMs, but building software feels like one of the few truly ethically “clean”(er) uses (trained on open source code, etc.)

That’s not how this works. Yes, the large language models are trained on lots of code (most of it open source), but they’re not only trained on that. That’s on top of everything else: all the stolen books, all the unpaid creative work of others.

Even Robin Sloan, who first says:

I think the case of code is especially clear, and, for me, basically settled.

…goes on to acknowledge:

But, again, it’s important to say: the code only works because of Everything. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.

When large language models are trained on domain-specific data, it’s always in addition to the mahoosive amount of content they’ve already stolen. It’s that mahoosive amount of content—not the domain-specific data—that enables them to parse your instructions.

(Note that I’m being very deliberate in saying “parse”, not “understand.” Though make no mistake, I’m astonished at how good these tools are at parsing instructions. I say that as someone who tried to write natural language parsers for text-only adventure games back in the 1980s.)

So, sure, go ahead and use large language models to write code. But don’t fool yourself into thinking that it’s somehow ethical.

What I said here applies to code too:

If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.

Tuesday, February 18th, 2025

Own what’s yours

Now, more than ever, it’s critical to own your data. Really own it. Like, on your hard drive and hosted on your website.

Is taking control of your content less convenient? Yeah–of course. That’s how we got in this mess to begin with. It can be a downright pain in the ass. But it’s your pain in the ass. And that’s the point.

Sunday, February 16th, 2025

My Life in Weeks by Gina Trapani

This is one way of putting things into perspective.

Tuesday, January 21st, 2025

The New York Good Times

Better than the real thing. All true too.

Refresh for more.

What I’ve learned about writing AI apps so far | Seldo.com

LLMs are good at transforming text into less text

Laurie is really onto something with this:

This is the biggest and most fundamental thing about LLMs, and a great rule of thumb for what’s going to be an effective LLM application. Is what you’re doing taking a large amount of text and asking the LLM to convert it into a smaller amount of text? Then it’s probably going to be great at it. If you’re asking it to convert into a roughly equal amount of text it will be so-so. If you’re asking it to create more text than you gave it, forget about it.

Depending how much of the hype around AI you’ve taken on board, the idea that they “take text and turn it into less text” might seem like a gigantic back-pedal away from previous claims of what AI can do. But taking text and turning it into less text is still an enormous field of endeavour, and a huge market. It’s still very exciting, all the more exciting because it’s got clear boundaries and isn’t hype-driven over-reaching, or dependent on LLMs overnight becoming way better than they currently are.

Saturday, January 18th, 2025

Public Domain Image Archive

Explore our hand-picked collection of 10,046 out-of-copyright works, free for all to browse, download, and reuse. This is a living database with new images added every week.

Friday, January 17th, 2025

Changing

It always annoys me when a politician is accused of “flip-flopping” when they change their mind on something. Instead of admiring someone for being willing to re-examine previously-held beliefs, we lambast them. We admire conviction, even though that’s a trait that has been at the root of history’s worst atrocities.

When you look at the history of human progress, some of our greatest advances were made by people willing to question their beliefs. Prioritising data over opinion is what underpins the scientific method.

But I get it. It can be very uncomfortable to change your mind. There’s inevitably going to be some psychological resistance, a kind of inertia of opinion that favours the sunk cost of all the time you’ve spent believing something.

I was thinking back to times when I’ve changed my opinion on something after being confronted with new evidence.

In my younger days, I was staunchly anti-nuclear power. It didn’t help that in my younger days, nuclear power and nuclear weapons were conceptually linked in the public discourse. In the intervening years I’ve come to believe that nuclear power is far less destructive than fossil fuels. There are still a lot of issues—in terms of cost and time—which make nuclear less attractive than solar or wind, but I honestly can’t reconcile someone claiming to be an environmentalist while simultaneously opposing nuclear power. The data just doesn’t support that conclusion.

Similarly, I remember in the early 2000s being opposed to genetically-modified crops. But the more I looked into the facts, the clearer it became that there was nothing—other than vibes—to bolster that opposition. And yet I know many people who’ve maintained their opposition, often the same people who point to the scientific evidence when it comes to climate change. It’s a strange kind of cognitive dissonance that would allow for that kind of cherry-picking.

There are other situations where I’ve gone more in the other direction—initially positive, later negative. Google’s AMP project is one example. It sounded okay to me at first. But as I got into the details, its fundamental unfairness couldn’t be ignored.

I was fairly neutral on blockchains at first, at least from a technological perspective. There was even some initial promise of distributed data preservation. But over time my opinion went down, down, down.

Bitcoin, with its proof-of-work idiocy, is the poster-child of everything wrong with the reality of blockchains. The astoundingly wasteful energy consumption is just staggeringly pointless. Over time, any sufficiently wasteful project becomes indistinguishable from evil.

Speaking of energy usage…

My feelings about large language models have been dominated by two massive elephants in the room. One is the completely unethical way that the training data has been acquired (by ripping off the work of people who never gave their permission). The other is the profligate energy usage in not just training these models, but also running queries on the network.

My opinion on the provenance of the training data hasn’t changed. If anything, it’s hardened. I want us to fight back against this unethical harvesting by poisoning the well that the training data is drawing from.

But my opinion on the energy usage might just be swaying a little.

Michael Liebreich published an in-depth piece for Bloomberg last month called Generative AI – The Power and the Glory. He doesn’t sugar-coat the problems with current and future levels of power consumption for large language models, but he also doesn’t paint a completely bleak picture.

Effectively there’s a yet-to-be-decided battle between Koomey’s law and the Jevons paradox. Time will tell which way this will go.

The whole article is well worth a read. But what really gave me pause was a recent piece by Hannah Ritchie asking What’s the impact of artificial intelligence on energy demand?

When Hannah Ritchie speaks, I listen. And I’m well aware of the irony there. That’s classic argument from authority, when the whole point of Hannah Ritchie’s work is that it’s the data that matters.

In any case, she does an excellent job of putting my current worries into a historical context, as well as laying out some potential futures.

Don’t get me wrong, the energy demands of large language models are enormous and are only going to increase, but we may well see some compensatory efficiencies.

Personally, I’d just like to see these tools charge a fair price for their usage. Right now they’re being subsidised by venture capital. If people actually had to pay out of pocket for the energy used per query, we’d get a much better idea of how valuable these tools actually are to people.

Instead we’re seeing these tools being crammed into existing products regardless of whether anybody actually wants them (and in my anecdotal experience, most people resent this being forced on them).

Still, I thought it was worth making a note of how my opinion on the energy usage of large language models is open to change.

But I still won’t use one that’s been trained on other people’s work without their permission.

Thursday, November 21st, 2024

CCC | Ban tracking and personalised advertising

YES! THIS!!!

A ban on tracking-based personalised advertising will provide an incentive to reinforce sustainable alternative models and, in fact, will be a condition for making them viable. The advertising industry already has sustainable, proven concepts for effective online advertising that do not require targeted tracking and personalisation (e.g. contextual advertising).

Tuesday, November 12th, 2024

1 dataset. 100 visualizations.

The same small dataset visualised in a hundred different ways, with notes on the strengths and weaknesses of each one.

Saturday, November 9th, 2024

Optimism

I think of myself as an optimist. It makes me insufferable sometimes.

When someone is having a moan about something in the news and they say something like “people are terrible”, I can’t resist weighing in with a “well, actually…” Then I’ll start channeling Rutger Bregman, Rebecca Solnit, and Hans Rosling, pointing to all the evidence that people are, by and large, decent. I should really just read the room and shut up.

I opened my talk Of Time And The Web with a whole spiel about how we seem to be hard-wired to pay more attention to bad news than good (perhaps for valid evolutionary reasons).

I like to think that my optimism is rational, backed up by data. But if I’m going to be rational, then I also can’t become too attached to any particular position (like, say, optimism). I should be willing to change my mind when I’m confronted with new evidence.

A truckload of new evidence got dumped on my psyche this week. The United States of America elected Donald Trump as president. Again.

Even here I found a small glimmer of a bright side: at least the result was clear cut. I was dreading weeks or even months of drawn-out ballot counting, lawsuits and uncertainty. At least the band-aid was decisively ripped away.

Back in 2016, I could tell myself all sorts of reasons why this might have happened. Why people might have been naïve or misled into voting a dangerous idiot into power. But the naïveté was all mine. The majority of America really is that sexist.

This feels very different to 2016. And hey, remember when we woke up to that election result and one of the first things we did was take out subscriptions to the New York Times and the Washington Post to “support real journalism”? Yeah, that worked out just great, didn’t it?

My faith in human nature is taking quite a hit. An electoral experiment has been run three times now—having this misogynistic, racist, narcissistic idiot run for the highest office in the land—and the same result came up twice.

I naïvely thought that the more people saw of his true nature, the less chance he would have. When he kept going off-script at his rallies, spouting the vilest of threats, I thought there was an upside. At least now people would see for themselves what he’s really like.

But in the end it didn’t matter one whit. Like I said in a different context:

To use an outdated movie reference, imagine a raving Charlton Heston shouting that “Soylent Green is people!”, only to be met with indifference. “Everyone knows Soylent Green is people. So what?”

I never liked talking about “faith” in human nature. To me, it wasn’t faith. It was just a rational assessment. Now I’m not so sure. Maybe I need some faith after all.

I wonder if my optimism will return. It probably will (see? I’m such an optimist). But if it does, perhaps it will have to be an optimism that exists despite the data, not because of it.

Sunday, October 20th, 2024

Archives

Speaking of serendipity, not long after I wrote about making a static archive of The Session for people to download and share, I came across a piece by Alex Chan about using static websites for tiny archives.

The use-case is slightly different—this is about personal archives, like paperwork, screenshots, and bookmarks. But we both came up with the same process:

I’m deliberately going low-scale, low-tech. There’s no web server, no build system, no dependencies, and no JavaScript frameworks.

And we share the same hope:

Because this system has no moving parts, and it’s just files on a disk, I hope it will last a long time.

You should read the whole thing, where Alex describes all the other approaches they took before settling on plain ol’ HTML files in a folder:

HTML is low maintenance, it’s flexible, and it’s not going anywhere. It’s the foundation of the entire web, and pretty much every modern computer has a web browser that can render HTML pages. These files will be usable for a very long time – probably decades, if not more.

I’m enjoying this approach, so I’m going to keep using it. What I particularly like is that the maintenance burden has been essentially zero – once I set up the initial site structure, I haven’t had to do anything to keep it working.

They also talk about digital preservation:

I’d love to see static websites get more use as a preservation tool.

I concur! And it’s particularly interesting for Alex to be making this observation in the context of working with the Flickr foundation. That’s where they’re experimenting with the concept of a data lifeboat:

What should we do when a digital service sinks?

This is something that George spoke about at the final dConstruct in 2022. You can listen to the talk on the dConstruct archive.

Friday, September 27th, 2024

Capt. Grace Hopper on Future Possibilities: Data, Hardware, Software, and People (Part One, 1982) - YouTube

Wow! Grace Hopper has always been a hero to me, but I had no idea she was such a fantastic presenter. She’s completely engaging, with the timing and deadpan delivery of a stand-up comedian at times.

Capt. Grace Hopper on Future Possibilities: Data, Hardware, Software, and People (Part One, 1982)

Thursday, September 26th, 2024

The datalist element on iOS

The datalist element is good. It was a bit bumpy there for a while, but browser implementations have improved over time. Now it’s by far the simplest and most robust way to create an autocompleting combobox widget.

Hook up an input element with a datalist element using the list and id attributes and you’re done. You can even use a bit of Ajax to dynamically update the option elements inside the datalist in response to the user’s input. The browser takes care of all the interaction. If you try to roll your own combobox implementation, it’s almost certainly going to involve a lot of JavaScript and still probably won’t account for all use cases.
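If you haven’t used it before, the hookup really is that small. Something like this is all it takes (the tune names are just examples):

```html
<label for="tune">Find a tune</label>
<input type="text" id="tune" name="tune" list="tunes">
<datalist id="tunes">
  <option value="The Butterfly">
  <option value="Drowsy Maggie">
  <option value="The Silver Spear">
</datalist>
```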

Safari on iOS—and therefore all browsers on iOS—didn’t support datalist for quite a while. But once it finally shipped, it worked really nicely. The options showed up just like autocomplete suggestions above the keyboard.

But that broke a while back.

The suggestions still appeared, but if you tapped on one of them, nothing happened. The input element didn’t get updated. You had to tap on a little downward arrow inside the input in order to see the list of options.

That was really frustrating for anybody on iOS using The Session. By far the most common task on the site is searching for a tune, something that’s greatly (progressively) enhanced with a dynamically-updating datalist.

I just updated to iOS 18 specifically to see if this bug has been fixed, and it has:

Fixed updating the input value when selecting an option from a datalist element.

Hallelujah!

But now there’s some additional behaviour that’s a little weird.

As well as showing the options in the autocomplete list above the keyboard, Safari on iOS—and therefore all browsers on iOS—also pops up the options as a list (as if you had tapped on that downward arrow). If the list is more than a few options long, it completely obscures the input element you’re typing into!

I’m not sure if this is a bug or if it’s the intended behaviour. It feels like a bug, but I don’t know if I should file something.

For now, I’ve updated the datalist elements on The Session to only ever hold three option elements in order to minimise the problem. Seeing as the autosuggest list above the keyboard only ever shows a maximum of three suggestions anyway, this feels like a reasonable compromise.
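Here’s roughly what that dynamic updating looks like, capped at three options. The endpoint and response shape are hypothetical, not the actual code running on The Session:

```javascript
// Update the datalist as the user types, keeping at most three options
const input = document.querySelector('#tune');
const list = document.querySelector('#tunes');
input.addEventListener('input', async () => {
  const query = input.value.trim();
  if (!query) return;
  // Hypothetical endpoint returning a JSON array of tune names
  const response = await fetch(`/search?q=${encodeURIComponent(query)}`);
  const names = await response.json();
  list.replaceChildren(...names.slice(0, 3).map((name) => {
    const option = document.createElement('option');
    option.value = name;
    return option;
  }));
});
```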

Sunday, June 9th, 2024

DOC •  The power of beauty in communicating complex ideas

As designers creating images to communicate complex ideas, we rationalize our processes, we bring objectivity to our craft, we want our clients to think that our decisions are based on reasoning. However, we should also defend our intuitions, our subjectivity. We should also defend pursuing beauty as it is one of our most powerful tools.

Tuesday, March 19th, 2024

A microdata enhanced HTML Webcomponent for Leaflet | k-nut — Blog

Here’s a nice HTML web component that uses structured data in the markup to populate a Leaflet map.

Personally I’d probably use microformats rather than microdata, but the principle is the same: progressive enhancement from plain old HTML to an interactive map.
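For comparison, here’s what generic microdata for a location can look like. This is plain schema.org vocabulary, not the specific format that k-nut’s component expects:

```html
<div itemscope itemtype="https://schema.org/Place">
  <span itemprop="name">Brighton Bandstand</span>
  <span itemprop="geo" itemscope itemtype="https://schema.org/GeoCoordinates">
    <meta itemprop="latitude" content="50.8225">
    <meta itemprop="longitude" content="-0.1515">
  </span>
</div>
```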

Tuesday, March 5th, 2024

The global fight against polio — how far have we come? - Our World in Data

I think it’s always worth revisiting accomplishments like this—it’s absolutely astounding that we don’t even think about polio (or smallpox!) in our day-to-day lives, when just two generations ago it was something that directly affected everybody.

The annual number of people paralyzed by polio was reduced by over 99% in the last four decades.

Wednesday, January 3rd, 2024

Historical Trails

Maggie explores different ways of visualising journeys on the web, including browser histories:

Perhaps web browsing histories should look more like Git commit histories? Perhaps distinct branches could represent different topics and research avenues?

A memex in every web browser!