MP3 files can contain text, of course, and I’ve occasionally found lyrics stored inside the TEXT and USLT frames of the ID3 tag. But there’s no consistency at all, and there probably never will be; you’re more likely to find spam inside a TEXT frame.

Your idea for linking to time points is a cool notion, Lucas. Related to this, Real’s servers provide for a “start” parameter on a/v URIs, allowing one to jump to a time point, e.g.

http://play.rbn.com/?url=demnow/demnow/demand/2009/dec/audio/dn20091231.ra&proto=rtsp&start=00:28:56
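
To make the parameter concrete, here is a minimal Python sketch that pulls "start" out of a URL like the one above and converts it to a seconds offset. The function name is mine, and I'm assuming the value is always HH:MM:SS; I don't know what other forms Real's servers accept.

```python
from urllib.parse import urlsplit, parse_qs


def start_offset_seconds(url: str) -> int:
    """Return the 'start' query parameter (assumed HH:MM:SS) as whole seconds.

    Returns 0 when the URL carries no 'start' parameter.
    """
    query = parse_qs(urlsplit(url).query)
    value = query.get("start", ["00:00:00"])[0]
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


url = (
    "http://play.rbn.com/?url=demnow/demnow/demand/2009/dec/audio/"
    "dn20091231.ra&proto=rtsp&start=00:28:56"
)
print(start_offset_seconds(url))
```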

Some of the various SMIL specs provide begin and end params for the same purpose (http://is.gd/5I3jL). Aside from that and Real’s faded format, my hunch is that most a/v is not very content-addressable, partly because a given song can be found in the wild in many encoding variations. If I make in/out time points for lyrics on my rip of a CD track, your rip might not sync with them. Also, radio vs. album versions of a song may vary in duration and content.

Event-based synchronization, i.e. the beat-counting idea Piers brings up, might be worth looking into:

<a href="example.mp3#t=1017b,1683b" class="chorus">chorus</a>

This would need a filter to recognize beats and count them. That’s possible, just not as simple as using time. It might be more consistent across encodings than seconds-based offsets.
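
Here is a rough Python sketch of how a player might resolve such a link, assuming the beat detection has already been done. Everything in it is hypothetical: the "b" suffix is just the made-up syntax from the example link, the function names are mine, beats are counted from 1, and the two lists of beat timestamps stand in for whatever a real beat-detection filter would produce for two different rips of the same track.

```python
import re
from urllib.parse import urlsplit

# Hypothetical fragment syntax taken from the example link: t=<start>b,<end>b
BEAT_RANGE = re.compile(r"t=(\d+)b,(\d+)b")


def parse_beat_fragment(url):
    """Return (start_beat, end_beat) from a fragment like '#t=1017b,1683b'."""
    match = BEAT_RANGE.fullmatch(urlsplit(url).fragment)
    if match is None:
        raise ValueError(f"no beat range in fragment of {url!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start < 1 or end < start:
        raise ValueError(f"invalid beat range {start}..{end}")
    return start, end


def beats_to_seconds(beat_range, beat_times):
    """Map 1-based beat numbers to timestamps for one particular encoding."""
    start, end = beat_range
    if end > len(beat_times):
        raise ValueError("track has fewer beats than the link refers to")
    return beat_times[start - 1], beat_times[end - 1]


# Stand-in detector output: a steady 120 BPM track, where the second rip
# has an extra 0.25 s of silence at the front.
rip_a = [i * 0.5 for i in range(2000)]
rip_b = [0.25 + i * 0.5 for i in range(2000)]

beats = parse_beat_fragment("example.mp3#t=1017b,1683b")
print(beats_to_seconds(beats, rip_a))
print(beats_to_seconds(beats, rip_b))
```

The same link resolves to different second offsets in each rip but lands on the same beats, which is the whole appeal. The hard part, the detector itself, is left out here.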

Perhaps there’s another type of common event found in audio streams that could provide consistency, but I like drum beats because they’re less likely to get corrupted or folded than high frequencies, and less common than frequencies in the human voice range.

The karaoke industry seems to have cracked this nut, but I’m gonna hazard a guess that it’s all proprietary.

These guys sell player software that, they claim, syncs lyrics for 1 million songs (http://is.gd/5I48w). They appear to target music teachers in their marketing.