Comment on “jwheare’s web of music and the Media URI spec” by Kevin Prichard:
MP3 files can contain text, of course, and I've occasionally found lyrics stored inside TEXT and USLT frames. But there's no consistency at all, and there probably never will be – you're more likely to find spam inside a TEXT frame.
Your idea for linking to time points is a cool notion, Lucas. Related to this, Real’s servers provide for a “start” parameter on a/v URIs, allowing one to jump to a time point, e.g.
http://play.rbn.com/?url=demnow/demnow/demand/2009/dec/audio/dn20091231.ra&proto=rtsp&start=00:28:56
Some of the various SMIL specs provide begin and end params for the same purpose (http://is.gd/5I3jL). Aside from that and Real's faded format, my hunch is that most a/v is not very content-addressable, partly because a given song can be found in the wild in many encoding variations. If I make in/out time points for lyrics on my rip of a CD track, your rip might not sync with them. Also, radio and album versions of a song may vary in duration and content.
Event-based synchronization, i.e. the beat-counting idea Piers brings up, might be worth looking into:
<a href="example.mp3#t=1017b,1683b" class="chorus">chorus</a>
This would need a filter to recognize beats and count them. Possible, just not as simple as raw time offsets, and it might be more consistent than seconds-based addressing.
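Roughly, the beat-counting filter could be something like this naive energy-based sketch (the 44.1kHz sample rate, window size and 1.3x threshold are just assumptions, and real beat tracking is a much deeper subject):

/* Naive energy-based beat counter – a sketch only. Assumes mono float
 * samples at 44.1kHz; real beat tracking (tempo estimation, onset
 * detection) is a lot more involved than this. */
#include <stddef.h>

#define WIN 1024      /* ~23ms analysis window at 44.1kHz */
#define HISTORY 43    /* ~1 second's worth of window energies */

/* Return the sample offset where the nth beat fires, or -1 if not found. */
long find_nth_beat(const float *pcm, size_t nsamples, int target_beat)
{
    float history[HISTORY] = {0};
    int filled = 0, pos = 0, beats = 0;

    for (size_t off = 0; off + WIN <= nsamples; off += WIN) {
        /* instant energy of this window */
        float e = 0.0f;
        for (size_t i = 0; i < WIN; i++)
            e += pcm[off + i] * pcm[off + i];

        if (filled == HISTORY) {
            /* average energy over the last ~second */
            float avg = 0.0f;
            for (int i = 0; i < HISTORY; i++)
                avg += history[i];
            avg /= HISTORY;

            /* crude beat test: a sudden jump above the recent average */
            if (e > 1.3f * avg && ++beats == target_beat)
                return (long)off;
        }

        history[pos] = e;
        pos = (pos + 1) % HISTORY;
        if (filled < HISTORY) filled++;
    }
    return -1;
}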
Perhaps there’s another type of common event found in audio streams that could provide consistency, but I like drum beats because they’re less likely to get corrupted or folded than high frequencies, and less common than human voice-range freqs.
The karaoke industry seems to have cracked this nut, but I’m gonna hazard a guess that it’s all proprietary.
These guys sell player software that syncs lyrics for a million songs, they claim:
http://is.gd/5I48w
They appear to target music teachers in their marketing.
When you think about it, a media player component could auto-magically beat-sync two tracks by comparing their basic structure and determining BPM. Word documents used to be the bane of the structured-data movement because they trapped content in an unstructured format, but ODF and OOXML have changed that game completely, creating a new class of semi-structured data; so why not music or video?
It's fascinating to consider that if more artists released works under CC BY-NC, remix artists could provide additional value by micro-tagging individual samples within the deeper structure of their compositions – particularly if this functionality were baked into the software used to assemble the composition.
It seems to me that the magic is to find a way to refer to a song and to its semantic elements, both of which are abstract things that only loosely map to specific bytes in a specific file.
Simplicity and pragmatism tend to win out; whatever demands the least end-user effort wins.
To recap some of this: the binding of lyrics and other metadata to timed points in a song is complicated by encoding and editing changes to a track. There is no one-dimensional identifier that isn't subject to morphing when a track is ripped, re-encoded or otherwise messed about with.
Discarding timecode as a bind-point, the next thing that comes to mind is the fingerprinting of concurrent frequencies within a frame or time period. Frequency analysis is something that players already do, for equalization and visualization and other things.
Say there was a way to abstract the frequencies in a given frame or snippet of audio, such that we could reduce a 0.1 second sound slice to a few numbers:
[881.2hz, 220hz, 4150hz, 338hz]
Adding in the amplitude of each frequency as a percent of the source medium’s dynamic range:
[881.2hz@25%, 220hz@40%, 4150hz@50%, 338hz@76%]
Would that be enough information to provide a unique location fingerprint for a point in a song? Only experimentation could tell just how unique these characterizations are. I know of libraries and applications that provide frequency analysis methods (e.g. Squeak).
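To make that concrete, reducing a slice to its few loudest frequencies might look like this sketch (it assumes a magnitude spectrum has already been computed for the slice, and that bin i corresponds to i * rate / fft_size Hz):

/* Reduce one FFT magnitude spectrum to its four loudest frequencies as
 * freq@amp pairs. Sketch only: assumes the magnitudes are already computed,
 * that bin i corresponds to i * rate / fft_size Hz, and nbins >= NPEAKS. */
#define NPEAKS 4

typedef struct { double hz; double pct; } FreqAmp;

void top_peaks(const double *mag, int nbins, double rate, int fft_size,
               FreqAmp out[NPEAKS])
{
    int used[NPEAKS];

    /* global maximum, so amplitudes can be expressed as % of the loudest bin */
    double max_mag = 1e-12;
    for (int i = 0; i < nbins; i++)
        if (mag[i] > max_mag) max_mag = mag[i];

    for (int p = 0; p < NPEAKS; p++) {
        int best = -1;
        for (int i = 0; i < nbins; i++) {
            int taken = 0;
            for (int q = 0; q < p; q++)
                if (used[q] == i) taken = 1;
            if (!taken && (best < 0 || mag[i] > mag[best]))
                best = i;
        }
        used[p] = best;
        out[p].hz  = best * rate / fft_size;
        out[p].pct = 100.0 * mag[best] / max_mag;
    }
}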
Say that frequency spans 0Hz-16kHz, a range that fits nicely into 14 bits. Amplitude, expressed in decibels or % of max, another 6 bits. So, twenty bits – two and a half bytes – to describe a given frequency sample.
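The packing itself is trivial; here's a sketch (rounding to whole Hz and scaling the amplitude percentage to a 0-63 range are my own choices):

#include <stdint.h>

/* Pack one freq@amp sample into 20 bits: 14 bits of frequency (0-16383Hz)
 * and 6 bits of amplitude. Rounding to whole Hz and scaling % to 0-63
 * are arbitrary choices. */
uint32_t pack_sample(unsigned freq_hz, unsigned amp_pct)
{
    if (freq_hz > 16383) freq_hz = 16383;
    if (amp_pct > 100)   amp_pct = 100;
    uint32_t amp6 = (amp_pct * 63) / 100;   /* 0-100% -> 0-63 */
    return ((uint32_t)freq_hz << 6) | amp6; /* 20 of 32 bits used */
}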
With a pipeline of frequency@amplitude sets, a plugin would need lyrics to be indexed by unique set, so it would make sense to sort the freq@amp values in each set coming down the pipe:
[220hz, 338hz, 881hz, 4150hz]
[105hz, 240hz, 881hz, 2300hz, 4150hz]
[95hz, 262hz, 881hz, 2300hz, 4150hz]
…
The first thing to do is experiment, reduce test tracks to streams of freq@amp sets, and evaluate whether uniqueness exists.
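A cheap way to run that experiment might be to canonicalize each set by sorting it, hash it, and count collisions across a test track. A sketch (the packed 20-bit samples and the FNV-1a hash are arbitrary choices):

/* Canonicalize a freq@amp set by sorting it, then hash it, so uniqueness
 * can be tested by counting collisions across a whole track. The packed
 * samples and the FNV-1a hash are arbitrary choices. */
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

uint64_t fingerprint_set(uint32_t *packed, size_t n)
{
    qsort(packed, n, sizeof *packed, cmp_u32);   /* canonical order */

    uint64_t h = 14695981039346656037ULL;        /* FNV-1a, 64-bit */
    for (size_t i = 0; i < n; i++) {
        const unsigned char *p = (const unsigned char *)&packed[i];
        for (size_t b = 0; b < sizeof packed[i]; b++) {
            h ^= p[b];
            h *= 1099511628211ULL;
        }
    }
    return h;
}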
Just some ideas… Hey, I know a developer who toys with digital audio for a living, he introduced me to Squeak. Gonna have a word with that boy and see what he knows.
I just had a look at http://en.wikipedia.org/wiki/Acoustic_fingerprint
It seems that existing fingerprinting methods split between identifying an audio source as a whole and identifying just a fragment of it. Gonna look deeper into this…
My friend pointed me to openframeworks; he says it has FFT capability. There's some discussion on FFT usage here:
http://www.openframeworks.cc/forum/viewtopic.php?f=10&t=2184&view=unread
Quite possibly worth a quick hack, just to know if there’s anything to this!
Update: I’ve built a rough little C++ prog that generates the frequency sets using the OF FFT method. More work to go before I’ll know if there’s any unique aspect to this data.
Maybe this is an all-around silly idea that those with more digital audio experience would get right away, but I do enjoy experimenting.
Sorry to be so slow to respond, Kev.
I love the idea of identifying locations in a song via a frequency table mapped to some window.
Though I think the endpoint of this line of development is to search for an arbitrary acoustic fingerprint within a song and treat wherever it's found as the timecode.
The concept would be: here's an acoustic fingerprint for a five-second snippet, including patterns in frequency and amplitude and anything else you can find.
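As a sketch of what that search could look like (assuming the fingerprint is just one feature vector per frame and that plain Euclidean distance is good enough, both simplifications on my part), slide the snippet over the song and keep the offset with the smallest distance:

/* Slide a short clip's per-frame features over a whole song's features and
 * return the frame offset with the smallest total squared distance.
 * Sketch only: "one float vector per frame" and Euclidean distance are
 * simplifying assumptions. */
#include <stddef.h>
#include <float.h>

long best_match(const float *song, size_t song_frames,
                const float *clip, size_t clip_frames, size_t dims)
{
    if (clip_frames > song_frames) return -1;

    long best = -1;
    double best_dist = DBL_MAX;

    for (size_t off = 0; off + clip_frames <= song_frames; off++) {
        double d = 0.0;
        for (size_t f = 0; f < clip_frames && d < best_dist; f++)
            for (size_t k = 0; k < dims; k++) {
                double diff = song[(off + f) * dims + k] - clip[f * dims + k];
                d += diff * diff;
            }
        if (d < best_dist) { best_dist = d; best = (long)off; }
    }
    return best;   /* convert frame offset to a timecode elsewhere */
}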
One way to identify frequency and amplitude patterns is to treat the acoustic data as if it were a face and use face recognition algorithms, e.g. principal components analysis.
A simpler method is to measure variance in some dimension over a fixed time window. So say that the identifier for a time segment is the difference between the highest and lowest amplitude.
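That identifier is only a few lines of code; a sketch, with the window length left as a free parameter:

/* Identifier for a time segment: the spread between the loudest and
 * quietest sample in a fixed window. The window length is a free choice. */
#include <stddef.h>

float amplitude_range(const float *pcm, size_t window)
{
    float lo = pcm[0], hi = pcm[0];
    for (size_t i = 1; i < window; i++) {
        if (pcm[i] < lo) lo = pcm[i];
        if (pcm[i] > hi) hi = pcm[i];
    }
    return hi - lo;
}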
It’s fun hacking. Makes me wish, yet again, that I did comp sci grad school.
p.s. caution: this area of work is heavily patented.
Good info, Luc. I keep hitting patent pages while searching around this area; it's a bit of a minefield. I don't suppose making it a FOSS offering would change matters, hm?
I switched to fftw – it’s a tiny library compared with openframeworks, with super-fast execution, plus it handles n-dimensional transforms.
The idea of jumping up an abstraction level to the methods used in facial recognition – intriguing. I’ll hafta look into that.
“…the identifier for a time segment is the difference between the highest and lowest amplitude.”
Another interesting idea, I’ll add it to my test case list. First, I gotta get my code working under fftw.
Another cool thing about fftw: it does runtime precompilation of an analysis – the initial compilation takes hundreds of milliseconds (depending on CPU), and the actual FFT transforms on live samples take microseconds after that.
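That precompilation step is what FFTW calls building a plan; here's a minimal sketch of the plan-once, execute-many pattern (the 4096-point transform size is just an example):

/* FFTW's "precompilation" is plan creation: build the plan once (slow),
 * then call fftw_execute() on fresh samples as often as you like (fast).
 * The 4096-point real-to-complex transform is just an example.
 * Compile with -lfftw3 -lm. */
#include <fftw3.h>
#include <math.h>

#define N 4096

int main(void)
{
    double       *in  = fftw_malloc(sizeof(double) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (N / 2 + 1));

    /* the slow step: FFTW times candidate strategies and picks one */
    fftw_plan plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE);

    /* the fast step, repeated per window: refill `in` with the next N
     * samples, then execute the prebuilt plan */
    for (int i = 0; i < N; i++)
        in[i] = 0.0;                 /* stand-in for real audio samples */
    fftw_execute(plan);

    /* magnitude of bin k is sqrt(re^2 + im^2) */
    double mag1 = sqrt(out[1][0] * out[1][0] + out[1][1] * out[1][1]);
    (void)mag1;

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}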
I’m gonna keep chipping away at this til I got some answers… just curious.
Took me a while to get to it, but I've written a C app that decodes an MP3 file into a stream of CD audio-type samples and delivers them to an FFT preprocessor. Working on the FFT in and out part now; will post back when I finish that bit.
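Assuming a decoder library like mpg123 purely for illustration, the decode loop might look roughly like this, with each chunk of PCM handed to the FFT preprocessor:

/* One way the MP3 -> PCM step might look, assuming the mpg123 decoder
 * library (an assumption for illustration only). Each chunk of decoded
 * samples would be handed to the FFT preprocessor.
 * Compile with -lmpg123. */
#include <mpg123.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    mpg123_init();
    int err;
    mpg123_handle *mh = mpg123_new(NULL, &err);
    if (mpg123_open(mh, argv[1]) != MPG123_OK) return 1;

    long rate; int channels, encoding;
    mpg123_getformat(mh, &rate, &channels, &encoding);

    size_t bufsize = mpg123_outblock(mh);
    unsigned char *buf = malloc(bufsize);
    size_t done;

    while (mpg123_read(mh, buf, bufsize, &done) == MPG123_OK) {
        /* buf now holds `done` bytes of PCM – hand them to the FFT
         * preprocessor here */
    }

    free(buf);
    mpg123_close(mh);
    mpg123_delete(mh);
    mpg123_exit();
    return 0;
}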