Simplicity and pragmatism tend to win out: the approach demanding the least end-user effort wins.
To recap: binding lyrics and other metadata to timed points in a song is complicated by encoding and editing changes to a track. There is no one-dimensional identifier that isn’t subject to morphing when a track is ripped, re-encoded, or otherwise messed about with.
Discarding timecode as a bind-point, the next thing that comes to mind is the fingerprinting of concurrent frequencies within a frame or time period. Frequency analysis is something that players already do, for equalization and visualization and other things.
Say there was a way to abstract the frequencies in a given frame or snippet of audio, such that we could reduce a 0.1 second sound slice to a few numbers:
[881.2hz, 220hz, 4150hz, 338hz]
Adding in the amplitude of each frequency as a percent of the source medium’s dynamic range:
[881.2hz@25%, 220hz@40%, 4150hz@50%, 338hz@76%]
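A minimal sketch of that reduction, assuming nothing about any particular player’s internals: take an FFT of a 0.1-second slice, keep the strongest bins, and report each as a frequency plus an amplitude relative to the loudest bin (standing in for “% of the medium’s dynamic range”). The function name and the peak-picking strategy are mine, purely for illustration:

```python
import numpy as np

def fingerprint_slice(samples, sample_rate, n_peaks=4):
    """Reduce a short audio slice to its dominant (frequency, amplitude%) pairs.

    Amplitude is reported as a percentage of the strongest bin in the slice.
    """
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[0] = 0.0                       # ignore the DC component
    top = np.argsort(spectrum)[-n_peaks:]   # indices of the strongest bins
    peak = spectrum[top].max()
    # Report the peaks in ascending frequency order.
    return [(round(float(freqs[i]), 1), round(100.0 * spectrum[i] / peak))
            for i in sorted(top, key=lambda i: freqs[i])]

# A synthetic 0.1 s slice: a loud 220 Hz tone plus a quieter 881 Hz tone.
rate = 44100
t = np.arange(int(0.1 * rate)) / rate
slice_ = 1.0 * np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 881 * t)
print(fingerprint_slice(slice_, rate, n_peaks=2))
```

Real material would need windowing and smarter peak picking (adjacent FFT bins of one loud tone can crowd out a genuine second peak), but this is enough to start generating candidate fingerprints.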
Would that be enough information to provide a unique location fingerprint for a point in a song? Only experimentation could tell just how unique these characterizations are. I know of libraries and applications which provide frequency analysis methods (e.g. Squeak.)
Say that frequency spans 0 Hz–16 kHz, a range that fits nicely into 14 bits (16,384 values at 1 Hz resolution). Amplitude, expressed in decibels or % of max, another 6 bits. That’s 20 bits, two and a half bytes, to describe a given frequency sample.
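To make the byte math concrete, here’s a sketch of packing one freq@amp sample into those 20 bits. The field layout and the percent-to-6-bit quantization are my own choices, not any established format:

```python
def pack_sample(freq_hz, amp_pct):
    """Pack one sample: 14 bits of frequency (0-16383 Hz, 1 Hz steps)
    plus 6 bits of amplitude (percent quantized to 64 levels)."""
    level = round(amp_pct * 63 / 100)      # map 0-100% onto 0-63
    return (int(freq_hz) << 6) | level     # 20 bits total

def unpack_sample(packed):
    freq_hz = packed >> 6
    amp_pct = round((packed & 0x3F) * 100 / 63)  # back to approximate percent
    return freq_hz, amp_pct

packed = pack_sample(4150, 76)
print(f"{packed:#x}, {packed.bit_length()} bits")
print(unpack_sample(packed))
```

The quantization means the amplitude round-trips to within about a percent and a half, which is almost certainly finer than the fingerprint comparison would ever need.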
With a pipeline of frequency@amplitude sets, a plugin would need lyrics to be indexed by unique set, so it would make sense to sort the freq@amp values in each set coming down the pipe:
[220hz, 338hz, 881hz, 4150hz]
[105hz, 240hz, 881hz, 2300hz, 4150hz]
[95hz, 262hz, 881hz, 2300hz, 4150hz]
…
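Sorting buys a canonical form, so a set can serve directly as a lookup key. A toy sketch of that index (the helper name and lyric text are placeholders):

```python
def make_key(freq_amp_pairs):
    """Canonicalize a fingerprint set by sorting on frequency, so the
    same slice always yields the same key regardless of arrival order."""
    return tuple(sorted(freq_amp_pairs))

# Hypothetical index: fingerprint key -> lyric line at that point in the song.
lyric_index = {
    make_key([(881, 25), (220, 40), (4150, 50), (338, 76)]): "some lyric line",
}

incoming = [(4150, 50), (338, 76), (220, 40), (881, 25)]  # order off the pipe
print(lyric_index.get(make_key(incoming)))  # matches despite the ordering
```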
The first thing to do is experiment: reduce test tracks to streams of freq@amp sets, and evaluate whether uniqueness holds in practice.
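One way to score that experiment, sketched with a toy stream: count how many fingerprint sets occur exactly once across the track. The uniqueness metric here is my own simple choice, not a standard measure:

```python
from collections import Counter

def uniqueness(stream):
    """Fraction of slices whose fingerprint set occurs exactly once,
    i.e. slices that would locate an unambiguous point in the song."""
    counts = Counter(stream)
    singles = sum(c for c in counts.values() if c == 1)
    return singles / len(stream)

# Toy stream of sorted fingerprint sets; the first and last collide.
stream = [
    ((220, 40), (881, 25)),
    ((105, 30), (240, 60), (2300, 55)),
    ((220, 40), (881, 25)),
]
print(uniqueness(stream))
```

A real evaluation would also have to tolerate near-misses, since re-encoding will nudge frequencies and amplitudes rather than reproduce them exactly.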
Just some ideas… Hey, I know a developer who toys with digital audio for a living; he introduced me to Squeak. Gonna have a word with that boy and see what he knows.