You can always use WebVTT with @kind=metadata to deliver the music notation in timed chunks (“cues”) alongside the audio file, and then use JavaScript to interpret the result. WebVTT cues are text, so if you wanted to put binary blobs into them, you would have to encode the blobs as text (base64, say), and your JavaScript would need to know how to present them and how to switch between them. If it’s just PNGs, for example, then you could use data URLs.
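
To make that concrete, here is a rough sketch of the JavaScript side. It assumes each metadata cue carries a small JSON object with a MIME type and base64-encoded image bytes. That payload layout, and the names `cueToDataUrl` and `attachNotation`, are made up for illustration; WebVTT itself does not prescribe what goes inside a metadata cue.

```javascript
// Assumed (hypothetical) cue payload: the cue text is a JSON object such as
// {"type": "image/png", "data": "<base64 bytes>"}.
function cueToDataUrl(cueText) {
  const payload = JSON.parse(cueText);
  return `data:${payload.type};base64,${payload.data}`;
}

// Show whichever cue is currently active in the given <img> element.
function attachNotation(audio, img) {
  // Pick the first metadata track attached to the media element.
  const track = Array.from(audio.textTracks).find((t) => t.kind === "metadata");
  if (!track) return;
  // "hidden" keeps the cues loading and the events firing without rendering.
  track.mode = "hidden";
  track.addEventListener("cuechange", () => {
    const cue = track.activeCues[0];
    if (cue) img.src = cueToDataUrl(cue.text);
  });
}
```

You would call `attachNotation` with an audio element that has a `<track kind="metadata">` child and an `<img>` to draw into. Switching between images then comes for free from the cue timings, since `cuechange` fires whenever the set of active cues changes.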