Voice recognition with Emacs

2024/01/10

... because Emacs is the best for quick hacks to integrate with basically anything. Originally from 2023/09/03, edited a bit now because... we shall start posting again!

Enter Whisper

Being able to just talk to computers and have them understand what you're saying is something that I've been thinking about for a long time. There used to be systems that did work somewhat, but they were Windows-only, expensive, and hard to integrate with anything.

At some point though, rather abruptly, OpenAI released Whisper, a basically open-source voice recognition model that happens to be remarkably good. Good to the point of... you stop thinking about how good it is since it never really presents an issue. You can run it locally if you have a reasonably beefy machine, or, alternatively, they also provide an online service API that is dirt cheap if you are not planning on talking to it all day.

On the other hand, the current user interface is not especially great. They don't even have a good command line tool that you could just throw an mp3 file at, let alone something that you could use interactively to save yourself from a bunch of typing.

Wouldn't it be great though to have a nice text editor we could integrate it into? One that is relatively easy to extend?

The Emacs way

What we would want is a button which, when pressed, starts recording your voice; when you press it again, it stops recording and inserts the transcribed text into the current buffer. (In case you were wondering: buffer is Emacs lingo for basically an open file.)

So what is the most hacky, simplistic way you can imagine making this happen? Let's just use ffmpeg to record some sound in the background when the button is pressed. When it's pressed again, we just interrupt ffmpeg and pick up the file it has produced.


(defun ssafar--launch-voice-ffmpeg ()
  (unless ssafar-openai-token
    (error "No API token available"))
  (let* ((proc (start-process-shell-command
                "voice ffmpeg" "*ffmpeg*"
                "ffmpeg -y -t 60 -f pulse -i default /home/simon/voice_temp.mp3")))
    (setq ssafar-voice-ffmpeg-process proc)
    (message "recording...")
    (set-process-sentinel proc 'ssafar--voice-ffmpeg-sentinel)))

We put up a 60 second upper limit for how long it can run. This is typically enough for dictating a couple of sentences, but prevents it from accidentally recording 3 hours and then sending it out to OpenAI for processing.
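The 60-second cap and the output path are baked into the command string above. As a rough sketch, the same string could be built from parameters so the limit lives in one place. The name `my-voice-ffmpeg-command` and its arguments are made up for illustration and are not part of the original setup; the ffmpeg flags are the same ones used above.

```elisp
(defun my-voice-ffmpeg-command (output-file max-seconds)
  "Return a shell command that records from PulseAudio into OUTPUT-FILE.
Recording stops on its own after MAX-SECONDS seconds.
Hypothetical helper; the flags mirror the hard-coded command above."
  (format "ffmpeg -y -t %d -f pulse -i default %s"
          max-seconds
          ;; Quote the path so spaces or shell metacharacters stay harmless.
          (shell-quote-argument (expand-file-name output-file))))
```

Calling it with "/home/simon/voice_temp.mp3" and 60 should give the same command line as the original.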

The above code launches the process in the background; it also saves the reference to the running process to "ssafar-voice-ffmpeg-process"... which is just a global variable, essentially.
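The snippets in this post refer to a handful of globals whose definitions are not shown. A minimal sketch of what the declarations could look like, assuming they all simply start out as nil (the docstrings are guesses based on how each variable is used):

```elisp
;; Assumed declarations; the original definitions are not part of the post.
(defvar ssafar-openai-token nil
  "OpenAI API token as a string, or nil if none is configured.")

(defvar ssafar-voice-ffmpeg-process nil
  "The running ffmpeg recording process, or nil when not recording.")

(defvar ssafar-last-voice-output nil
  "Parsed response of the most recent transcription request.")

(defvar ssafar-current-voice-callback nil
  "Function called with the transcribed text, or nil to just show it.")
```

Using `defvar` rather than `setq` means re-evaluating the file does not wipe a token or a running process reference that is already set.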

As for what the actual button does...


(defun ssafar-launch-or-commit-voice ()
  (interactive)
  (if (not ssafar-voice-ffmpeg-process)
      (ssafar--launch-voice-ffmpeg)
    (progn
      (interrupt-process ssafar-voice-ffmpeg-process)
      ;; technically, it might not be dead yet, but... eventually.
      )))

We just check whether our global variable for the ffmpeg process is set. If it hasn't been set yet, we launch a new process. If it is set, a recording is already running, so we interrupt it; the cleanup happens once the process actually exits.
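The check above trusts the variable alone, so a stale process object left behind by some earlier mishap would leave the button stuck in "stop" mode. A slightly more defensive test could ask Emacs whether the process is actually alive. This is only a sketch; `my-voice-recording-p` is a made-up name, while `process-live-p` is a standard Emacs function.

```elisp
(defun my-voice-recording-p (process)
  "Return t if PROCESS is a recording that is still running.
PROCESS may be nil or a dead process object; both count as not recording."
  ;; `process-live-p' returns nil for anything that is not a live process.
  (and (process-live-p process) t))
```

The button command could then branch on `(my-voice-recording-p ssafar-voice-ffmpeg-process)` instead of testing the variable directly.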

Background processes in Emacs can have so-called sentinels, which are functions that get called whenever the process's state changes: for example, when it finishes or gets killed. Here's ours.


(defun ssafar--voice-ffmpeg-sentinel (proc event)
  ;; reset the variable
  (setq ssafar-voice-ffmpeg-process nil)
  (cond
   ;; interrupt is ok... why 255 is there is a question, but... oh well
   ((memq (process-exit-status proc) '(2 255))
    (message "finished; calling whisper...")
    (ssafar-voice-call-whisper)) ;; calling the actual service if everything went well with ffmpeg
   (t (message "... something went wrong with the process"))))
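Two details of the sentinel above may be worth spelling out. As far as I can tell, ffmpeg traps the interrupt and exits with status 255 on its own, which would explain that code; and a recording that runs into the 60-second limit exits with 0, which the sentinel above reports as a failure. Below is a sketch of a small classifier that treats 0 as success too and ignores state changes that are not an exit. The name `my-voice-recording-outcome` and the decision to accept 0 are assumptions, not the original behaviour.

```elisp
(defun my-voice-recording-outcome (status exit-code)
  "Classify a recording from its process STATUS and EXIT-CODE.
STATUS is a symbol as returned by `process-status'; EXIT-CODE is what
`process-exit-status' returns.  Return `done', `failed', or nil when
the process has not ended yet."
  (cond
   ;; Anything other than `exit' or `signal' means it is still around.
   ((not (memq status '(exit signal))) nil)
   ;; 0: ran into the -t limit (assumed OK); 2 and 255: interrupted.
   ((memq exit-code '(0 2 255)) 'done)
   (t 'failed)))
```

Inside a sentinel this would be called as `(my-voice-recording-outcome (process-status proc) (process-exit-status proc))`, resetting the global only when the result is non-nil.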

As for the service call:


(defun ssafar-openai-transcription (audio-path)
  "Send an audio file to OpenAI for transcription.
AUDIO-PATH should be the full path to the audio file."
  (unless ssafar-openai-token
    (error "No API token available"))

  (let ((command (format "curl -s https://api.openai.com/v1/audio/transcriptions \
-H \"Authorization: Bearer %s\" \
-H \"Content-Type: multipart/form-data\" \
-F file=\"@%s\" \
-F model=\"whisper-1\""
                         ssafar-openai-token
                         audio-path)))
    (json-read-from-string
     (let ((default-directory "~")) ;; so that it doesn't try doing stuff with tramp
       (shell-command-to-string command)
       ))))

Given that Whisper wants a multipart POST request, we don't bother with Emacs's built-in HTTP client and just call out to curl directly.
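Since the command goes through a shell, a path or token containing quotes or spaces could break it. A sketch of building the same curl invocation with each argument quoted separately; `my-voice-curl-command` is a hypothetical name, and the explicit Content-Type header is left out here on the assumption that curl sets it by itself when `-F` is used.

```elisp
(defun my-voice-curl-command (token audio-path)
  "Return a shell command that posts AUDIO-PATH to the transcription API.
TOKEN is the API token.  Hypothetical variant of the command above."
  (mapconcat
   #'identity
   (list "curl -s https://api.openai.com/v1/audio/transcriptions"
         ;; Each user-supplied piece is quoted as one shell word.
         "-H" (shell-quote-argument (concat "Authorization: Bearer " token))
         "-F" (shell-quote-argument (concat "file=@" audio-path))
         "-F" "model=whisper-1")
   " "))
```

Skipping the shell altogether and passing the arguments as a list to `call-process` would sidestep the quoting question entirely, at the cost of collecting the output from a buffer.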

The actual code that calls this gives it the mp3 file that was recorded by ffmpeg, and calls a callback once done.


(defun ssafar-voice-call-whisper ()
  "Sends the current output file to whisper"
  (setq ssafar-last-voice-output
        (ssafar-openai-transcription
         "/home/simon/voice_temp.mp3"))
  (let ((the-text  (cdr (assoc 'text ssafar-last-voice-output))))
    (if ssafar-current-voice-callback
        (funcall ssafar-current-voice-callback the-text)
      ;; pass the text as an argument, so a stray "%" in it is not
      ;; treated as a format directive
      (message "%s" the-text))))
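The lookup above assumes the response always has a `text` field. When the request fails, the service sends back something else, so a nil transcript can end up in the callback. A sketch of a stricter extraction step follows; `my-voice-response-text` is a made-up name, and the shape of the error object (an `error` field holding a `message`) is an assumption about the API rather than something shown in this post.

```elisp
(require 'json)

(defun my-voice-response-text (response)
  "Return the transcript from parsed RESPONSE, or signal an error.
RESPONSE is an alist as returned by `json-read-from-string'."
  (let ((text (cdr (assq 'text response)))
        (err (cdr (assq 'error response))))
    (cond
     ((stringp text) text)
     ;; Assumed error shape: {\"error\": {\"message\": \"...\"}}.
     (err (error "Transcription failed: %s"
                 (or (and (listp err) (cdr (assq 'message err)))
                     "unknown error")))
     (t (error "Unexpected response from the transcription service")))))
```

For example, `(my-voice-response-text (json-read-from-string "{\"text\": \"hello\"}"))` gives back the string "hello".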

This callback was set up when we first launched the entire process, and it depends on whether we are in Emacs proper or some other X window. (As it happens, Emacs is also my window manager this time, but it would probably also work without, as long as we get some global hotkeys routed to it.) This way, if we're in an actual Emacs buffer, we inject the text; otherwise, we emulate keypresses that e.g. a browser can catch, so that all this works even if what we're looking at is another, non-Emacs app.


(defun ssafar--voice-callback-dwim (text)
  "Do something reasonable depending on the mode we're in"
  (cond
   ((derived-mode-p 'exwm-mode)
    ;; `start-process-shell-command' takes NAME, BUFFER and COMMAND;
    ;; a nil buffer discards the output
    (start-process-shell-command
     "xdotool" nil
     (format "xdotool type --clearmodifiers %s"
             (shell-quote-argument text))))
   (t (insert text)))
  (message "(whisper done)"))
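One thing the callback above leaves to chance is which buffer is current by the time the transcript arrives: the sentinel runs asynchronously, so `insert` goes wherever you happen to be at that moment. A sketch of a callback factory that remembers the buffer the recording was started from; both function names are hypothetical.

```elisp
(defun my-voice-insert-into (buffer text)
  "Insert TEXT at point in BUFFER, if BUFFER is still alive."
  (when (buffer-live-p buffer)
    (with-current-buffer buffer
      (insert text))))

(defun my-voice-make-insert-callback (buffer)
  "Return a one-argument callback that inserts its text into BUFFER."
  ;; `apply-partially' fixes the first argument, so no closure is needed.
  (apply-partially #'my-voice-insert-into buffer))
```

At launch time this could be stored with `(setq ssafar-current-voice-callback (my-voice-make-insert-callback (current-buffer)))`, so a transcript still lands in the right place after switching buffers.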

As a final touch, you can make all this happen when you hit a particular key.


(global-set-key (kbd "") 'ssafar-launch-or-commit-voice)

Conclusion

Is this the optimal way of doing this? Probably not. On the other hand, the total amount of code required to do this is really not a lot, and compared to just a plain shell script, this actually gives some feedback to you about which state the process is in. For example, you do get the little "recording..." notification message while it's running; I did find this fairly useful instead of trying to figure out whether I have pressed the button an odd or an even number of times.

There are plenty of opportunities to improve on this: for example, it currently cannot handle additional recorded speech while the previous chunk is still being processed. Also, ffmpeg apparently has a habit of cutting off the last second of audio when interrupted. (Might just be my fault for interrupting it in a stupid way, though.)

Nevertheless, as it turns out, having a voice recognition engine tightly integrated into your text editor is immensely useful. For example, this article was written mainly using this tool, making it way faster to finish. (Most humans are way better at talking than typing, even if they are pretty good at typing.)

... comments welcome, either in email or on the (eventual) Mastodon post on Fosstodon.