Simon: Open-Source Speech Recognition: 2012

Sunday, December 30, 2012

Simon 0.4.0

After years of hard work, the Simon team is proud to announce the new major release: Simon 0.4.0.

New in Simon 0.4

This new version of the open source speech recognition system Simon features a whole new recognition layer, context-awareness for improved accuracy and performance, a dialog system able to hold whole conversations with the user and more.

Revisiting Usability

A lot of work has gone into making Simon easier to use - both for existing and new users.

Perhaps most visibly, the main window of Simon has been reorganized to bring the most important options together in one screen.

Simon 0.4.0: Main window

Moreover, the newly introduced Simon base model format (.sbm) and the integration of a GHNS online repository of base models have removed the last big hurdle of the initial configuration.
One can now easily go from a fresh installation to a working setup in less than 5 minutes without any preparation. Don't believe me? Check out the quick start below!

Simon 0.4.0: Quick Start

Many other, smaller changes sum up to one simple but important difference: Simon will overall require less user interaction while achieving more.

SPHINX

One of the major internal changes of Simon 0.4 is of course the included support for the BSD licensed CMU SPHINX. While we still also maintain full support for HTK and Julius, new models compiled with Simon will default to the SPHINX backend and the (proprietary) HTK is no longer required to build user-generated models.
Best of all: Simon will select the correct backend for your configuration transparently and automatically.

Voxforge

A major problem of open source speech recognition has always been the lack of freely available high quality speech models.

The Voxforge project has been working for years towards GPL acoustic models for a variety of languages. While their models are certainly not yet perfect, they offer a promising starting point.
The English Voxforge model is of course available as a Simon base model and can be downloaded and imported with Simon.

Additionally, starting with Simon 0.4, users will also have the option to contribute their gathered Simon training samples directly to the Voxforge server.
These recordings will then be used to train and improve the general acoustic models.

Simon 0.4.0: Training

By the way: Behind the scenes this upload is based on SSC.

Context

There is a simple rule of thumb in speech recognition: The smaller the application domain, the better the recognition accuracy. This was always one of the core principles of Simon.

In Simon 0.4, however, we went one step further: Simon can now re-configure itself on-the-fly as the current situation changes. Through so called "context conditions" Simon 0.4 can automatically activate and deactivate selected scenarios, microphones and even parts of your training corpus.

For example: Why listen for "Close tab" when your browser isn't even open? Or why listen for anything at all when you're actually in the next room listening to music? Yes, Simon is watching you.

Simon 0.4.0: Context awareness

Dialog System

Simon 0.4.0 also ships with the new dialog system featuring scripted variables (JavaScript), integration with Plasma data engines, a templating system and - of course - text-to-speech output.

Simonoid

For users of KDE's Plasma workspace, we now provide the "Simonoid" plasmoid to start and monitor Simon - including the current recording volume.

Simonoid

The screenshot above shows two instances of the plasmoid: One added to the panel and another one to the desktop.

... and everything else

Please don't be fooled into thinking that the above is a complete list of all improvements. For example, we also have a new sample review tool called Afaras, integration with the Sequitur grapheme to phoneme framework, an Akonadi command plugin and many, many other noteworthy changes.
You'll have to try out Simon to see for yourself!

Download

To install Simon 0.4.0, you can either compile the official source tarball, install a binary package provided by your Linux distribution or use the installer for Windows.

Microsoft Windows: Installer

Source Code

If you are a packager and would like to package Simon 0.4, please do get in touch with us. Thank you.

Thursday, December 20, 2012

Simon 0.4.0: RC1

As of right now, the first release candidate of Simon 0.4.0 is available:

As can be expected, a lot of bugs have been fixed since the last beta.

However, that's not all that changed: This release candidate also comes with complete handbooks for all Simon applications. Next to documenting the plentiful new features, I also completely restructured the Simon handbook to hopefully provide a better starting point for new users.

Moreover, the Windows installer has been vastly improved and now actually ships a fully fledged Simon version with Julius and SPHINX support. This means that you can use any Simon base model right from the start and even build your own speech model from scratch without installing any additional software.

This brings me neatly to the call for packagers: If you want to help package Simon, please get in touch with me.

Sunday, December 9, 2012

Simon 0.4: Beta 2

Yes, the second beta of Simon 0.4.0 is here!

Most of the changes are quite minor but the SPHINX backend received some much needed love and Simon now builds again under Windows.

A Windows installer of the beta version is currently being built and will be provided shortly.

Update: The Windows installer is now available.

Wednesday, December 5, 2012

Simon 0.4: Beta 2 Update

The second beta of Simon was scheduled for today, but since the last couple of weeks were extremely busy for me and other core contributors to Simon, we simply don't have enough changes to justify another release right now.

Therefore, I have decided to move the release of the second beta to Sunday, December 9th.
Everything else, most importantly the release date of 0.4.0, stays the same.

Please do keep reporting bugs in the meantime - I am sure there are lots of them and yet we have only received two reports since the first beta. Come on, guys, you can do better than that!

Thursday, November 15, 2012

Simon 0.4.0: Beta 1

It is November 15 and therefore time for our first beta release of Simon 0.4 (tagged as Simon 0.3.90).

Download the source tarball here: Simon 0.4.0: Beta 1

Please note that this release is purely for testing purposes and not meant to replace Simon 0.3.0 on production systems yet.

Don't forget to report bugs!

Tuesday, November 13, 2012

Simon in Brazil

Amidst all the release preparation it can be helpful to reflect why we're actually working on all of this.

Because of that, I want to share an email about a project of the Federal University of Pará, Brazil that I received just a couple of days ago:

The project began in January, 2012 with the goal of deploying SimonBR in public schools in Castanhal, a city located in the state of Pará, Brazil.
The idea is to enable students with physical disabilities to use the computer to perform basic tasks such as listening to music, view photos, access the Internet, email, etc.

And here is a demonstration video of what they have built:

(http://www.youtube.com/watch?v=FlfuACZu-Gc)

Please help us test the next version of Simon and report any bugs that you find on our bug tracker to make sure that projects like that will become more common in the future!

Friday, November 2, 2012

Winter is here

Yesterday was the official feature freeze of Simon 0.4 which means that after years of summer, winter has finally descended on trunk.

As is par for the course, a lot of last minute cramming ensued but I'm happy to report that all the major planned features have made it in. I'll try to blog about some of the most important ones a little bit leading up to the release.

The feature freeze also means that we are now concentrating on fixing bugs. And we need your help!
With software as complex as Simon, there are a lot of corner cases that are hard to find by us developers alone. Compiling Simon is easy, though, so please give it a go and report any bugs that you ~~might~~ will find to make Simon 0.4 as stable as it should be.

Finally, today is also our string freeze. While we're still busy hacking away at all those messages at the moment, starting at midnight, there will be no new strings barring any unforeseen circumstances.
If you can help translate Simon into another language and need help to get started, just get in touch with us. We really appreciate it!

Update: Unforeseen circumstances :). We'll sadly have to break the string freeze one last time as a rather large change set (renaming simon -> Simon, etc.) was not ready in time. It's being committed right now, though. Sorry for the inconvenience.

Friday, October 12, 2012

Simon 0.4: Start the Countdown

Exactly 25 months ago, the last stable release of Simon, 0.3.0, was released to the public.

Since then, more than 1100 changes were committed introducing tons of new features, fixing countless bugs and completely revamping the user interface.

In fact, more than 40 % of all commits to the Simon code base have happened after the release of 0.3.0.

Today, I want to finally announce the release schedule for Simon 0.4.0:

Date	Event
2012-11-1	Feature Freeze
2012-11-5	Soft Message Freeze
2012-11-15	Tag and Release: Simon 0.4.0 Beta 1
2012-12-5	Tag and Release: Simon 0.4.0 Beta 2
2012-12-20	Hard Message Freeze; Documentation Freeze
2012-12-21	Tag and Release: Simon 0.4.0 Release Candidate
2012-12-28	Tag: Simon 0.4.0 Final
2012-12-30	Release: Simon 0.4.0 Final

This will be the first release after becoming an official KDE project - let's make it a great one!

Sunday, September 30, 2012

Simon at Randa 2012

As some of you might already know, I spent last week at KDE's annual Randa Meetings in the beautiful city of Randa, Switzerland.
It was my first sprint and I was genuinely surprised at how unbelievably productive it really was. It's amazing what a couple of committed developers can achieve in just a couple of days.
The awesome food and plentiful supply of Swiss chocolate don't hurt, of course.

This sprint was the first time I got to meet José Millán Soto and Amandeep Singh, who are working on AT-SPI with the help of Frederik Gladhorn, who was also there, and Sebastian Sauer, who sadly could not make it. Alejandro Piñeiro from the GNOME Accessibility team also joined the hackfest to provide valuable insights into the GNOME a11y stack, and Yash Shah, one of Simon's GSoC students this year, flew in once more to work on the computer vision aspects of Simon's context layer.

Watch the Planet(s) for updates about all the great work those guys have been doing.

DBus Context Conditions

To warm up, on the first day I tackled a couple of items that were on my todo list for some time already. One of these tasks was to finish the implementation of the DBus context condition plugin.

Through the DBus condition, the Simon context layer can more accurately reflect the current state of other applications.

The main benefit of this feature is that developers of software specifically meant to be voice controlled can dynamically configure Simon to their software's needs. When they expose the state of their system over DBus, Simon can react by activating and deactivating commands, vocabulary, grammar, microphones or sample groups as needed.

However, not only custom-written solutions gain benefit from DBus conditions: The screenshot above, for example, configures Simon to deactivate itself while VLC is playing something. Ever wanted to disable Simon automatically while listening to music? Now you can.
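As a rough illustration of the idea (not Simon's actual plugin code), a DBus context condition boils down to reading a value that another application exposes and comparing it with an expected value. The Python sketch below replaces the real DBus call with a hypothetical `read_property` callback so that it runs without a session bus. The class name, the callback and the MPRIS-style service and property names used for VLC are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a real DBus property lookup:
# (service, object path, interface, property) -> current value
PropertyReader = Callable[[str, str, str, str], str]


@dataclass(frozen=True)
class DBusCondition:
    service: str
    path: str
    interface: str
    prop: str
    expected: str
    inverted: bool = False  # True: satisfied while the value does NOT match

    def is_satisfied(self, read_property: PropertyReader) -> bool:
        value: Optional[str]
        try:
            value = read_property(self.service, self.path, self.interface, self.prop)
        except LookupError:
            # Assumption: an application that is not running never matches.
            value = None
        matches = value == self.expected
        return not matches if self.inverted else matches


# "Keep Simon active only while VLC is NOT playing" (names are assumptions).
not_playing = DBusCondition(
    service="org.mpris.MediaPlayer2.vlc",
    path="/org/mpris/MediaPlayer2",
    interface="org.mpris.MediaPlayer2.Player",
    prop="PlaybackStatus",
    expected="Playing",
    inverted=True,
)


def fake_bus(state: Optional[str]) -> PropertyReader:
    """Test double: pretends VLC reports `state`, or is not running if None."""
    def read(service: str, path: str, interface: str, prop: str) -> str:
        if state is None:
            raise LookupError(service)
        return state
    return read


print(not_playing.is_satisfied(fake_bus("Playing")))  # False: deactivate Simon
print(not_playing.is_satisfied(fake_bus("Paused")))   # True: Simon may listen
```

Keeping the bus access behind a callback is only a convenience for the sketch. A real condition would query the bus and re-evaluate whenever the watched value changes.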

AT-SPI

The big topic of the week was of course AT-SPI: Through AT-SPI, assistive technologies like screen readers can "see" running applications, follow the focus and react to changes. Traditionally, KDE 4 provided no real support for this and was therefore largely inaccessible for the large group of users that rely on such technology.

In recent years, however, there has been a lot of work to complete the AT-SPI support in Qt and KDE and thanks to the relentless work of people like Frederik Gladhorn, the situation is already much improved. Screen readers are starting to work with KDE software to some extent and overall the AT-SPI framework (qt-atspi) is becoming more complete and stable every day.

In Simon, AT-SPI can be used to automatically parse and control applications without any prior configuration by the end-user. A prototype was already implemented last summer.

While writing this plugin, I used the AT-SPI bus directly and noticed significant differences between Qt and GTK in the way they represented widgets in AT-SPI. The plugin therefore needed a lot of code just to maintain the internal view of the focused application - a problem that is shared with other a11y clients as well.
With the introduction of QAccessibilityClient, a new client library to aid developers of AT-SPI clients (assistive software), this simply didn't make sense anymore and a rewrite was in order.

Actions

Next to exposing information about widgets, AT-SPI also provides a way to interact with them through Actions. In Simon these will be associated with saying the name (e.g. text of a button) of the widget in question.

Because Simon's AT-SPI plugin is the first real beneficiary of this technology, many popular widgets don't yet expose proper actions - but Amandeep and José are fixing those problems left and right.

Selecting one of two available actions for an activated tab

At the Randa sprint, we also had a very productive meeting to discuss broader issues like the handling of default actions and custom actions at a toolkit level.

Performance

The AT-SPI plugin parses the currently focused window, builds vocabulary and grammar and then triggers the synchronization to build a new, active model to reflect the changes. The problem is that, in practice, this might happen every other second because the user opens context menus or dialogs, button texts change, etc.

This poses major performance problems - especially because users don't want to wait a couple of seconds after opening a popup menu to say the next command.

While many comparatively simple performance improvements over the old prototype were implemented - like moving the AT-SPI watcher to a separate thread - some changes are not limited to the AT-SPI plugin but also improve Simon's performance in general.

Simond Synchronization 2.0

Simon communicates with the Simond server over a custom TCP protocol. The speech model components are synchronized over the network. As soon as the input data changes, a new model is generated (or loaded from the cache).

This synchronization protocol was originally developed for Simon 0.2 and introduced a significant bottleneck: Each data element (individual scenarios, training data, language model files, etc.) would be synchronized separately. This involved the server querying the client for the modification date and then either requesting the client's version, sending its own version to the client or moving on to the next component if they were already up-to-date. This took at least one full round trip per component - even if all components were already up to date.

To make the synchronization more efficient, a new synchronization protocol was defined: The client now announces the modification date of all components in its initial synchronization offering. The server then builds the synchronization strategy based on that information and, because all requests are now essentially stateless, requests or sends the components that need updating on either side asynchronously.
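The planning step of such a protocol can be sketched in a few lines of Python. This is an illustration only, not Simond's implementation: the function name, the component names and the "newer modification date wins" rule are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def plan_sync(
    client: Dict[str, datetime], server: Dict[str, datetime]
) -> Tuple[List[str], List[str]]:
    """Build the whole synchronization plan from one announcement.

    `client` and `server` map component names to modification dates.
    Returns (components to request from the client,
             components to send to the client).
    """
    request_from_client: List[str] = []
    send_to_client: List[str] = []
    for name in sorted(set(client) | set(server)):
        client_date = client.get(name)
        server_date = server.get(name)
        if client_date == server_date:
            continue  # already up to date, no traffic needed
        # Assumption: the side with the newer (or only) copy wins.
        if server_date is None or (client_date is not None and client_date > server_date):
            request_from_client.append(name)
        else:
            send_to_client.append(name)
    return request_from_client, send_to_client


announcement = {
    "scenario:firefox": datetime(2012, 9, 20),
    "training": datetime(2012, 9, 25),
    "language_model": datetime(2012, 9, 1),
}
server_state = {
    "scenario:firefox": datetime(2012, 9, 20),
    "training": datetime(2012, 9, 10),
    "language_model": datetime(2012, 9, 5),
    "base_model": datetime(2012, 8, 1),
}
print(plan_sync(announcement, server_state))
# (['training'], ['base_model', 'language_model'])
```

Because the plan is computed from a single message, components that are already up to date cost no further round trips, and the remaining transfers are independent of each other.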

Caching³

Simond already had a powerful model cache, but it only kept previously built models around as long as they could be the result of a specific (context) situation. The AT-SPI plugin instead modifies the scenario itself.

Simon has no way of knowing if the same set of components will ever be shown again, but it is in general a safe assumption (e.g.: Showing a menu and closing it again returns to the same state as before opening the menu). To address this, the scenario cache was modified to also keep "abandoned" (unreachable) models available for some time. Right now, the 15 most recently abandoned models are kept in this cache.
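A bounded "most recently abandoned" cache of this kind can be sketched as follows. This is a minimal Python illustration rather than Simond's code. The class and method names are hypothetical, and only the limit of 15 entries is taken from the text above.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class AbandonedModelCache:
    """Keeps the N most recently abandoned models; drops the oldest first."""

    def __init__(self, capacity: int = 15) -> None:
        self.capacity = capacity
        self._models: "OrderedDict[Hashable, Any]" = OrderedDict()

    def abandon(self, key: Hashable, model: Any) -> None:
        """Store a model that no current situation can reach anymore."""
        self._models.pop(key, None)   # re-abandoning refreshes the entry
        self._models[key] = model     # newest entries live at the end
        while len(self._models) > self.capacity:
            self._models.popitem(last=False)  # evict the oldest entry

    def revive(self, key: Hashable) -> Optional[Any]:
        """Return and remove a model if it is still cached, else None."""
        return self._models.pop(key, None)

    def __len__(self) -> int:
        return len(self._models)


cache = AbandonedModelCache(capacity=2)
cache.abandon("main window", "model A")
cache.abandon("file menu open", "model B")
cache.abandon("edit menu open", "model C")   # evicts "main window"
print(cache.revive("main window"))      # None
print(cache.revive("file menu open"))   # model B
```

Closing a menu then becomes a cheap lookup instead of a rebuild, as long as the earlier state has not been pushed out by more recent ones.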

Error handling

Automatically building the active vocabulary depending on visible controls posed another problem: Even the English Voxforge model, one of the most complete open source speech models, sadly does not cover all triphones that crop up when used with such a diverse and dynamic dictionary.

Missing triphone: Before

So more often than not, users would be presented with the dreaded Julius error that is probably familiar to most people who have tried to build a scenario for an existing base model.

The only proper fix for this issue is of course to improve the Voxforge base model to cover all available triphones.

Until this is possible, though, we now work around this issue more gracefully by analyzing the used base model in the adaptation layer. Uncovered triphones are then automatically blacklisted and offending words removed from the active vocabulary. That way Simon can still be activated in such situations and only the blocked word(s) cannot be recognized.

Missing triphone: Now

To make sure this is transparent to the users, the blacklisted triphones are relayed to the client, which will mark such blocked words with a red background in the vocabulary view.
This replaces the previous simple mechanic that marked all words red that had fewer than two training samples - something that became obsolete with the introduction of base models.
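The workaround described in this section can be illustrated with a small Python sketch. It is deliberately simplified and is not the code used in Simon: it only looks at word-internal triphones, writes them in HTK-style "left-center+right" notation, and assumes the set of triphones covered by the base model is already known. All function names and the sample transcriptions are made up for the example.

```python
from typing import Dict, List, Set, Tuple


def word_triphones(phonemes: List[str]) -> List[str]:
    """Word-internal triphones, e.g. ['t', 'ae', 'b'] -> t+ae, t-ae+b, ae-b."""
    result = []
    for i, center in enumerate(phonemes):
        name = center
        if i > 0:
            name = f"{phonemes[i - 1]}-{name}"
        if i < len(phonemes) - 1:
            name = f"{name}+{phonemes[i + 1]}"
        result.append(name)
    return result


def filter_vocabulary(
    vocabulary: Dict[str, List[str]], covered: Set[str]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split the vocabulary into usable words and blocked words.

    Blocked words map to the triphones the base model does not cover, so a
    client could highlight them (for example with a red background).
    """
    usable: Dict[str, List[str]] = {}
    blocked: Dict[str, List[str]] = {}
    for word, phonemes in vocabulary.items():
        missing = [t for t in word_triphones(phonemes) if t not in covered]
        if missing:
            blocked[word] = missing
        else:
            usable[word] = phonemes
    return usable, blocked


vocabulary = {"close": ["k", "l", "ow", "z"], "tab": ["t", "ae", "b"]}
covered = {"k+l", "k-l+ow", "l-ow+z", "ow-z", "t+ae", "ae-b"}  # no "t-ae+b"
usable, blocked = filter_vocabulary(vocabulary, covered)
print(sorted(usable))  # ['close']
print(blocked)         # {'tab': ['t-ae+b']}
```

The model can then be built from the usable words alone, while the blocked words and their missing triphones are reported back to the user.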

Conclusion

As I mentioned before - and the avid reader might also have guessed by the length of this post - the Randa sprint was indeed very productive.

I want to thank Mario Fux et al. for organizing this fantastic event and all the sponsors who helped make it happen. You guys rock!

Thursday, August 16, 2012

Randa Meetings 2012

In only a couple of weeks, this year's KDE Randa Meetings will again take place in the Swiss mountains.

It's the first time I get to take part and I'm already very excited.

One of the main working groups this year will be focused on accessibility: Six people - including a representative of the GNOME team and two dedicated GSoC students who are flying in from India even after the program has already officially ended - are coming together to create an accessibility infrastructure for free software that truly works.

And as if that weren't enough, it's not even "just" accessibility. There will also be teams working on Plasma, Education and Multimedia.

Sadly, not all applicants could be funded this year. But you can help! Please consider a small donation to help cover the expenses for this year's sprint in Randa.

You can lend your support to the KDE Randa Meetings by making a donation at www.pledgie.com.

Wednesday, August 8, 2012

Crossroads

Roughly a month ago, I tendered my resignation from the post of vice-chairman of the Simon Listens e.V.
By the end of the week, this will take effect.

Photo by MarkSmallwood

As a founding member, I did not make this decision lightly.

However, after careful deliberation I came to the conclusion that my own goals don't line up with the rest of the board anymore.

The remaining board members and I parted on good terms and we continue to stay in contact to discuss new projects as well as current efforts.

What happens now?

To be honest, not a whole lot will change.

I remain fully committed to Simon, its core community and KDE at large.
I will continue to lead Simon development and maintain all components.
I will continue to release new versions of Simon. There will be neither a fork of the codebase nor a change of name or logo.
I will continue to use this blog to keep you updated about new developments.
I will of course also continue to mentor my two great GSoC students :)

The Simon Listens e.V. will continue to take on research and commercial projects like the Benefit project or the Astromobile project.
In fact, another project under the sponsorship of the Austrian federal ministry of transport and innovation to further the development of the simon-touch voice controlled multimedia station for disabled and elderly people has just started.
The code, of course, remains open under the terms of the GPL and available in KDE's playground repository.
Claus Zotter will keep you updated about this project's progress right here on this blog.

In short: The Simon community remains just as healthy and vibrant as before.

Still, there are a few minor changes to announce:

My personal Ubuntu PPA which used to hold the Simon packages has been replaced by a new team repository of which I am an administrator. The new repository already contains a new addition: A new package of Simon 0.3.0 for Ubuntu 12.04 courtesy of Mark Dammer. Thank you very much!
The Simon Wiki has been moved to KDE's UserBase. The old URL will become a redirect soon.
While I continue to be reachable via my old Simon Listens email address for the time being, please update your address books to me@bedahr.org. Thank you.

Friday, June 15, 2012

GSoC 2011: The Hilarious Aftermath

During last year's GSoC I had the pleasure of mentoring the very talented Adam Nash on his project to introduce context aware speech recognition in Simon.

After the summer he continued to work on his project until it was ready to be merged with the Simon codebase. Since then he has finished college and, deservedly, started his own promising career.

So imagine my surprise when I received the following email from Adam a couple of days ago:

Hey Peter,

So, I mentioned some sort of "thank you" present before, but private jets were back ordered at the private jet store. I decided, instead, to write you a song that is all about you (probably - I had to guess on some of the biographical details). Anyways, I attached the mp3 and pasted the lyrics below. I hope you like synthesizers! (because all I have is a bass and a synthesizer :/ ...)

Thanks again for your killer mentorship.
-Adam

And guess what? It turns out that Adam is not only a great coder but also produces hilarious music.

The track and lyrics are below (shared with his permission).

(Direct link)

this guy named peter's super whack
he'll pull your code and push your stack
he's got 1337 skills, he'll pwn your C
he likes accessibility

Compared to his, your code is weak
your hello world's got a memory leak
he sorts his lists in constant time
he's super pimp at code design

The computers are all like:
hey there's peter
he's much sweeter
than those other dudes, yo he's the leader
of the simon listens software
so you hate-ahs better beware
cause he wrestles giant ninja bears
eats sandwiches and combs his hair
I tell ya sometimes he will even share
his sandwiches with giant ninja bears
after he beats them in a wrestling match
It's a voluntary action
He doesn't need to share his sandwiches with anyone
But he does
Call the fuzz
just because
If he was
givin sandwiches out
he's got some splainin to spout
for 20 counts of super-felonies for
aiding and abbedding the ninja bears
at their ninja lairs
with their ninja sensibilities
and sneaky swift abilities
they got responsibilities
to hide their visibilities
assassinate nobilities
they causin volatilities
exposin the fragilities
of upper class facilities
they're also bears

Thank you, Adam, not only for doing a great job on your GSoC project, but also for having fun doing it.
It was a pleasure to work with you.

Tuesday, June 12, 2012

Akademy 2012

In just a little over two weeks, the annual KDE conference will start in Tallinn, Estonia.

I'll be there talking about how to build accessible systems:
This includes how to build solutions tailored to people with special needs from the ground up but also how you can make sure that your application can be used by everybody.

See you there!

Monday, May 7, 2012

(Simond Model Caching)²

During last year's Google Summer of Code, Adam Nash developed a system for letting Simon react to changes in the computing context: Simon can for example change its scenario selection depending on the currently running applications or the name of the active window.

The system has been designed with extensibility in mind so that new conditions can be added easily.

Apparently this idea is interesting enough for Apple to try to patent it at the moment.

An easy way to implement context dependence would be to simply deactivate commands when they are not applicable. However, by dynamically creating speech models tailored to the current situation, the recognition rate can be improved considerably.

But creating context dependent speech models leads to a problem: Building models is very time consuming. As the context usually changes very often, the switch between speech models has to be fast.

To compensate, Adam developed a simple caching solution for Simond.
While it worked okay for most use cases, it was a bit buggy and the design had some issues. Because of that, it would have been very hard to switch the model compilation backend (e.g. exchange the HTK with SPHINX).

So during the recent refactoring I also rewrote the context adaptation and caching system in Simond.

"So isn't this, like, really easy?"

The premise seems quite straightforward: Whenever the situation changes, try to find the new situation in a cache. If it is found, use the old model; if not, build a new one and add it to the cache.

However, it's not quite as simple: Input files may change very often, yet for many of those changes it is absolutely predictable that the resulting model won't change. Architecturally speaking, this depends on the model creation backend (in this case the HTK), so an independent caching system can't really identify those situations.

The input files may even change during the model creation process.
An example: Someone with a user generated model has two transcriptions for a single word but only training samples for one of them. Because the training data is transcribed on a word level this can only be identified during the model creation. If a (tri)phone of the alternate (unused) transcription is now undefined (untrained), it needs to be removed from the training corpus. Associated grammar structures might now be invalid, etc. Again, this would mean that the caching system has to be integrated with the model creation backend.

But moving the model caching system to the backend isn't a nice solution either as that would mean that each backend would need to implement its own cache.

"Oh..."

So to enable sensible caching with multiple backends I ended up with a slightly more complicated, two-layered approach:

Model input files are assigned a unique fingerprint. Source files with the same fingerprint are guaranteed to produce the same speech model. The fingerprint is calculated by the model creation backend. This way the calculation can take just those parts of the input files into account that will have an effect on the produced speech model.
In practice this means, for example, that changing command triggers or adding a grammar sentence with no associated words will produce the same fingerprint and therefore not trigger the costly re-creation of all associated models.
The current context is represented through "situations". The cache contains an association between situations and the fingerprint they will provoke. Multiple situations might share the same fingerprint (the same speech model). Once a cached model has no situations assigned to its activation, it will be removed from the cache.
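To make the two layers more concrete, here is a small Python sketch of the idea. It is not Simond's implementation: the class and function names are hypothetical, SHA-1 is just one possible choice of hash, and the backend-specific decision of which input parts are relevant is reduced to "whatever strings the caller passes in".

```python
import hashlib
from typing import Any, Callable, Dict, Iterable


def fingerprint(relevant_parts: Iterable[str]) -> str:
    """Layer 1: hash only the input parts that affect the resulting model."""
    digest = hashlib.sha1()
    for part in relevant_parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") != ("a", "bc")
    return digest.hexdigest()


class ModelCache:
    """Layer 2: situations point to fingerprints, fingerprints to models."""

    def __init__(self) -> None:
        self.situation_to_fp: Dict[str, str] = {}
        self.models: Dict[str, Any] = {}

    def assign(self, situation: str, fp: str, build: Callable[[], Any]) -> Any:
        old = self.situation_to_fp.get(situation)
        self.situation_to_fp[situation] = fp
        if fp not in self.models:
            self.models[fp] = build()  # the costly step, only when needed
        # Drop the previous model once no situation refers to it anymore.
        if old is not None and old != fp and old not in self.situation_to_fp.values():
            self.models.pop(old, None)
        return self.models[fp]


cache = ModelCache()
fp_a = fingerprint(["word: close k l ow z", "grammar: VERB NOUN"])
cache.assign("Firefox open", fp_a, lambda: "model A")
cache.assign("Firefox closed", fp_a, lambda: "never built")  # shares model A
print(len(cache.models))  # 1
```

In this sketch, a change that leaves the fingerprint untouched never reaches `build`, which is the whole point of letting the backend decide what goes into the hash.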

The resulting workflow looks something like this:

To ensure maximum responsiveness, Simond will try to update cached models when the associated input files change. So if you have three situations for your model and add some training data, all three models will be re-evaluated in a separate thread.

The model creation itself uses a thread pool to take advantage of multi-core systems and actually scales very well.

Still, the model creation process can take minutes if you have a lot of training data - even on a decent CPU.

"But what about entirely new situations?"

Creating and maintaining a model cache for all possible situations wouldn't be feasible as the cache would of course grow exponentially with the number of conditions to consider.

To avoid having to wait for the creation of a model for the new situation, the context system was designed to create and maintain the most permissive model available as a fallback.

Let's consider an example: Suppose you have a setup with three scenarios - Firefox, Window management, Amarok - and you configure Simon to activate the Firefox and Amarok scenarios only when the respective applications are running.
The created fallback model would have all three scenarios activated.
Suppose you open and close Firefox quite frequently so those two situations are covered with an up-to-date model. You are currently in the situation that both Firefox and Amarok are closed. Again, there's a model for that. Then you open Amarok for the first time: The correct model would have a disabled Firefox scenario and an activated Amarok scenario.
As the requested model is not available, Simond will now start to compile it. In the meantime, Simond will switch to the fallback model: The one with all scenarios (Firefox, Amarok and the Window Management scenario) activated.

When picking a model to build, the fallback model is given higher priority to ensure that it's (almost) always available.
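The example above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Simond's code: a situation is represented as the set of active scenario names, the cache is a plain dictionary, and both function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Situation = FrozenSet[str]  # names of the scenarios active in a situation


def pick_model(
    wanted: Situation, cache: Dict[Situation, str], fallback: Situation
) -> Tuple[Optional[str], bool]:
    """Return (model, exact). Falls back to the most permissive model."""
    if wanted in cache:
        return cache[wanted], True
    return cache.get(fallback), False


def build_order(pending: List[Situation], fallback: Situation) -> List[Situation]:
    """Queue models for compilation, fallback model first."""
    return sorted(pending, key=lambda s: (s != fallback, sorted(s)))


ALL = frozenset({"Firefox", "Window management", "Amarok"})
cache = {
    ALL: "fallback model",
    frozenset({"Window management"}): "window management only",
    frozenset({"Firefox", "Window management"}): "Firefox + window management",
}

# Amarok is opened for the first time: no exact model exists yet.
wanted = frozenset({"Amarok", "Window management"})
print(pick_model(wanted, cache, ALL))  # ('fallback model', False)
```

A result with `exact` set to `False` would be the signal to queue the targeted model for compilation while the fallback model keeps the recognition running.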

By the way: Simond sends the compiled speech model back to Simon during synchronization. This is done both to shorten the time it takes the recognition to start in a multi-server environment (think of mobile clients) and to ensure the last successfully compiled model is available in case the current input files cannot be compiled and the client connects to a "fresh" server. Of course, only the fallback model is synchronized to keep the network overhead low.

"But what about ambiguous commands?"

There might be setups where commands have different meanings depending on the context. For example, "Find" might issue "Ctrl+F" in LibreOffice but open Google when issued while browsing the web.

To avoid situations of undefined behavior while the targeted model is compiling, deactivated scenarios are not only removed from the speech model on the server side but their commands are also disabled on the client side.

That means the only drawback of the more permissive model is a lower recognition rate for the time it takes Simond to create the new model - ambiguous commands will still be handled correctly.

As soon as the more targeted model is finished building, the recognition will switch automatically.
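The client-side half of this can be pictured with a short Python sketch. It is only an illustration: the command table, the action strings and the `resolve` function are invented for the example and do not reflect Simon's actual command plugins.

```python
from typing import List, Optional, Set, Tuple

# (scenario, trigger, action) - hypothetical command table
Command = Tuple[str, str, str]

COMMANDS: List[Command] = [
    ("LibreOffice", "Find", "press Ctrl+F"),
    ("Browser", "Find", "open Google"),
]


def resolve(recognized: str, commands: List[Command], active: Set[str]) -> Optional[str]:
    """Only commands of currently active scenarios may react to a result."""
    for scenario, trigger, action in commands:
        if scenario in active and trigger == recognized:
            return action
    return None


# Even if the permissive fallback model recognizes "Find", the client only
# executes the command of the scenario that is active right now.
print(resolve("Find", COMMANDS, {"Browser"}))      # open Google
print(resolve("Find", COMMANDS, {"LibreOffice"}))  # press Ctrl+F
print(resolve("Find", COMMANDS, set()))            # None
```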

"Isn't this post getting too long?"

Yes, definitely.

So to sum up: Simon 0.4 will feature a sophisticated model caching and context adaptation mechanism.

The code involved is of course very young and even though everything works fine on my machine I of course expect there to be problems. If you are running Simon 0.3.80 or above, please report any issues you might have on the bug tracker. Thanks!

Tuesday, May 1, 2012

Simon: Usability

One of the simultaneously most important and challenging tasks for me has always been to keep Simon usable for the "average" user.

Yes, reading the manual is sometimes required, but I still feel comfortable saying that users don't need in-depth knowledge about speech recognition to build their own speech models with Simon - and that's something we've always been proud of.

However, the initial learning curve is undoubtedly a bit steep. So let's look at the interface that so often left new users baffled.

Analyzing Simon's Interface

After the initial first run wizard (that sadly many new users seem to skip entirely) the following was the first screen that's shown to new users.

While very pretty (thanks to the Oxygen team), it only provided links to resources where users can find further help. The interface afforded absolutely no interaction pattern and left users stranded.


Simon 0.3.75: Main Screen

After a bit of looking around, the user would probably notice the "Wordlist", "Grammar", etc. tabs containing the components of the currently loaded scenario.
However, even if the user had loaded scenarios in the first run wizard, all those tabs would be completely empty. That's because the user was looking at the "Standard" scenario - an empty default scenario. To change this, users were supposed to use the unlabeled drop-down in the toolbar.

The main reason for this weird interaction pattern was that scenarios are a recent addition to Simon: They were only introduced in Simon 0.3, and while there was a huge amount of internal refactoring associated with that, the UI always felt a bit "tacked on".

So during the last month I re-evaluated parts of Simon's interface to make it more intuitive for new users.

First of all, I identified some principles I wanted to convey to the user and then designed the new interface around them:

Scenarios are opaque. Users can of course edit them if they want, but the average user will probably never touch their components. In any case, there is a strict hierarchy that must be maintained at all times: Scenario A (containing Components A), Scenario B (containing Components B), etc.
Base models are the easiest way to get started. If setting up Simon to use a static base model requires users to search for an archive on a wiki, download and extract it, and point Simon to individual files with cryptic names like "hmmdefs" or "tiedlist", then the interface has clearly failed. It must be easy and intuitive for users to create, share and use base models.
Around half of all recognition problems are microphone-related. For the voice activity detection (the part of Simon that separates "Speech" from "Silence") to work, the volume must be set correctly. Especially with ALSA forgetting volume levels, this is often a source of problems whose only symptom is that the recognition simply doesn't work.
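
To see why the microphone volume matters so much, consider a deliberately simplified, energy-based voice activity detector. This is an assumption made purely for illustration and not Simon's actual detection code; the function names, the threshold value and the synthetic sine wave standing in for a recording are all invented here.

```python
import math


def rms_level(samples):
    """Root mean square of a block of samples in the range [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, threshold=0.1):
    """Treat a block as speech if its level exceeds the (assumed) threshold."""
    return rms_level(samples) > threshold


def sine_block(amplitude, n=160, freq=440.0, rate=16000):
    """Synthetic stand-in for a block of recorded audio."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


# The same "utterance" recorded at two different microphone volumes.
normal = sine_block(0.5)
too_quiet = sine_block(0.05)

print("normal volume detected as speech:", is_speech(normal))
print("low volume detected as speech:", is_speech(too_quiet))
```

With the volume set too low, the block never crosses the threshold, so nothing is ever passed on to the recognition. From the user's point of view it simply looks like Simon isn't working.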

Obviously, the interface needed a major revamp. So over the last month I have been working on and off on some tweaks for what will become Simon 0.4.

The Result

The screenshot below shows the new Simon main screen.

Simon 0.3.80: Welcome Screen

But let's look at the changes individually.

Scenarios

There is now a prominent list of your currently used scenarios in the main screen.

The tabs showing the components of the scenario are gone and have been replaced with a little "Open <scenario name>" button.

Clicking it opens the scenario for editing. While in "edit mode", the overview is hidden. The "Back to overview" bar drops down with a smooth animation to draw the user's attention.

Simon 0.3.80: Wordlist

Training

Next to the scenario list, Simon's main screen now also shows a list of all available training texts of the loaded scenarios. Clicking "Start Training" will start the standard training wizard without opening the "edit mode" of the scenario.

Selecting a training text on the right also selects the scenario it belongs to on the left. This is done both as a visualization of which scenario will benefit the most from the training and as a matter of convenience: If the user wants to remove or add another related training text (which would mean they'd need to "open" the scenario), the correct scenario is already selected.

Speech models

Speech models are now packaged into .sbm files ("Simon Base Model"). The package contains all the required model files as well as some metadata (name, model type and build date).
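
The idea of such a package can be illustrated with a small sketch. Please note that this is not the actual .sbm layout: the compressed tar container, the `metadata.json` member and the function names are assumptions chosen for this example, and the real format may well differ. Only the general principle (model files plus a metadata record in a single file) is taken from the description above.

```python
import io
import json
import tarfile
from datetime import date


def pack_model(name, model_type, build_date, files):
    """Bundle model files and a metadata record into one in-memory archive."""
    metadata = {
        "name": name,
        "type": model_type,
        "build_date": build_date.isoformat(),
    }
    # Hypothetical layout: one metadata member next to the raw model files.
    members = {"metadata.json": json.dumps(metadata).encode("utf-8"), **files}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for member_name, payload in members.items():
            info = tarfile.TarInfo(member_name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_metadata(package):
    """Read only the metadata record back out of a packed model."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r:gz") as archive:
        return json.load(archive.extractfile("metadata.json"))


# Placeholder contents; real model files would be read from disk.
package = pack_model(
    "Demo model", "static", date(2012, 5, 1),
    {"hmmdefs": b"placeholder", "tiedlist": b"placeholder"},
)
print(read_metadata(package))
```

The benefit of a single container is that the user only ever handles one file, and the interface can show the name, type and build date without knowing anything about the individual model files.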

The welcome page shows information about the active model and, if available, the used base model.

Simon 0.3.80: Base Model Settings

The base model settings page provides a way to create the new .sbm files from HTK model files ("Create from model files"). The currently active model can also be exported as an .sbm container to share or archive created models.

Additionally, I've already put in a request to add a new category to kde-files.org and am planning to enable speech model sharing through GHNS.

This package abstraction was also a big step towards supporting other backends next to HTK / Julius but I'll elaborate on that in a different blog post.

Recognition

Last but not least, the Simon main screen now permanently displays the current microphone volume.

The volume calibration widget has been improved to integrate the voice activity parameters and will now no longer require the user to tell it that the volume has been adjusted.

Simon 0.3.80: No applicable command for recognition result

The last recognized command is also displayed. If the command didn't trigger any action, Simon will now display a small note next to the recognized sentence to help scenario developers to track down problems.

Final Words

I am not a usability expert by any means. Having spent so much time with the interface, I wouldn't have noticed a lot of the issues had it not been for the valuable feedback from the community. I especially want to thank Frederik Gladhorn and Bjoern Balzaks for their input.

The interface is of course still far from perfect. However, I'm quite happy about how the recent refactoring has turned out and am looking forward to more improvements in the future.

Have a suggestion or some feedback? Let me know in the comments!

Wednesday, 4 April 2012

Astromobile: Wrapping up

The Astromobile project has been completed!

Franz (l) and Mathias (r) keeping "Astro" company

Okay, the first line should probably read "part of the Astromobile project" but I'm too excited to consider small details like that. :)

While our project partner, the ARTS lab of the Scuola Superiore Sant'Anna, has extended their navigation and localization part by another couple of weeks to really finish it, the voice and touchscreen interaction, and with it the part of Simon Listens, has been developed, deployed and tested successfully on the robot prototype.

Have a look at the video below and see how Simon, Simone, Simontouch and even a bit of ownCloud fit together.

(Direct link to the video)

Wednesday, 29 February 2012

Astromobile: Introducing simontouch

Some of you might remember the announcement of the Astromobile project a while back.

Part of the project was a voice- and touchscreen-controlled kiosk software running on the robot.

Initially we were thinking about continuing our XBMC based solution, but soon decided to start from scratch.
XBMC is a great media center but it didn't fit very well with the rest of our solution.

So more out of necessity than out of huge aspirations, we decided to write a small, purpose-built application called Simontouch that should - among other features - combine simple multimedia playback with communication features (phone and email).

Simontouch (to be found in the simon-tools repository) uses a QML user interface, Phonon-powered video and audio playback, voice and video calling provided by Skype and a simple email client powered by Akonadi and Nepomuk.

Direct link to video

Meanwhile, our colleagues at the Scuola Superiore Sant'Anna have been working on top-notch localization and navigation as well as a great design for the robot:

Direct link to video

Our next trip to Pisa is scheduled for the middle of March and we're planning to bring all this technology together for a state-of-the-art assistive robot - powered by KDE.

By the way: We are planning to take part in GSoC again this year. If you have any cool ideas regarding Simon or KDE Accessibility in general, check out the ideas page!

Saturday, 21 January 2012

Knock, Knock, KDE

After using Sourceforge for the last couple of years, simon finally joined the kool kids on the KDE infrastructure!

As part of the move, we also united the Sourceforge and GitHub repositories - they were only separate for organizational reasons.
We then re-organized the codebase into two projects:
simon (containing the simon application suite) and simon-tools (consisting of smaller tools that we created for various projects like a small command line utility to control Skype, a tiny calendar, and even a touch-friendly media center).

So far (about one week in) I can only say that I'm already thoroughly amazed at how incredibly active and helpful the KDE community really is - but more about that in a later post!