Roadmap for CWB development
We are currently working on the following development plan:
We intend to actively maintain the 3.0 branch until the unstable branch we are working on (3.4)
has had the kinks worked out of it. At that point, 3.4 will give birth to 3.5, and the new stable 3.5.x will become the "active"
branch recommended for regular users.
- version 3.0 is a stable version and first "complete" open-source release
- version 3.1 adds Windows compatibility
- version 3.2 adds Unicode support
- version 3.3 never happened!
- version 3.4 will be a series of betas leading up to the next stable release
- version 3.5 will be a stable version of the new features from 3.1 and 3.2
- version 3.9 will contain pre-release development work for 4.0
- version 4.0 will be a new stable version that breaks backward compatibility in a significant way
A somewhat more detailed TODO list for the immediate future can be found in the
doc section of the SVN repository for the CWB core (file
Progress on 3.4: Not much so far! Our aim is to get down to zero known bugs across this set of versions.
We currently know of a range of little problems plus three or four showstoppers (mostly linked to memory leaks and/or Unicode).
(P.S. For the rationale behind those "skipped" version numbers, see
doc/numbering-rules.html in the Subversion repository for the CWB core.)
Many new features for CQPweb are either planned or in the works. They will be added incrementally in versions
3.1.x to 3.8.x of the software. The following is the “wish list”
for CQPweb development.
Features which are high priority, nearly complete, or otherwise expected to be finished soon,
are highlighted in green. Features which are either low priority, or
which will require major work to implement, and which will thus not be arriving any time soon,
are marked red. Features which are somewhere in the middle are marked
orange. They are listed in rough order of descending urgency.
Completed features are struck through.
We also keep a record of feature requests in the Sourceforge tracker, which is open to users as well
as developers. So, to request a feature that is not on this list, or if you want to let us know which
features from the list you desperately need, visit the tracker
to browse the database for current feature requests, add comments, or post new requests.
A chorus of demands for a particular feature may well induce us to change its colour-status!
Display XML elements. Allow XML information to be visualised,
e.g. by letting timecode information "link out" to related sound files on another website. This is the
highest-priority new feature and is partially implemented but incomplete.
Corpus categories. Better system for categorising corpora on the homepage.
Annotation templates. Save work in indexing by having more "default" templates
than just the Lancaster-tagger setup.
Plugins. A framework for administrators to write their own code that
interfaces CQPweb with external tasks like annotating corpora (using tagger programs), checking file formats,
or transliterating corpus text for rendering in a different writing system. Some default plugins will be
supplied. The interface is nearly complete, but code for managing plugins is needed.
Rendering support for interlinear-morpheme-glossed corpora. This is done in
the concordance view, but is still needed in the context view and in concordance download.
Restrict queries by XML. Like "spoken restrictions" in BNCweb. The single most
requested feature but very difficult to implement. It will be added in version 3.2.x.
Allow users to import their own data. Also a highly
requested feature but very difficult to implement. Upcoming in the 3.2.x series.
Better corpus usage statistics. For example, see what corpora an individual
user is using most.
Persistent metadata table designs. Another aid to indexing corpora.
Standardised frequency breakdown. Amend interface such that it can be applied
to the sort position as well as the mode. This may also involve making it possible to run the FB-function
with annotations other than the primary.
User account automated request system. Because I'm tired of managing
this via email! The system used on Lancaster's BNCweb server
(signup interface here)
will be the model, in broad terms, but something more customisable and more tightly integrated into
CQPweb will be designed. Requests for user accounts should be automated, and should sit for the
admin's attention on next log in. Approval should be a matter of clicking a button!
This will involve switching to user authentication via session
Fixing this should also make it possible for users to
view their own access rights, which is not currently possible.
Upload query function. One of the few bits of BNCweb not yet replicated
R interface. Connection to the R statistical environment as backend
for advanced statistics (e.g. cluster analysis) and to get nice graphs. This has been largely designed and coded,
but needs to be tested and then "plumbed in" to CQPweb where appropriate.
CQPweb on Windows. Not too hard to do, but will require changing the config
file format, so will only happen on a major version change (probably the move up to v 3.5).
Text position labels. Configurable "where in the text are we" indicators
in the concordance display next to the text-ID (equivalent to sentence numbers in BNCweb).
Attribute information. Make it easy for users who know what they're doing to see
what p-attributes and s-attributes are available.
Documentation. An ongoing struggle; things are slowly improving, but
more work is needed.
Web API. Create API-via-HTTP to allow other applications to use CQPweb as
a back-end. This will make it possible for institutions to have highly personalised websites that generate
concordances through CQPweb (e.g. without requiring usernames...)
Interface for public frequency lists. Allowing them to be managed within the
admin control panel.
More character encodings. Support for corpora in any CWB-supported backend encoding;
the interface will continue to be all UTF8, all the time.
CSV outputs. Allow files to be downloaded in CSV as well as tab-delimited formats.
Cache compiled PHP scripts. This will be a small speed improvement
(although the speed of the system largely depends on the MySQL database, in fact).
Variable context width. On a per-user basis. Nonwrapping context should also be
Table-select gizmo. On collocation and concordance tables, to help with exporting
data to spreadsheets, word processors, etc.; making sure that all and only the table is selected for copy-paste.
A small square hovering in the left of the table's header bar which can be clicked to select the table should do
Customised collations. Currently there is a per-corpus choice between utf8_bin and
utf8_general_ci for sorting/collating linguistic data. Neither is satisfactory. What is needed is a set of special collations
just for CQPweb (utf8_cqpweb_cs_as, utf8_cqpweb_cs_ai, utf8_cqpweb_ci_as, utf8_cqpweb_ci_ai) that treat accented
characters reasonably, always sort punctuation after anything letter-like, etc. It is possible to define
new collations for MySQL to load when the daemon starts up (in an XML format that describes the new collation's
diff from utf8_unicode_ci, which could theoretically be auto-generated
from the Unicode database). However, what I don't know is whether or not there is a
performance penalty for using such a collation, especially if it has many, many differences from utf8_unicode_ci.
So in sum this is an important feature, but lots of work and fraught in many ways.
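To make the intended ordering a little more concrete, here is a minimal Python sketch of the behaviour the four proposed collations might have. It is only an illustration of the sorting rules described above (accents and case as optional tie-breakers, punctuation after anything letter-like), not a MySQL collation definition, and the function names `strip_accents` and `collation_key` are hypothetical. It also assumes that "letter-like" can be approximated by Python's `str.isalnum()`.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collation_key(word, case_sensitive=False, accent_sensitive=False):
    """Build a sort key approximating a hypothetical utf8_cqpweb_* collation.

    The primary level ignores case and accents and places every
    non-alphanumeric character after every letter-like character.
    Accent and case differences are only used to break ties, and only
    when the corresponding flag is switched on.
    """
    text = unicodedata.normalize("NFC", word)
    folded = strip_accents(text).casefold()
    # (0, ch) for letter-like characters, (1, ch) for punctuation and symbols.
    primary = tuple((0 if ch.isalnum() else 1, ch) for ch in folded)
    accent_level = text.casefold() if accent_sensitive else ""
    case_level = strip_accents(text) if case_sensitive else ""
    return (primary, accent_level, case_level)


words = ["zebra", "-ing", "éclair", "apple"]
print(sorted(words, key=collation_key))
```

With both flags left off, this corresponds roughly to the `_ci_ai` variant; passing `case_sensitive=True, accent_sensitive=True` corresponds roughly to `_cs_as`. Whether a real MySQL collation built along these lines would perform acceptably is exactly the open question raised above.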
N-grams. High priority, but quite a lot of work.
Extended analysis functions.
Collocation-by-syntactic-relationship. Collocational patterns between words
in predefined grammatical patterns (e.g. noun and modifying adjective, verb and object, etc.) Would only be
supported where requisite annotation is available. A crucial feature but cannot be implemented prior to CWB v 3.9.
Collocation networks. The method is established but not yet built into any
Closure curves. Graphs of lexical closure à la McEnery and Wilson
(1996/2001), generated on-the-fly using cwb-decode and R - but cached! - for corpora and subcorpora.
Factor analysis. Applied to sets of query results.
Cluster analysis of texts. Applied to sets of query results.
Standardised TTR. Calculated and cached per corpus, for a configurable standardisation
Distribution display. Like in WordSmith, but with the "unit" of analysis configurable
Enhanced query annotation. CQPweb currently allows a single annotation
field, like BNCweb, via the "Categorise query" function. Manual annotation of queries should be more
flexible, to support (for instance) Gries and Divjak's "Behavioural Profiles" analysis. Cluster analysis
of annotated queries should of course be built in.
Improved sorting. One frequent request is for multiple sort keys (primary,
secondary, tertiary). This will require a revision to the interface.
Query-diffs and subqueries. Currently just a sketchy idea.
Subcorpus-as-corpus. Allow part of an indexed corpus to appear to users
as if it is a separate corpus (i.e. using separate databases, but the same underlying CWB data files).
Teacher mode. Give teachers access to the query history, etc., of defined
groups of students (without giving them full admin access).
Onscreen keyboards. Pop-up keyboards in different screens to let users type in
massive pain to program this, and lots of other things are higher priority).
CEQL in PHP. To remove the overhead of starting up Perl as a backend.
This should ideally be done programmatically by translating the Perl module into PHP, if at all possible.
Index and search aligned corpora. Offer an interface to cwb-align and
cwb-align-encode. Display aligned concordances.
Graphical query builder. As an alternative to CEQL as a front end to CQP-syntax; this
Concgrams. As in the program of the same name. Very low-priority enhancement,
requested by one user, off the to-do list for the foreseeable future unless more people request it.
Last modified: 07/10/2015 —
powered by GopherPHP version 1.1 (© 2008–2010 by Stefan Evert)