3.6 Random subsets

- when there are a lot of matches, e.g.
`> A = "time";`

`> size A;`

it is often desirable to look at a random selection to get a quick overview (rather than just seeing matches from the first part of the corpus); one possibility is to do a `sort randomize` and then go through the first few pages of random matches:

`> sort A randomize;`

however, this cannot be combined with other sort options such as alphabetical sorting on match or left/right context; it also doesn't speed up frequency lists, `set target` and other post-processing operations

- as an alternative to randomized ordering, the `reduce` command randomly selects a given number or proportion of matches, deleting all other matches from the named query; since this operation is destructive, it may be necessary to make a copy of the original query results first (see above)

`> reduce A to 10%;`

`> size A;`

`> sort A by word %cd on match .. matchend[42];`

`> reduce A to 100;`

`> size A;`

`> sort A by word %cd on match .. matchend[42];`

this allows arbitrary further operations to be carried out on a representative sample rather than the full query result
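
To make the behaviour of a destructive random reduction concrete, the following Python sketch models it on a plain list of match positions. It is only an illustration under stated assumptions, not CQP's implementation: the function name `reduce_matches`, the way percentages are rounded, and the example positions are all hypothetical.

```python
import random

def reduce_matches(matches, target, seed=None):
    """Keep a random subset of `matches`, returned in the original (corpus) order.

    `target` is either an int (absolute number of matches to keep) or a
    string such as "10%" (proportion of the current matches).
    Hypothetical helper: rounding of percentages is an assumption.
    """
    rng = random.Random(seed)
    if isinstance(target, str) and target.endswith("%"):
        keep = round(len(matches) * float(target[:-1]) / 100)
    else:
        keep = min(int(target), len(matches))
    # sample list positions, then sort them so the survivors stay in corpus order
    kept_positions = sorted(rng.sample(range(len(matches)), keep))
    return [matches[i] for i in kept_positions]

A = list(range(0, 5000, 5))      # 1000 hypothetical match positions
backup = list(A)                 # copy first: the reduction discards matches
A = reduce_matches(A, "10%", seed=42)
print(len(A))                    # 100
```

The copy taken before the reduction plays the same role as copying a named query before `reduce`: once the other matches are gone, they can only be recovered from the copy or by re-running the query.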

- set the random number generator seed before `reduce` for a reproducible selection

`> randomize 42;`

(use any positive integer as seed)

- a second method for obtaining a random subset of a named query result is to sort the matches in random order and then take the first 100 matches from the sorted query; the example below has the same effect as `reduce A to 100;` (though it will not select exactly the same matches)

`> sort A randomize;`

`> cut A 100;`

`> sort A;`

(restore corpus order, as with the `reduce` command)

reproducible subsets can be obtained with a suitable `randomize` command before the `sort`; the main difference from the `reduce` command is that `cut` cannot be used to select a percentage of matches (i.e., you have to determine the number of matches in the desired subset yourself)

- the most important advantage of the second method is that it can produce stable and incremental random samples
- for a stable random ordering, specify a positive seed value directly in
the sort command:
`> sort A randomize 42;`

different seeds give different, reproducible orderings; if you randomize a subset of `A` with the same seed value, the matches will appear exactly in the same order as in the randomized version of `A`:

`> A = "interesting" cut 20;`

(just for illustration)

`> B = A;`

`> reduce B to 10;`

(an arbitrary subset of A)

`> sort A randomize 42;`

`> sort B randomize 42;`
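
One way to see why a seeded random ordering can be stable across subsets is to give every match a pseudo-random sort key that depends only on the seed and on the match itself, never on which other matches are present. The Python sketch below illustrates that idea; it is an assumption made for illustration and does not claim to reproduce how CQP computes its ordering. The names `stable_key` and `sort_randomize` and the example positions are hypothetical.

```python
import hashlib

def stable_key(match, seed):
    """Pseudo-random sort key that depends only on the match and the seed."""
    start, end = match
    digest = hashlib.sha256(f"{seed}:{start}:{end}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def sort_randomize(matches, seed):
    """Return the matches ordered by their seeded pseudo-random keys."""
    return sorted(matches, key=lambda m: stable_key(m, seed))

# 20 hypothetical (match, matchend) pairs and an arbitrary subset of them
A = [(pos, pos) for pos in range(100, 2100, 100)]
B = A[::2]

A_random = sort_randomize(A, 42)
B_random = sort_randomize(B, 42)

# the members of B appear in the same relative order in both listings
print([m for m in A_random if m in B] == B_random)
```

Because each key is computed independently of the rest of the list, removing matches from the query cannot change the relative order of the matches that remain.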

- in order to build incremental random samples from a query result, sort it randomly (but with a seed value to ensure reproducibility) and then take the first n matches as sample #1, the next n matches as sample #2, etc.; unlike two subsets generated with `reduce`, the first two samples are disjoint and together form a random sample of size 2n:

`> A = "time";`

`> sort A randomize 7;`

`> Sample1 = A;`

`> cut Sample1 0 99;`

(random sample of 100 matches)

`> Sample2 = A;`

`> cut Sample2 100 199;`

(random sample of 100 matches)

note that the `cut` removes the randomized ordering; you can reapply the stable randomization to achieve full correspondence to the randomized query result `A`:

`> sort Sample2 randomize 7;`

`> cat Sample2;`

`> cat A 100 199;`

- stability of the randomization ensures that random samples are
reproducible even after the initial query has been refined or spurious
matches have been deleted manually
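
The incremental sampling scheme and its robustness to later deletions can be checked with a small Python model. This is a sketch under assumptions, not CQP's actual algorithm: the ordering is simulated by hashing the seed together with each match position, and `stable_order` and the example positions are hypothetical.

```python
import hashlib

def stable_order(matches, seed):
    """Order matches by a hash of (seed, match), independent of the other matches."""
    def key(match):
        return hashlib.sha256(f"{seed}:{match}".encode()).hexdigest()
    return sorted(matches, key=key)

A = list(range(0, 3000, 3))          # 1000 hypothetical match positions
ordered = stable_order(A, 7)

sample1 = ordered[0:100]             # corresponds to ranks 0 .. 99
sample2 = ordered[100:200]           # corresponds to ranks 100 .. 199

# the two samples share no matches and together form a sample of 200
disjoint = not (set(sample1) & set(sample2))
combined = sample1 + sample2

# delete ten "spurious" matches that happen to fall into the second sample
spurious = set(ordered[150:160])
cleaned = [m for m in A if m not in spurious]
reordered = stable_order(cleaned, 7)

# sample #1 is unchanged, and the remaining matches keep their relative order
print(disjoint, len(combined), reordered[:100] == sample1)
```

In this model, deleting matches only removes entries from the randomized listing; every surviving match keeps its place relative to the others, so samples drawn before and after the clean-up remain comparable.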

Andrew Hardie 2017-01-17