Revision of WA CDX Usage

# Wayback CDX Server API - BETA #

##### Changelist

* 2013-08-07 -- Add this changelist! Page size is now adjustable [Pagination API](#pagination-api)

* 2013-08-07 -- Added support for [Counters](#counters) and [Field Order](#field-order).

* 2013-08-03 -- Added support for [Collapsing](#collapsing)

##### Table of Contents

#### [Intro and Usage](#intro-and-usage)

* [Changelist](#changelist)

* [Basic Usage](#basic-usage)

* [Url Match Scope](#url-match-scope)

* [Output Format (JSON)](#output-format-json)

* [Field Order](#field-order)

* [Filtering](#filtering)

* [Collapsing](#collapsing)

* [Query Result Limits](#query-result-limits)

#### [Advanced Usage](#advanced-usage)

* [Closest Timestamp Match](#closest-timestamp-match)

* [Resumption Key](#resumption-key)

* [Resolve Revisits](#resolve-revisits)

* [Counters](#counters)

* [Duplicate Counter](#duplicate-counter)

* [Skip Counter](#skip-counter)

* [Pagination API](#pagination-api)

* [Access Control](#access-control)

## Intro and Usage ##

The `wayback-cdx-server` is a standalone HTTP servlet that serves the index that the `wayback` machine uses to look up captures.

The index format is known as 'cdx' and contains various fields representing the capture, usually
sorted by url and date.
http://archive.org/web/researcher/cdx_file_format.php

The server responds to GET queries and returns either the plain text CDX data, or optionally a JSON array of the CDX.

The CDX server is deployed as part of the web.archive.org Wayback Machine, and the usage examples below reference this deployment.

However, the cdx server is freely available with the rest of the open-source wayback machine software in this repository.

Further documentation will focus on configuration and deployment in other environments.

Please contact us at wwm@archive.org for additional questions.

### Basic Usage ###

The simplest query, and the only required param for the CDX server, is the **url** param:

* http://web.archive.org/cdx/search/cdx?url=archive.org

The above query will return a portion of the index, with one row for each 'capture' of the url "archive.org"
that is available in the archive.

The columns of each line are the fields of the cdx.
At this time, the following cdx fields are publicly available:

`["urlkey","timestamp","original","mimetype","statuscode","digest","length"]`

It is possible to customize the [Field Order](#field-order) as well.

The **url=** value should be [url encoded](http://en.wikipedia.org/wiki/Percent-encoding) if the url itself contains a query.

All other params are optional and are explained below.

For doing large/bulk queries, the use of the [Pagination API](#pagination-api) is recommended.
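
As a rough illustration of the encoding rule above, the following Python sketch builds a query URL with the standard library. The helper name `build_cdx_query` and the `example.com` target are hypothetical and only serve the example; the endpoint is the web.archive.org deployment used throughout this document.

```python
from urllib.parse import urlencode

# Endpoint of the web.archive.org deployment referenced in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL; `url` is the only required param.

    Hypothetical helper for illustration, not part of the CDX server.
    """
    query = [("url", url)] + list(params.items())
    # urlencode percent-encodes every value, so a target url that itself
    # contains a query string ("?", "&", "=") stays a single `url=` value.
    return CDX_ENDPOINT + "?" + urlencode(query)


print(build_cdx_query("archive.org"))
# example.com is a made-up target whose url contains its own query string.
print(build_cdx_query("example.com/search?q=cdx&page=2", limit=3))
```

Without the encoding step, the `&page=2` part of the target url would be read as a separate CDX server param rather than as part of the **url=** value.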

### Url Match Scope ###

The default behavior is to return matches for an exact url. However, the cdx server can also return results matching a certain
prefix, a certain host or all subdomains by using the **matchType=** param.

For example, if given the url: *archive.org/about/* and:

* **matchType=exact** (default if omitted) will return results matching exactly *archive.org/about/*

* **matchType=prefix** will return all results under the path *archive.org/about/*

http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=prefix&limit=1000

* **matchType=host** will return results from host archive.org

http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=host&limit=1000

* **matchType=domain** will return results from host archive.org and all subhosts \*.archive.org

http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=domain&limit=1000

The matchType may also be set implicitly by using the wildcard '\*' at the end or beginning of the url:

* If the url ends in '/\*', eg **url=archive.org/\*** the query is equivalent to **url=archive.org/&matchType=prefix**
* If the url starts with '\*.', eg **url=\*.archive.org/** the query is equivalent to **url=archive.org/&matchType=domain**

(Note: The *domain* mode is only available if the CDX is in SURT-order format.)
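
The wildcard rules above can be sketched client-side as follows. This is only an illustration of the documented equivalences, not the server's own code: the function name `resolve_match_type` is hypothetical, and checking the leading wildcard before the trailing one is an assumption, since the text does not say what happens when both are present.

```python
def resolve_match_type(url):
    """Return (url, matchType) after applying the implicit wildcard rules.

    Hypothetical helper; the precedence of '*.' over '/*' is an assumption.
    """
    if url.startswith("*."):
        # '*.archive.org/' is equivalent to url=archive.org/&matchType=domain
        return url[2:], "domain"
    if url.endswith("/*"):
        # 'archive.org/*' is equivalent to url=archive.org/&matchType=prefix
        return url[:-1], "prefix"
    # No wildcard: the default exact match applies.
    return url, "exact"


for candidate in ("archive.org/*", "*.archive.org/", "archive.org/about/"):
    print(candidate, "->", resolve_match_type(candidate))
```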

### Output Format (JSON) ###

* Output: **output=json** can be added to return results as a JSON array. The JSON output currently also includes a first row which lists the cdx field names.

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
```
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]
```

* By default, the CDX server returns gzip encoded data for all queries. To turn this off, add the **gzip=false** param.
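
Because the first row of the JSON output names the fields, a client can turn each capture into a field-to-value mapping. The sketch below does this for the sample response shown above; the helper name `rows_to_dicts` is hypothetical.

```python
import json

# The sample response from the example above, embedded so the sketch runs offline.
SAMPLE = """[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]"""


def rows_to_dicts(body):
    """Parse output=json text into one dict per capture, keyed by field name."""
    rows = json.loads(body)
    if not rows:
        return []
    # The first row is the header; every following row is one capture.
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


captures = rows_to_dicts(SAMPLE)
for capture in captures:
    print(capture["timestamp"], capture["statuscode"], capture["original"])
```

Reading the header from the response, rather than hard-coding field positions, keeps the client working when the field list is customized. The sketch does not handle the extra rows that appear when a resumption key is requested.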

### Field Order ###

It is possible to customize the fields returned from the cdx server using the **fl=** param.
Simply pass in a comma-separated list of fields and only those fields will be returned:

* The following returns only the timestamp and mimetype fields with the header `["timestamp","mimetype"]`: http://web.archive.org/cdx/search/cdx?url=archive.org&fl=timestamp,mimetype&output=json

* If omitted, all the available fields are returned by default.

### Filtering ###

* Date Range: Results may be filtered by timestamp using **from=** and **to=** params.
The ranges are inclusive and are specified in the same 1 to 14 digit format used for `wayback` captures: *yyyyMMddhhmmss*

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&from=2010&to=2011

* Regex filtering: It is possible to filter on a specific field or the entire CDX line (which is space delimited).
Filtering by specific field is often simpler.
Any number of filter params of the following form may be specified: **filter=**[!]*field*:*regex*

* *field* is one of the named cdx fields (listed in the header row of the JSON output) or an index of the field. It is often useful to filter by
*mimetype* or *statuscode*

* Optional: *!* before the field inverts the match, that is, will return results that do NOT match the regex.

* *regex* is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)

* Ex: Query for 2 capture results with a non-200 status code:

http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200

* Ex: Query for 10 capture results with a non-200 status code and non text/html mime type matching a specific digest:

http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
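
Since **filter=** may be repeated, a client has to emit the same param name several times. The following sketch shows one way to do that in Python; `cdx_filter` and `build_filtered_query` are hypothetical helper names, and keeping `!`, `:` and `/` unescaped is a readability choice made here so the output matches the examples above.

```python
from urllib.parse import urlencode

# Endpoint of the web.archive.org deployment referenced in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_filter(field, regex, invert=False):
    """Format one filter value as [!]field:regex."""
    return ("!" if invert else "") + field + ":" + regex


def build_filtered_query(url, filters, **params):
    """Build a query URL that repeats filter= once per filter value."""
    query = [("url", url)] + list(params.items())
    query += [("filter", value) for value in filters]
    # '!', ':' and '/' are left unescaped for readability; other special
    # characters in a regex (such as '&' or '+') are still percent-encoded.
    return CDX_ENDPOINT + "?" + urlencode(query, safe="!:/")


non_200 = build_filtered_query(
    "archive.org",
    [cdx_filter("statuscode", "200", invert=True)],
    output="json",
    limit=2,
)
print(non_200)
```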

### Collapsing ###

A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field.
Collapsing is done on adjacent cdx lines: within a run of adjacent lines that are duplicates, all captures after the first one are filtered out.
This is useful for filtering out captures that are 'too dense' or when looking for unique captures.

To use collapsing, add one or more **collapse=field** or **collapse=field:N** where N is the number of leading characters of *field* to compare.

* Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since the first 10 digits 2013022601 match, the 2nd capture will be filtered out.

http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10

The calendar page at web.archive.org uses this filter by default: http://web.archive.org/web/*/archive.org

* Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the cdx are not affected)

http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=digest

* Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):

http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=urlkey&matchType=prefix
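
To make the "adjacent lines only" behavior concrete, here is a small local emulation in Python. It is a sketch of the semantics described above applied to already-fetched rows, not the server's implementation; the function name `collapse` and the sample rows are made up for the example.

```python
from itertools import groupby


def collapse(rows, field, n=None):
    """Keep the first row of every run of adjacent rows whose `field`
    (or its first `n` characters) is identical."""
    def key(row):
        value = row[field]
        return value if n is None else value[:n]

    # groupby only groups *adjacent* equal keys, which mirrors the rule
    # that duplicates elsewhere in the cdx are not affected.
    return [next(group) for _, group in groupby(rows, key=key)]


rows = [
    {"timestamp": "20130226010000", "digest": "A"},
    {"timestamp": "20130226010800", "digest": "B"},
    {"timestamp": "20130226020000", "digest": "A"},
]

# Equivalent in spirit to collapse=timestamp:10 (at most one capture per hour).
hourly = collapse(rows, "timestamp", 10)
# Equivalent in spirit to collapse=digest; the two "A" rows are not adjacent.
by_digest = collapse(rows, "digest")
print(len(hourly), len(by_digest))
```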

### Query Result Limits ###

As the CDX server may return millions or billions of records, it is often necessary to set limits on a single query for practical reasons.
The CDX server provides several mechanisms, including the ability to return the last N as well as the first N results.

* The CDX server config provides a setting for the absolute maximum number of results returned from a single query (currently set to 150000 by default).

* Set **limit=** *N* to return the first N results.

* Set **limit=** *-N* to return the last N results. The query may be slow as it begins reading from the beginning of the search space and skips all but the last N results.

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=-1

* *Advanced Option:* **fastLatest=true** may be set to return *some number* of latest results for an exact match and is faster than the standard last results search. The number of results is at least 1 so **limit=-1** implies this setting. The number of results may be greater than 1 when a secondary index format (such as ZipNum) is used, but is not guaranteed to return any more than 1 result. Combining this setting with **limit=** will ensure that *no more* than N last results are returned.

Ex: This query will result in up to 5 of the latest (by date) query results:

http://web.archive.org/cdx/search/cdx?url=archive.org&fastLatest=true&limit=-5

* The **offset=** *M* param can be used in conjunction with limit to 'skip' the first M records. This allows for a simple way to scroll through the results.

However, the offset/limit model does not scale well to large queries, since the CDX server begins reading at the beginning every time and must read and skip through the number of results specified by
**offset**.

## Advanced Usage ##

The following features are for more specific/advanced usage of the CDX server.

### Resumption Key ###

There is also a new method that allows the CDX server to specify a 'resumption key' that can be used to continue the query from where the previous one ended.
This allows breaking up a large query into smaller queries more efficiently.
This can be achieved by using the **showResumeKey=** and **resumeKey=** params.

* To show the resumption key, add the **showResumeKey=true** param. When set, the resume key will be printed only if the query has more results that have not been printed because the **limit=** (or max query limit) number of results was reached.

* After the end of the query, the *<resumption key>* will be printed on a separate line or as a separate JSON entry.

* Plain text example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true

```
org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402

org%2Carchive%29%2F+19980109140106%21
```

* JSON example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&output=json

```
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
["org,archive)/", "19980109140106", "http://archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
[],
["org%2Carchive%29%2F+19980109140106%21"]]
```

* In a subsequent query, adding **resumeKey=** *<resumption key>* will resume the search from the next result.
No other params from the original query (such as *from=* or *url=*) need to be altered.
To continue from the previous example, the subsequent query would be:

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980109140106%21
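
A client can separate the resumption key from the CDX lines and build the follow-up query roughly as sketched below. The helper names `split_resume_key` and `next_query` are hypothetical, and the sketch assumes the plain text layout shown in the example above: the key is the last line and is preceded by an empty line.

```python
# Plain text sample from the example above, embedded so the sketch runs offline.
SAMPLE = """org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402

org%2Carchive%29%2F+19980109140106%21
"""


def split_resume_key(body):
    """Split a plain text response into (cdx_lines, resume_key or None)."""
    lines = body.strip("\n").splitlines()
    # Assumption: a key, when present, is the last line after an empty line.
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return lines, None


def next_query(base_query, resume_key):
    """Append resumeKey= to the original query (which has no resumeKey yet)."""
    # The key arrives already percent-encoded, so it is appended verbatim
    # instead of being encoded a second time.
    return base_query + "&resumeKey=" + resume_key


cdx_lines, key = split_resume_key(SAMPLE)
base = "http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true"
if key is not None:
    print(next_query(base, key))
```

Repeating this until no key is returned walks through the whole result set in chunks of at most **limit=** results each.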

### Counters ###

There is some work on custom counters to enhance the aggregation capabilities of the CDX server.
These features are brand new and should be considered experimental.

#### Duplicate Counter ####

While collapsing allows for filtering out adjacent results that are duplicates, it is also possible to track duplicates throughout the cdx
using a special new extension.
By adding the **showDupeCount=true** param, a new `dupecount` column will be added to the results.

* The duplicates are determined by tracking rows with the same `digest` field.

* For rows with a `dupecount` greater than 0, the `warc/revisit` mimetype will automatically be resolved to the mimetype of the original, if found.

* Using **showDupeCount=true** will only show unique captures: http://web.archive.org/cdx/search/cdx?url=archive.org&showDupeCount=true&output=json&limit=50

#### Skip Counter ####

It is possible to track how many CDX lines were skipped due to [Filtering](#filtering) and [Collapsing](#collapsing)
by adding the special `skipcount` counter with **showSkipCount=true**.
An optional `endtimestamp` count can also be used to print the timestamp of the last capture by adding **lastSkipTimestamp=true**.

* Ex: Collapse results by year and print number of additional captures skipped and timestamp of last capture:

http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=timestamp:4&output=json&showSkipCount=true&lastSkipTimestamp=true

### Pagination API ###

The above resume key allows for sequential querying of CDX data.
However, in some cases where very large queries are needed (for example a domain query), it may be useful to perform queries
in parallel and also to estimate the total size of the query.

`wayback` and `cdx-server` support secondary loading from a 'zipnum' CDX index.
This index contains CDX lines stored in concatenated GZIP blocks (usually 3,000 lines each) and a secondary index
which provides binary search to the 'zipnum' blocks.
By using the secondary index, it is possible to estimate the total size of a query and also to break up the query into pages.
Using the zipnum format or other secondary index is needed to support pagination.

However, pagination can only work on a single index at a time; merging input from multiple sources (plain cdx or zipnum)
is not possible. As such, the results from a paginated query may be slightly less up-to-date than
a default non-paginated query.

* To use pagination, simply add the **page=i** param to the query to return the i-th page. If pagination is not supported, the CDX server will return a 400.

* Pages are numbered from 0 to *num pages - 1*. If *i<0*, pages are not used. If *i>=num pages*, no results are returned.

Ex: First page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0

Ex: Next Page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=1

* To determine the number of pages, add the **showNumPages=true** param. This is a special query that will return a single number indicating the number of pages.

Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true

* Page size is the number of zipnum blocks scanned per page (so a page size of `1` will contain *up to* 3,000 results per page). This means the number of results on each page will vary, because each block may have a different number of CDX lines matching your query. Page size is configured to an optimal value on the CDX server, and may be similar to max query limit in non-paged mode. The CDX server on archive.org currently has a page size of 50.

* It is possible to adjust the page size to a smaller value than the default by setting **pageSize=P** where 1 <= P <= default page size.

Ex: Get # of pages with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true&pageSize=1

Ex: Get first page with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0&pageSize=1

* If there is only one page, adding the **page=0** param will return the same results as without setting a page.

* It is also possible to have the CDX server return the raw secondary index, by specifying **showPagedIndex=true**. This query returns the secondary index instead of the CDX results and may be subject to access restrictions.

* All other params, including **resumeKey=**, should work in conjunction with pagination.
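
The page arithmetic and the per-page queries can be sketched as follows. The helper names are hypothetical, the 3,000 lines per block figure is the "usual" value quoted above rather than a guarantee, and `num_pages` stands for the number returned by a **showNumPages=true** query.

```python
from urllib.parse import urlencode

# Endpoint of the web.archive.org deployment referenced in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
# "usually 3,000 lines each", per the description of zipnum blocks above.
LINES_PER_BLOCK = 3000


def max_results_per_page(page_size, lines_per_block=LINES_PER_BLOCK):
    """Upper bound on results per page: blocks per page times lines per block."""
    return page_size * lines_per_block


def page_urls(url, num_pages, **params):
    """Build one query URL per page, numbered 0 to num_pages - 1."""
    base = [("url", url)] + list(params.items())
    return [
        CDX_ENDPOINT + "?" + urlencode(base + [("page", page)])
        for page in range(num_pages)
    ]


# With the page size of 50 quoted for archive.org:
print(max_results_per_page(50))
# num_pages=2 is a made-up value standing in for a showNumPages=true result.
for query in page_urls("archive.org", 2):
    print(query)
```

Because each page URL is independent of the others, the resulting list can be handed to parallel workers, which is the use case the pagination API is described as serving.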

### Access Control ###

The cdx server is designed to improve access to archived data for a broad audience, but it may be necessary to restrict certain parts of the cdx.

The cdx server supports granting permissions to restricted data via an API key that is passed in as a cookie.

Currently two restrictions/permission types are supported:

* Access to certain urls which are considered private. When restricted, only public urls are included in query results and access to the secondary index is restricted.

* Access to certain fields, such as filename in the CDX. When restricted, the cdx results contain only public fields.

To allow access, the API key cookie must be explicitly set on the client, eg:

```
curl -H "Cookie: cdx-auth-token=API-Key-Secret" "http://mycdxserver/search/cdx?url=..."
```

The *API-Key-Secret* can be set in the cdx server configuration.
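
The same cookie can be attached from Python. The sketch below only constructs the request object and does not send it; the helper name `authorized_request` is hypothetical, `mycdxserver` and `API-Key-Secret` are the placeholders from the curl example, and `url=archive.org` is an arbitrary query chosen for illustration.

```python
from urllib.request import Request


def authorized_request(query_url, api_key):
    """Return a request carrying the API key in the cdx-auth-token cookie."""
    return Request(query_url, headers={"Cookie": "cdx-auth-token=" + api_key})


# Placeholder host and key; replace both with values for a real deployment.
request = authorized_request(
    "http://mycdxserver/search/cdx?url=archive.org", "API-Key-Secret"
)
print(request.get_header("Cookie"))
```

Passing the returned object to `urllib.request.urlopen` would perform the query against a reachable server.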

## CDX Server Configuration ##

TODO

Start by editing the `wayback-cdx-server-servlet.xml` file in the `WEB-INF` directory. Put some valid CDX files in the `cdxUris` list (file names must end with `cdx` or `cdx.gz`).

d13forme / WA CDX Usage

d13forme revised this gist 1753336843.

		@@ -0,0 +1,378 @@
1	+	# Wayback CDX Server API - BETA #
2	+
3	+	##### Changelist
4	+
5	+	* 2013-08-07 -- Add this changelist! Page size is now adjustable [Pagination API](#pagination-api)
6	+
7	+	* 2013-08-07 -- Added support for [Counters](#counters) and [Field Order](#field-order).
8	+
9	+	* 2013-08-03 -- Added support for [Collapsing](#collapsing)
10	+
11	+
12	+	##### Table of Contents
13	+
14	+	#### [Intro and Usage](#intro-and-usage)
15	+
16	+	* [Changelist](#changelist)
17	+
18	+	* [Basic usage](#basic-usage)
19	+
20	+	* [Url Match Scope](#url-match-scope)
21	+
22	+	* [Output Format (JSON)](#output-format-json)
23	+
24	+	* [Field Order](#field-order)
25	+
26	+	* [Filtering](#filtering)
27	+
28	+	* [Collapsing](#collapsing)
29	+
30	+	* [Query Result Limits](#query-result-limits)
31	+
32	+	#### [Advanced Usage](#advanced-usage)
33	+
34	+	* [Closest Timestamp Match](#closest-timestamp-match)
35	+
36	+	* [Resumption Key](#resumption)
37	+
38	+	* [Resolve Revisits](#resolve-revisits)
39	+
40	+	* [Counters](#counters)
41	+
42	+	* [Duplicate Counter](#duplicate-counter)
43	+
44	+	* [Skip Counter](#skip-counter)
45	+
46	+	* [Pagination API](#pagination-api)
47	+
48	+	* [Access Control](#access-control)
49	+
50	+
51	+
52	+	## Intro and Usage ##
53	+
54	+	The `wayback-cdx-server` is a standalone HTTP servlet that serves the index that the `wayback` machine uses to lookup captures.
55	+
56	+	The index format is known as 'cdx' and contains various fields representing the capture, usually
57	+	sorted by url and date.
58	+	http://archive.org/web/researcher/cdx_file_format.php
59	+
60	+	The server responds to GET queries and returns either the plain text CDX data, or optionally a JSON array of the CDX.
61	+
62	+	The CDX server is deployed as part of web.archive.org Wayback Machine and the usage below reference this deployment.
63	+
64	+	However, the cdx server is freely available with the rest of the open-source wayback machine software in this repository.
65	+
66	+	Further documentation will focus on configuration and deployment in other environments.
67	+
68	+	Please contant us at wwm@archive.org for additional questions.
69	+
70	+
71	+	### Basic Usage ###
72	+
73	+	The most simple query and the only required param for the CDX server is the url param
74	+
75	+	* http://web.archive.org/cdx/search/cdx?url=archive.org
76	+
77	+	The above query will return a portion of the index, one per row, for each 'capture' of the url "archive.org"
78	+	that is available in the archive.
79	+
80	+	The columns of each line are the fields of the cdx.
81	+	At this time, the following cdx fields are publicly available:
82	+
83	+	`["urlkey","timestamp","original","mimetype","statuscode","digest","length"]`
84	+
85	+	It is possible to customize the [Field Order](#field-order) as well.
86	+
87	+	The the url= value should be [url encoded](http://en.wikipedia.org/wiki/Percent-encoding) if the url itself contains a query.
88	+
89	+	All other params are optional and are explained below.
90	+
91	+
92	+	For doing large/bulk queries, the use of the [Pagination API](#pagination-api) is recommended.
93	+
94	+
95	+	### Url Match Scope ###
96	+
97	+	The default behavior is to return matches for an exact url. However, the cdx server can also return results matching a certain
98	+	prefix, a certain host or all subdomains by using the matchType= param.
99	+
100	+	For example, if given the url: archive.org/about/ and:
101	+
102	+	* matchType=exact (default if omitted) will return results matching exactly archive.org/about/
103	+
104	+	* matchType=prefix will return results for all results under the path archive.org/about/
105	+
106	+	http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=prefix&limit=1000
107	+
108	+	* matchType=host will return results from host archive.org
109	+
110	+	http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=host&limit=1000
111	+
112	+	* matchType=domain will return results from host archive.org and all subhosts *.archive.org
113	+
114	+	http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=domain&limit=1000
115	+
116	+
117	+	The matchType may also be set implicitly by using wildcard '*' at end or beginning of the url:
118	+
119	+	* If url is ends in '/\', eg url=archive.org/\ the query is equivalent to url=archive.org/&matchType=prefix**
120	+	* if url starts with '\.', eg url=\.archive.org/ the query is equivalent to url=archive.org/&matchType=domain**
121	+
122	+	(Note: The domain mode is only available if the CDX is in SURT-order format.)
123	+
124	+
125	+	### Output Format (JSON) ##
126	+
127	+	* Output: output=json can be added to return results as JSON array. The JSON output currently also includes a first line which indicates the cdx format.
128	+
129	+	Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
130	+	```
131	+	[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
132	+	["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
133	+	["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
134	+	["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]
135	+	```
136	+
137	+	* By default, CDX server returns gzip encoded data for all queries. To turn this off, add the gzip=false param
138	+
139	+	### Field Order ###
140	+
141	+	It is possible to customize the fields returned from the cdx server using the fl= param.
142	+	Simply pass in a comma separated list of fields and only those fields will be returned:
143	+
144	+	* The following returns only the timestamp and mimetype fields with the header `["timestamp","mimetype"]` http://web.archive.org/cdx/search/cdx?url=archive.org&fl=timestamp,mimetype&output=json
145	+
146	+	* If omitted, all the available fields are returned by default.
147	+
148	+
149	+	### Filtering ###
150	+
151	+	* Date Range: Results may be filtered by timestamp using from= and to= params.
152	+	The ranges are inclusive and are specified in the same 1 to 14 digit format used for `wayback` captures: yyyyMMddhhmmss
153	+
154	+	Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&from=2010&to=2011
155	+
156	+
157	+	* Regex filtering: It is possible to filter on a specific field or the entire CDX line (which is space delimited).
158	+	Filtering by specific field is often simpler.
159	+	Any number of filter params of the following form may be specified: filter=[!]field:regex may be specified.
160	+
161	+	* field is one of the named cdx fields (listed in the JSON query) or an index of the field. It is often useful to filter by
162	+	mimetype or statuscode
163	+
164	+	* Optional: ! before the query inverts the match, that is, will return results that do NOT match the regex.
165	+
166	+	* regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)
167	+
168	+
169	+	* Ex: Query for 2 capture results with a non-200 status code:
170	+
171	+	http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200
172	+
173	+
174	+	* Ex: Query for 10 capture results with a non-200 status code and non text/html mime type matching a specific digest:
175	+
176	+	http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
177	+
178	+	### Collapsing ###
179	+
180	+	A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field.
181	+	Collapsing is done on adjacent cdx lines where all captures after the first one that are duplicate are filtered out.
182	+	This is useful for filtering out captures that are 'too dense' or when looking for unique captures.
183	+
184	+	To use collapsing, add one or more collapse=field or collapse=field:N where N is the first N characters of field to test.
185	+
186	+	* Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since first 10 digits 2013022601 match, the 2nd capture will be filtered out.
187	+
188	+	http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10
189	+
190	+	The calendar page at web.archive.org uses this filter by default: http://web.archive.org/web/*/archive.org
191	+
192	+
193	+	* Ex: Only show unique captures by digest (note that only adjacent digest are collapsed, duplicates elsewhere in the cdx are not affected)
194	+
195	+	http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=digest
196	+
197	+
198	+	* Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
199	+
200	+	http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=urlkey&matchType=prefix
201	+
202	+
203	+	### Query Result Limits ###
204	+
205	+	As the CDX server may return millions or billions of record, it is often necessary to set limits on a single query for practical reasons.
206	+	The CDX server provides several mechanisms, including ability to return the last N as well as first N results.
207	+
208	+	* The CDX server config provides a setting for absolute maximum length returned from a single query (currently set to 150000 by default).
209	+
210	+	* Set limit= N to return the first N results.
211	+
212	+	* Set limit= -N to return the last N results. The query may be slow as it begins reading from the beginning of the search space and skips all but last N results.
213	+
214	+	Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=-1
215	+
216	+	* Advanced Option: fastLatest=true may be set to return some number of latest results for an exact match and is faster than the standard last results search. The number of results is at least 1 so limit=-1 implies this setting. The number of results may be greater >1 when a secondary index format (such as ZipNum) is used, but is not guaranteed to return any more than 1 result. Combining this setting with limit= will ensure that no more than N last results.
217	+
218	+	Ex: This query will result in upto 5 of the latest (by date) query results:
219	+
220	+	http://web.archive.org/cdx/search/cdx?url=archive.org&fastLatest=true&limit=-5
221	+
222	+	* The offset= M param can be used in conjunction with limit to 'skip' the first M records. This allows for a simple way to scroll through the results.
223	+
224	+	However, the offset/limit model does not scale well to large querties since the CDX server must read and skip through the number of results specified by
225	+	offset, so the CDX server begins reading at the beginning every time.
226	+
227	+
228	+	## Advanced Usage ##
229	+
230	+	The following features are for more specific/advanced usage of the CDX server.
231	+
232	+
233	+	### Resumption Key ###
234	+
235	+	There is also a new method that allows the CDX server to specify a 'resumption key' that can be used to continue the query from where the previous one ended.
236	+	This allows breaking up a large query into smaller queries more efficiently.
237	+	This can be achieved by using the showResumeKey= and resumeKey= params.
238	+
239	+	* To show the resumption key add the showResumeKey=true param. When set, the resume key will be printed only if the query has more results that have not been printed because the limit= (or max query limit) number of results was reached.
240	+
241	+	* After the end of the query, the <resumption key> will be printed on a separate line (plain text) or as a separate JSON entry.
242	+
243	+	* Plain text example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true
244	+
245	+	```
246	+	org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
247	+	org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
248	+	org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
249	+	org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
250	+	org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
251	+
252	+	org%2Carchive%29%2F+19980109140106%21
253	+	```
254	+
255	+	* JSON example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&output=json
256	+
257	+	```
258	+	[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
259	+	["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
260	+	["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
261	+	["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
262	+	["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
263	+	["org,archive)/", "19980109140106", "http://archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
264	+	[],
265	+	["org%2Carchive%29%2F+19980109140106%21"]]
266	+	```
267	+
268	+	* In a subsequent query, adding resumeKey= <resumption key> will resume the search from the next result.
269	+	No other params from the original query (such as from= or url=) need to be altered.
270	+	To continue from the previous example, the subsequent query would be:
271	+
272	+	Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980109140106%21
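
A client has to separate the resumption key from the CDX lines before issuing the next query. The sketch below assumes the plain-text layout shown in the example above (CDX lines, one blank line, then the key); the function name `split_resume_key` is hypothetical, and the sample is a shortened copy of the example output.

```python
from urllib.parse import unquote_plus


def split_resume_key(body):
    """Split a plain-text CDX response into (cdx_lines, resume_key).

    Assumes the layout from the example above: CDX lines, one blank line,
    then the resumption key. resume_key is None if no key was printed.
    """
    lines = body.splitlines()
    if len(lines) >= 2 and lines[-2] == "":
        return lines[:-2], lines[-1]
    return lines, None


sample = (
    "org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405\n"
    "org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402\n"
    "\n"
    "org%2Carchive%29%2F+19980109140106%21\n"
)
rows, resume_key = split_resume_key(sample)

# Decoding is only for inspection; the key arrives URL-encoded.
decoded_key = unquote_plus(resume_key)

# The key is appended verbatim, as in the example query above.
next_url = (
    "http://web.archive.org/cdx/search/cdx?url=archive.org"
    "&limit=5&showResumeKey=true&resumeKey=" + resume_key
)
```

Repeating this until no key is returned walks through the full result set sequentially.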
273	+
274	+	### Counters ###
275	+
276	+	There is some work on custom counters to enhance the aggregation capabilities of the CDX server.
277	+	These features are brand new and should be considered experimental.
278	+
279	+	#### Duplicate Counter ####
280	+
281	+	While collapsing allows for filtering out adjacent results that are duplicates, it is also possible to track duplicates throughout the cdx
282	+	using a special new extension.
283	+	By adding showDupeCount=true, a new `dupecount` column will be added to the results.
284	+
285	+	* The duplicates are determined by tracking rows with the same `digest` field.
286	+
287	+	* For rows with a `dupecount` > 0, the `warc/revisit` mimetype will automatically be resolved to the mimetype of the original, if found.
288	+
289	+	* Using showDupeCount=true will only show unique captures: http://web.archive.org/cdx/search/cdx?url=archive.org&showDupeCount=true&output=json&limit=50
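
To make the idea of tracking duplicates by `digest` concrete, here is a small client-side Python sketch. It is not the server's implementation: it assumes that `dupecount` is the number of earlier rows with the same digest (so the first occurrence gets 0), which the text above implies but does not state. The function name `add_dupe_counts` is hypothetical.

```python
def add_dupe_counts(rows, digest_index=5):
    """Append a dupecount-like column to CDX rows (illustrative only).

    Assumption: the count is the number of earlier rows sharing the
    same digest, so a first occurrence gets "0".
    """
    seen = {}
    out = []
    for row in rows:
        digest = row[digest_index]
        count = seen.get(digest, 0)
        out.append(list(row) + [str(count)])
        seen[digest] = count + 1
    return out


# Rows in the default field order: urlkey, timestamp, original,
# mimetype, statuscode, digest, length.
rows = [
    ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
    ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
    ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
]
counted = add_dupe_counts(rows)
```

Unlike collapsing, this tracking does not depend on the duplicate rows being adjacent.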
290	+
291	+
292	+	#### Skip Counter ####
293	+
294	+	It is possible to track how many CDX lines were skipped due to [Filtering](#filtering) and [Collapsing](#collapsing)
295	+	by adding the special `skipcount` counter with showSkipCount=true.
296	+	An optional `endtimestamp` count can also be used to print the timestamp of the last capture by adding lastSkipTimestamp=true.
297	+
298	+	* Ex: Collapse results by year and print number of additional captures skipped and timestamp of last capture:
299	+
300	+	http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=timestamp:4&output=json&showSkipCount=true&lastSkipTimestamp=true
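
The following Python sketch illustrates what such a query reports, using adjacent-row collapsing on the first 4 digits of the timestamp. It is a client-side approximation under stated assumptions, not the server's code: the function name is hypothetical, and reporting a row's own timestamp as the last timestamp when nothing was skipped is a choice made for this sketch.

```python
from itertools import groupby


def collapse_with_skip_count(timestamps, prefix_len=4):
    """Collapse adjacent timestamps sharing a prefix (illustrative only).

    Returns one (first_timestamp, skipped, last_timestamp) tuple per group,
    where `skipped` is the number of additional captures in the group.
    """
    result = []
    # groupby only merges adjacent items, which matches adjacent collapsing.
    for _, group in groupby(timestamps, key=lambda ts: ts[:prefix_len]):
        group = list(group)
        result.append((group[0], len(group) - 1, group[-1]))
    return result


timestamps = [
    "19970126045828",
    "19971011050034",
    "19971211122953",
    "19971211122953",
    "19980109140106",
]
collapsed = collapse_with_skip_count(timestamps)
```

With these timestamps, the 1997 group keeps its first capture and counts three skipped ones, while the single 1998 capture has none skipped.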
301	+
302	+
303	+	### Pagination API ###
304	+
305	+	The above resume key allows for sequential querying of CDX data.
306	+	However, in some cases where very large queries are needed (for example, a domain query), it may be useful to perform queries
307	+	in parallel and also estimate the total size of the query.
308	+
309	+	`wayback` and `cdx-server` support secondary loading from a 'zipnum' CDX index.
310	+	This index contains CDX lines stored in concatenated GZIP blocks (usually 3,000 lines each) and a secondary index
311	+	which provides binary search to the 'zipnum' blocks.
312	+	By using the secondary index, it is possible to estimate the total size of a query and also split the query into smaller pieces.
313	+	Using the zipnum format or other secondary index is needed to support pagination.
314	+
315	+	However, pagination can only work on a single index at a time; merging input from multiple sources (plain cdx or zipnum)
316	+	is not possible. As such, the results from a paginated query may be slightly less up-to-date than
317	+	a default non-paginated query.
318	+
319	+	* To use pagination, simply add the page=i param to the query to return the i-th page. If pagination is not supported, the CDX server will return a 400.
320	+
321	+	* Pages are numbered from 0 to num pages - 1. If i<0, pages are not used. If i>=num pages, no results are returned.
322	+
323	+	Ex: First page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0
324	+
325	+	Ex: Next Page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=1
326	+
327	+
328	+	* To determine the number of pages, add the showNumPages=true param. This is a special query that will return a single number indicating the number of pages.
329	+
330	+	Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true
331	+
332	+	* Page size is the number of zipnum blocks scanned per page (so a page size of `1` will contain up to 3,000 results per page). This means the number of results on each page will vary, because each block may have a different number of CDX lines matching your query. Page size is configured to an optimal value on the CDX server, and may be similar to max query limit in non-paged mode. The CDX server on archive.org currently has a page size of 50.
333	+
334	+	* It is possible to adjust the page size to a smaller value than the default by setting pageSize=P, where 1 <= P <= default page size.
335	+
336	+	Ex: Get # of pages with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true&pageSize=1
337	+
338	+	Ex: Get first page with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0&pageSize=1
339	+
340	+
341	+	* If there is only one page, adding the page=0 param will return the same results as without setting a page.
342	+
343	+	* It is also possible to have the CDX server return the raw secondary index, by specifying showPagedIndex=true. This query returns the secondary index instead of the CDX results and may be subject to access restrictions.
344	+
345	+	* All other params, including the resumeKey= should work in conjunction with pagination.
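
Because pages are independent, the per-page URLs can be generated up front and fetched in parallel. The Python sketch below only builds the URLs; it assumes the page count has already been obtained from a showNumPages=true query, and the helper name `page_urls` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def page_urls(url, num_pages, page_size=None):
    """Return one query URL per page, for pages 0 .. num_pages - 1."""
    urls = []
    for page in range(num_pages):
        params = {"url": url, "page": str(page)}
        if page_size is not None:
            # Must satisfy 1 <= page_size <= server default page size.
            params["pageSize"] = str(page_size)
        urls.append(CDX_ENDPOINT + "?" + urlencode(params))
    return urls


# num_pages=3 is a made-up value standing in for the showNumPages result.
urls = page_urls("archive.org", num_pages=3)
small_pages = page_urls("archive.org", num_pages=3, page_size=1)
```

The same pageSize value should be used for the showNumPages=true query and for the page queries, since the number of pages depends on the page size.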
346	+
347	+
348	+
349	+	### Access Control ###
350	+
351	+	The cdx server is designed to improve access to archived data for a broad audience, but it may be necessary to restrict certain parts of the cdx.
352	+
353	+	The cdx server supports granting permissions to restricted data via an API key that is passed in as a cookie.
354	+
355	+	Currently two restrictions/permission types are supported:
356	+
357	+	* Access to certain urls which are considered private. When restricted, only public urls are included in query results and access to the secondary index is restricted.
358	+
359	+	* Access to certain fields, such as filename in the CDX. When restricted, the cdx results contain only public fields.
360	+
361	+
362	+	To allow access, the API key cookie must be explicitly set on the client, e.g.:
363	+
364	+	```
365	+	curl -H "Cookie: cdx-auth-token=API-Key-Secret" "http://mycdxserver/search/cdx?url=..."
366	+	```
367	+
368	+	The API-Key-Secret can be set in the cdx server configuration.
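
The same cookie can be set from a script. This Python sketch builds a request with the standard library and does not send it; the cookie name `cdx-auth-token` and the placeholder host `mycdxserver` come from the curl example above, while the helper name `authorized_request` and the `url=archive.org` value are only illustrative.

```python
from urllib.request import Request


def authorized_request(query_url, api_key):
    """Build a request carrying the API key cookie (hypothetical helper)."""
    req = Request(query_url)
    req.add_header("Cookie", "cdx-auth-token=" + api_key)
    return req


# "API-Key-Secret" stands in for the key set in the server configuration.
req = authorized_request(
    "http://mycdxserver/search/cdx?url=archive.org", "API-Key-Secret"
)
```

Passing `req` to `urllib.request.urlopen` would then issue the query with the cookie attached.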
369	+
370	+
371	+	## CDX Server Configuration ##
372	+
373	+
374	+	TODO
375	+
376	+	Start by editing the wayback-cdx-server-servlet.xml file in the WEB-INF directory. Add some valid CDX files to the cdxUris list (file names must end with cdx or cdx.gz).
377	+
378	+