Last active 1753336844

d13forme revised this gist 1753336843.

1 file changed, 378 insertions

README.md (file created)

@@ -0,0 +1,378 @@
1 + # Wayback CDX Server API - BETA #
2 +
3 + ##### Changelist
4 +
5 + * 2013-08-07 -- Added this changelist. Page size is now adjustable; see [Pagination API](#pagination-api).
6 +
7 + * 2013-08-07 -- Added support for [Counters](#counters) and [Field Order](#field-order).
8 +
9 + * 2013-08-03 -- Added support for [Collapsing](#collapsing)
10 +
11 +
12 + ##### Table of Contents
13 +
14 + #### [Intro and Usage](#intro-and-usage)
15 +
16 + * [Changelist](#changelist)
17 +
18 + * [Basic usage](#basic-usage)
19 +
20 + * [Url Match Scope](#url-match-scope)
21 +
22 + * [Output Format (JSON)](#output-format-json)
23 +
24 + * [Field Order](#field-order)
25 +
26 + * [Filtering](#filtering)
27 +
28 + * [Collapsing](#collapsing)
29 +
30 + * [Query Result Limits](#query-result-limits)
31 +
32 + #### [Advanced Usage](#advanced-usage)
33 +
34 + * [Closest Timestamp Match](#closest-timestamp-match)
35 +
36 + * [Resumption Key](#resumption-key)
37 +
38 + * [Resolve Revisits](#resolve-revisits)
39 +
40 + * [Counters](#counters)
41 +
42 + * [Duplicate Counter](#duplicate-counter)
43 +
44 + * [Skip Counter](#skip-counter)
45 +
46 + * [Pagination API](#pagination-api)
47 +
48 + * [Access Control](#access-control)
49 +
50 +
51 +
52 + ## Intro and Usage ##
53 +
54 + The `wayback-cdx-server` is a standalone HTTP servlet that serves the index that the `wayback` machine uses to look up captures.
55 +
56 + The index format is known as 'cdx' and contains various fields representing the capture, usually
57 + sorted by url and date. The format is described at:
58 + http://archive.org/web/researcher/cdx_file_format.php
59 +
60 + The server responds to GET queries and returns either the plain text CDX data, or optionally a JSON array of the CDX.
61 +
62 + The CDX server is deployed as part of the web.archive.org Wayback Machine, and the usage examples below reference this deployment.
63 +
64 + However, the cdx server is freely available with the rest of the open-source wayback machine software in this repository.
65 +
66 + Further documentation will focus on configuration and deployment in other environments.
67 +
68 + Please contact us at wwm@archive.org for additional questions.
69 +
70 +
71 + ### Basic Usage ###
72 +
73 + The simplest query uses the only required param for the CDX server, the **url** param:
74 +
75 + * http://web.archive.org/cdx/search/cdx?url=archive.org
76 +
77 + The above query will return a portion of the index, one row for each 'capture' of the url "archive.org"
78 + that is available in the archive.
79 +
80 + The columns of each line are the fields of the cdx.
81 + At this time, the following cdx fields are publicly available:
82 +
83 + `["urlkey","timestamp","original","mimetype","statuscode","digest","length"]`
84 +
85 + It is possible to customize the [Field Order](#field-order) as well.
86 +
87 + The **url=** value should be [url encoded](http://en.wikipedia.org/wiki/Percent-encoding) if the url itself contains a query.
88 +
89 + All other params are optional and are explained below.
90 +
91 +
92 + For doing large/bulk queries, the use of the [Pagination API](#pagination-api) is recommended.
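
As a minimal sketch of the basic query described above, the following Python builds a query URL and splits one plain-text result line into the seven public fields. The helper names `build_query` and `parse_line` are illustrative and not part of the CDX server; the sample line is copied from the examples later in this document, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint and field list as given in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def build_query(url, **params):
    """Return a CDX query URL; urlencode percent-encodes the url value."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, **params})


def parse_line(line, fields=CDX_FIELDS):
    """Split one space-delimited CDX line into a field-name -> value dict."""
    return dict(zip(fields, line.split(" ")))


sample = ("org,archive)/ 19970126045828 http://www.archive.org:80/ "
          "text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415")

print(build_query("archive.org"))
print(parse_line(sample)["timestamp"])
```

Because `urlencode` percent-encodes the value, a url that itself contains a query string (for example `example.com/page?x=1`) is passed through safely.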
93 +
94 +
95 + ### Url Match Scope ###
96 +
97 + The default behavior is to return matches for an exact url. However, the cdx server can also return results matching a certain
98 + prefix, a certain host or all subdomains by using the **matchType=** param.
99 +
100 + For example, if given the url: *archive.org/about/* and:
101 +
102 + * **matchType=exact** (default if omitted) will return results matching exactly *archive.org/about/*
103 +
104 + * **matchType=prefix** will return all results under the path *archive.org/about/*
105 +
106 + http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=prefix&limit=1000
107 +
108 + * **matchType=host** will return results from host archive.org
109 +
110 + http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=host&limit=1000
111 +
112 + * **matchType=domain** will return results from host archive.org and all subhosts *.archive.org
113 +
114 + http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=domain&limit=1000
115 +
116 +
117 + The matchType may also be set implicitly by using wildcard '*' at end or beginning of the url:
118 +
119 + * If the url ends in '/\*', e.g. **url=archive.org/\***, the query is equivalent to **url=archive.org/&matchType=prefix**
120 + * If the url starts with '\*.', e.g. **url=\*.archive.org/**, the query is equivalent to **url=archive.org/&matchType=domain**
121 +
122 + (Note: The *domain* mode is only available if the CDX is in SURT-order format.)
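
The implicit wildcard rules above can be sketched client-side as follows. This is only an illustration of the two documented equivalences, not the server's own code. The function name `resolve_match_type` is hypothetical, and checking the leading `*.` before the trailing `/*` is an assumption, since the text does not say what happens when both are present.

```python
def resolve_match_type(url, match_type=None):
    """Return (url, matchType) after expanding a documented wildcard form."""
    if match_type is not None:
        # An explicit matchType= is passed through unchanged.
        return url, match_type
    if url.startswith("*."):
        # '*.archive.org/' is equivalent to 'archive.org/' with matchType=domain.
        return url[2:], "domain"
    if url.endswith("/*"):
        # 'archive.org/*' is equivalent to 'archive.org/' with matchType=prefix.
        return url[:-1], "prefix"
    # Default when nothing else applies.
    return url, "exact"


print(resolve_match_type("archive.org/*"))
print(resolve_match_type("*.archive.org/"))
```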
123 +
124 +
125 + ### Output Format (JSON) ###
126 +
127 + * Output: **output=json** can be added to return results as a JSON array. The JSON output currently also includes a first line which indicates the cdx format.
128 +
129 + Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3
130 + ```
131 + [["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
132 + ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
133 + ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
134 + ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]
135 + ```
136 +
137 + * By default, the CDX server returns gzip encoded data for all queries. To turn this off, add the **gzip=false** param.
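
Since the first JSON row names the fields, a client can pair it with each following row. A minimal sketch, using the sample response shown above (the helper name `rows_from_json` is illustrative):

```python
import json


def rows_from_json(text):
    """Turn a CDX JSON response into a list of dicts keyed by the header row."""
    data = json.loads(text)
    if not data:
        return []
    header, *rows = data
    return [dict(zip(header, row)) for row in rows]


sample_json = """
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
 ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
 ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]]
"""

for row in rows_from_json(sample_json):
    print(row["timestamp"], row["statuscode"], row["length"])
```

Reading the field names from the header row, rather than hard-coding them, keeps the client working when the field list is customized.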
138 +
139 + ### Field Order ###
140 +
141 + It is possible to customize the fields returned from the cdx server using the **fl=** param.
142 + Simply pass in a comma separated list of fields and only those fields will be returned:
143 +
144 + * The following returns only the timestamp and mimetype fields with the header `["timestamp","mimetype"]` http://web.archive.org/cdx/search/cdx?url=archive.org&fl=timestamp,mimetype&output=json
145 +
146 + * If omitted, all the available fields are returned by default.
147 +
148 +
149 + ### Filtering ###
150 +
151 + * Date Range: Results may be filtered by timestamp using **from=** and **to=** params.
152 + The ranges are inclusive and are specified in the same 1 to 14 digit format used for `wayback` captures: *yyyyMMddhhmmss*
153 +
154 + Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&from=2010&to=2011
155 +
156 +
157 + * Regex filtering: It is possible to filter on a specific field or the entire CDX line (which is space delimited).
158 + Filtering by specific field is often simpler.
159 + Any number of filter params of the following form may be specified: **filter=**[!]*field*:*regex*
160 +
161 + * *field* is one of the named cdx fields (listed in the header row of the JSON output) or an index of the field. It is often useful to filter by
162 + *mimetype* or *statuscode*
163 +
164 + * Optional: *!* before the field inverts the match, that is, returns results that do NOT match the regex.
165 +
166 + * *regex* is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)
167 +
168 +
169 + * Ex: Query for 2 capture results with a non-200 status code:
170 +
171 + http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200
172 +
173 +
174 + * Ex: Query for 10 capture results with a non-200 status code and non text/html mime type matching a specific digest:
175 +
176 + http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
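
Because **filter=** may be repeated, it is convenient to build the query from a list of pairs rather than a dict. The sketch below only assembles the URL and does not contact the server. The function name `build_filtered_query` and its `include`/`exclude` arguments are illustrative; keeping `!`, `:` and `/` unescaped is a cosmetic choice made so that the output matches the URLs shown above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_filtered_query(url, include=(), exclude=(), **params):
    """Build a CDX query URL with one filter= param per (field, regex) pair."""
    query = [("url", url), *params.items()]
    # Plain filters keep only rows whose field matches the regex.
    query += [("filter", f"{field}:{regex}") for field, regex in include]
    # A leading '!' inverts the match.
    query += [("filter", f"!{field}:{regex}") for field, regex in exclude]
    return CDX_ENDPOINT + "?" + urlencode(query, safe="!:/")


print(build_filtered_query("archive.org",
                           exclude=[("statuscode", "200")],
                           output="json", limit=2))
```

Other regex metacharacters are still percent-encoded by `urlencode`. Date-range params can be passed the same way, although `from` is a Python keyword and has to be supplied as `**{"from": "2010"}`.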
177 +
178 + ### Collapsing ###
179 +
180 + A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field.
181 + Collapsing is done on adjacent cdx lines: after the first capture, all adjacent captures that duplicate it on the chosen field are filtered out.
182 + This is useful for filtering out captures that are 'too dense' or when looking for unique captures.
183 +
184 + To use collapsing, add one or more **collapse=field** or **collapse=field:N** params, where N is the number of leading characters of *field* to compare.
185 +
186 + * Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since first 10 digits 2013022601 match, the 2nd capture will be filtered out.
187 +
188 + http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10
189 +
190 + The calendar page at web.archive.org uses this filter by default: http://web.archive.org/web/*/archive.org
191 +
192 +
193 + * Ex: Only show unique captures by digest (note that only adjacent digest are collapsed, duplicates elsewhere in the cdx are not affected)
194 +
195 + http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=digest
196 +
197 +
198 + * Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
199 +
200 + http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=urlkey&matchType=prefix
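
To make the 'adjacent only' behaviour concrete, here is a small local simulation of collapsing. It is a sketch of the rule as described above, not the server's implementation, and the function name `collapse` is illustrative.

```python
def collapse(rows, field, n=None):
    """Drop rows whose (optionally truncated) field equals the previous row's."""
    kept = []
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        key = row[field] if n is None else row[field][:n]
        if key != previous:
            kept.append(row)
        previous = key
    return kept


captures = [
    {"timestamp": "20130226010000", "digest": "A"},
    {"timestamp": "20130226010800", "digest": "A"},
    {"timestamp": "20130226020000", "digest": "B"},
    {"timestamp": "20130226030000", "digest": "A"},
]

# collapse=timestamp:10 keeps at most one capture per hour.
print([r["timestamp"] for r in collapse(captures, "timestamp", 10)])

# collapse=digest removes only adjacent duplicates, so the last "A" survives.
print([r["digest"] for r in collapse(captures, "digest")])
```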
201 +
202 +
203 + ### Query Result Limits ###
204 +
205 + As the CDX server may return millions or billions of records, it is often necessary to set limits on a single query for practical reasons.
206 + The CDX server provides several mechanisms, including the ability to return the last N as well as the first N results.
207 +
208 + * The CDX server config provides a setting for absolute maximum length returned from a single query (currently set to 150000 by default).
209 +
210 + * Set **limit=** *N* to return the first N results.
211 +
212 + * Set **limit=** *-N* to return the last N results. The query may be slow as it begins reading from the beginning of the search space and skips all but last N results.
213 +
214 + Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=-1
215 +
216 + * *Advanced Option:* **fastLatest=true** may be set to return *some number* of latest results for an exact match and is faster than the standard last results search. The number of results is at least 1, so **limit=-1** implies this setting. The number of results may be greater than 1 when a secondary index format (such as ZipNum) is used, but the server is not guaranteed to return more than 1 result. Combining this setting with **limit=** will ensure that *no more* than the last N results are returned.
217 +
218 + Ex: This query will return up to 5 of the latest (by date) query results:
219 +
220 + http://web.archive.org/cdx/search/cdx?url=archive.org&fastLatest=true&limit=-5
221 +
222 + * The **offset=** *M* param can be used in conjunction with limit to 'skip' the first M records. This allows for a simple way to scroll through the results.
223 +
224 + However, the offset/limit model does not scale well to large queries, since the CDX server begins reading at the beginning every time
225 + and must read and skip through the number of results specified by **offset**.
226 +
227 +
228 + ## Advanced Usage ##
229 +
230 + The following features are for more specific/advanced usage of the CDX server.
231 +
232 +
233 + ### Resumption Key ###
234 +
235 + There is also a new method that allows the CDX server to specify a 'resumption key' that can be used to continue the query from the previous end.
236 + This allows breaking up a large query into smaller queries more efficiently.
237 + This can be achieved by using the **showResumeKey=** and **resumeKey=** params.
238 +
239 + * To show the resumption key, add the **showResumeKey=true** param. When set, the resume key will be printed only if the query has more results that have not been printed because the **limit=** (or max query limit) number of results was reached.
240 +
241 + * After the end of the query, the *<resumption key>* will be printed on a separate line or as a separate JSON row.
242 +
243 + * Plain text example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true
244 +
245 + ```
246 + org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
247 + org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
248 + org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
249 + org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405
250 + org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402
251 +
252 + org%2Carchive%29%2F+19980109140106%21
253 + ```
254 +
255 + * JSON example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&output=json
256 +
257 + ```
258 + [["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
259 + ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
260 + ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
261 + ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
262 + ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"],
263 + ["org,archive)/", "19980109140106", "http://archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"],
264 + [],
265 + ["org%2Carchive%29%2F+19980109140106%21"]]
266 + ```
267 +
268 + * In a subsequent query, adding **resumeKey=** *<resumption key>* will resume the search from the next result.
269 + No other params from the original query (such as *from=* or *url=*) need to be altered.
270 + To continue from the previous example, the subsequent query would be:
271 +
272 + Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980109140106%21
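
A client loop over resumption keys needs two steps: separate the key from the CDX lines, then append it to the original query. The sketch below does both for the plain-text format shown above, assuming the key is always preceded by one blank line as in that example. The helper names are illustrative, and no request is sent.

```python
def split_resume_key(text):
    """Return (cdx_lines, resume_key); resume_key is None when none was printed."""
    lines = text.rstrip("\n").split("\n")
    # In the plain-text example, a blank line separates results from the key.
    if len(lines) >= 2 and lines[-2] == "":
        return [line for line in lines[:-2] if line], lines[-1]
    return [line for line in lines if line], None


def next_query(query_url, resume_key):
    """Append resumeKey= to the original query URL."""
    # The key is printed already percent-encoded, so it is appended as-is.
    return f"{query_url}&resumeKey={resume_key}"


body = (
    "org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405\n"
    "org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402\n"
    "\n"
    "org%2Carchive%29%2F+19980109140106%21\n"
)

rows, key = split_resume_key(body)
first = "http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true"
print(len(rows), key)
print(next_query(first, key))
```

Passing the key through a URL encoder a second time would double-encode the `%` characters, which is why it is concatenated directly here.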
273 +
274 + ### Counters ###
275 +
276 + Work is under way on custom counters to enhance the aggregation capabilities of the CDX server.
277 + These features are brand new and should be considered experimental.
278 +
279 + #### Duplicate Counter ####
280 +
281 + While collapsing allows for filtering out adjacent results that are duplicates, it is also possible to track duplicates throughout the cdx
282 + using a special new extension.
283 + By adding **showDupeCount=true**, a new `dupecount` column will be added to the results.
284 +
285 + * The duplicates are determined by tracking rows with the same `digest` field.
286 +
287 + * The `warc/revisit` mimetype in rows with a `dupecount` > 0 will automatically be resolved to the mimetype of the original, if found.
288 +
289 + * Using **showDupeCount=true** will only show unique captures: http://web.archive.org/cdx/search/cdx?url=archive.org&showDupeCount=true&output=json&limit=50
290 +
291 +
292 + #### Skip Counter ####
293 +
294 + It is possible to track how many CDX lines were skipped due to [Filtering](#filtering) and [Collapsing](#collapsing)
295 + by adding the special `skipcount` counter with **showSkipCount=true**.
296 + An optional `endtimestamp` column can also be added to print the timestamp of the last capture by adding **lastSkipTimestamp=true**.
297 +
298 + * Ex: Collapse results by year and print number of additional captures skipped and timestamp of last capture:
299 +
300 + http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=timestamp:4&output=json&showSkipCount=true&lastSkipTimestamp=true
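
The interaction between collapsing and the skip counter can be pictured with a local simulation. This is only a sketch of the behaviour described above, not the server's code. In particular, what the server reports as `endtimestamp` for a row with nothing skipped is not stated here, so defaulting it to the row's own timestamp is an assumption, as is the function name.

```python
def collapse_with_skipcount(timestamps, n):
    """Collapse adjacent timestamps on their first n digits, counting skips."""
    out = []
    for ts in timestamps:
        if out and out[-1]["timestamp"][:n] == ts[:n]:
            # Same group as the previous kept row: count it and remember it.
            out[-1]["skipcount"] += 1
            out[-1]["endtimestamp"] = ts
        else:
            # Assumption: with nothing skipped, endtimestamp is the row itself.
            out.append({"timestamp": ts, "skipcount": 0, "endtimestamp": ts})
    return out


stamps = ["19970126045828", "19971011050034", "19971211122953", "19980109140106"]

# Equivalent in spirit to collapse=timestamp:4 with both counters enabled.
for row in collapse_with_skipcount(stamps, 4):
    print(row)
```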
301 +
302 +
303 + ### Pagination API ###
304 +
305 + The above resume key allows for sequential querying of CDX data.
306 + However, in some cases where very large queries are needed (for example, a domain query), it may be useful to perform queries
307 + in parallel and also estimate the total size of the query.
308 +
309 + `wayback` and `cdx-server` support secondary loading from a 'zipnum' CDX index.
310 + This index contains CDX lines stored in concatenated GZIP blocks (usually 3,000 lines each) and a secondary index
311 + which provides binary search to the 'zipnum' blocks.
312 + By using the secondary index, it is possible to estimate the total size of a query and also break the query up into smaller parts.
313 + Using the zipnum format or other secondary index is needed to support pagination.
314 +
315 + However, pagination can only work on a single index at a time; merging input from multiple sources (plain cdx or zipnum)
316 + is not possible. As such, the results from a paginated query may be slightly less up-to-date than
317 + a default non-paginated query.
318 +
319 + * To use pagination, simply add the **page=i** param to the query to return the i-th page. If pagination is not supported, the CDX server will return a 400.
320 +
321 + * Pages are numbered from 0 to *num pages - 1*. If *i<0*, pages are not used. If *i>=num pages*, no results are returned.
322 +
323 + Ex: First page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0
324 +
325 + Ex: Next Page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=1
326 +
327 +
328 + * To determine the number of pages, add the **showNumPages=true** param. This is a special query that will return a single number indicating the number of pages.
329 +
330 + Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true
331 +
332 + * Page size is the number of zipnum blocks scanned per page (so a page size of `1` will contain *up to* 3,000 results per page). This means the number of results on each page will vary, because each block may have a different number of CDX lines matching your query. Page size is configured to an optimal value on the CDX server, and may be similar to max query limit in non-paged mode. The CDX server on archive.org currently has a page size of 50.
333 +
334 + * It is possible to adjust the page size to a smaller value than the default by setting **pageSize=P**, where 1 <= P <= default page size.
335 +
336 + Ex: Get # of pages with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true&pageSize=1
337 +
338 + Ex: Get first page with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0&pageSize=1
339 +
340 +
341 + * If there is only one page, adding the **page=0** param will return the same results as without setting a page.
342 +
343 + * It is also possible to have the CDX server return the raw secondary index, by specifying **showPagedIndex=true**. This query returns the secondary index instead of the CDX results and may be subject to access restrictions.
344 +
345 + * All other params, including the resumeKey= should work in conjunction with pagination.
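
Since pages are numbered from 0 to *num pages - 1*, a client can generate every page URL up front and fetch them in parallel. The sketch below only builds the URLs and parses the **showNumPages=true** reply; the function names are illustrative, and the fetching itself is left out.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def parse_num_pages(body):
    """Parse the single number returned by a showNumPages=true query."""
    return int(body.strip())


def page_urls(url, num_pages, page_size=None, **params):
    """Return one query URL per page, numbered 0 .. num_pages - 1."""
    urls = []
    for i in range(num_pages):
        query = {"url": url, **params, "page": i}
        if page_size is not None:
            # pageSize must be repeated on every page request to stay consistent.
            query["pageSize"] = page_size
        urls.append(CDX_ENDPOINT + "?" + urlencode(query))
    return urls


for u in page_urls("archive.org", parse_num_pages("2\n")):
    print(u)
```

The page count depends on the page size, so the same **pageSize=** value should be used for the **showNumPages=true** query and for the page requests that follow it.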
346 +
347 +
348 +
349 + ### Access Control ###
350 +
351 + The cdx server is designed to improve access to archived data for a broad audience, but it may be necessary to restrict certain parts of the cdx.
352 +
353 + The cdx server supports granting permissions to restricted data via an API key that is passed in as a cookie.
354 +
355 + Currently two restrictions/permission types are supported:
356 +
357 + * Access to certain urls which are considered private. When restricted, only public urls are included in query results and access to secondary index is restricted.
358 +
359 + * Access to certain fields, such as filename in the CDX. When restricted, the cdx results contain only public fields.
360 +
361 +
362 + To allow access, the API key cookie must be explicitly set on the client, eg:
363 +
364 + ```
365 + curl -H "Cookie: cdx-auth-token=API-Key-Secret" "http://mycdxserver/search/cdx?url=..."
366 + ```
367 +
368 + The *API-Key-Secret* can be set in the cdx server configuration.
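
The same cookie can be set from a script. A minimal sketch using only the Python standard library: the cookie name `cdx-auth-token` comes from the curl example above, while the host `mycdxserver`, the secret value and the function name are placeholders. The request object is only constructed here, not sent.

```python
from urllib.request import Request


def authorized_request(query_url, api_key):
    """Return a Request carrying the API key in the cdx-auth-token cookie."""
    return Request(query_url, headers={"Cookie": f"cdx-auth-token={api_key}"})


req = authorized_request("http://mycdxserver/search/cdx?url=archive.org",
                         "API-Key-Secret")
print(req.get_header("Cookie"))
```

Passing `req` to `urllib.request.urlopen` would then perform the query with the cookie attached.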
369 +
370 +
371 + ## CDX Server Configuration ##
372 +
373 +
374 + TODO
375 +
376 + Start by editing the `wayback-cdx-server-servlet.xml` file in the `WEB-INF` directory. Put some valid CDX files in the `cdxUris` list (file names must end with `cdx` or `cdx.gz`).
377 +
378 +