d13forme revised this gist.
1 file changed, 378 insertions
README.md (file created)
@@ -0,0 +1,378 @@ | |||
1 | + | # Wayback CDX Server API - BETA # | |
2 | + | ||
3 | + | ##### Changelist | |
4 | + | ||
5 | + | * 2013-08-07 -- Add this changelist! Page size is now adjustable [Pagination API](#pagination-api) | |
6 | + | ||
7 | + | * 2013-08-07 -- Added support for [Counters](#counters) and [Field Order](#field-order). | |
8 | + | ||
9 | + | * 2013-08-03 -- Added support for [Collapsing](#collapsing) | |
10 | + | ||
11 | + | ||
12 | + | ##### Table of Contents | |
13 | + | ||
14 | + | #### [Intro and Usage](#intro-and-usage) | |
15 | + | ||
16 | + | * [Changelist](#changelist) | |
17 | + | ||
18 | + | * [Basic usage](#basic-usage) | |
19 | + | ||
20 | + | * [Url Match Scope](#url-match-scope) | |
21 | + | ||
22 | + | * [Output Format (JSON)](#output-format-json) | |
23 | + | ||
24 | + | * [Field Order](#field-order) | |
25 | + | ||
26 | + | * [Filtering](#filtering) | |
27 | + | ||
28 | + | * [Collapsing](#collapsing) | |
29 | + | ||
30 | + | * [Query Result Limits](#query-result-limits) | |
31 | + | ||
32 | + | #### [Advanced Usage](#advanced-usage) | |
33 | + | ||
34 | + | * [Closest Timestamp Match](#closest-timestamp-match) | |
35 | + | ||
36 | + | * [Resumption Key](#resumption-key) | |
37 | + | ||
38 | + | * [Resolve Revisits](#resolve-revisits) | |
39 | + | ||
40 | + | * [Counters](#counters) | |
41 | + | ||
42 | + | * [Duplicate Counter](#duplicate-counter) | |
43 | + | ||
44 | + | * [Skip Counter](#skip-counter) | |
45 | + | ||
46 | + | * [Pagination API](#pagination-api) | |
47 | + | ||
48 | + | * [Access Control](#access-control) | |
49 | + | ||
50 | + | ||
51 | + | ||
52 | + | ## Intro and Usage ## | |
53 | + | ||
54 | + | The `wayback-cdx-server` is a standalone HTTP servlet that serves the index that the `wayback` machine uses to look up captures. | |
55 | + | ||
56 | + | The index format is known as 'cdx' and contains various fields representing the capture, usually | |
57 | + | sorted by url and date. | |
58 | + | http://archive.org/web/researcher/cdx_file_format.php | |
59 | + | ||
60 | + | The server responds to GET queries and returns either the plain text CDX data, or optionally a JSON array of the CDX. | |
61 | + | ||
62 | + | The CDX server is deployed as part of the web.archive.org Wayback Machine, and the usage examples below reference this deployment. | |
63 | + | ||
64 | + | However, the cdx server is freely available with the rest of the open-source wayback machine software in this repository. | |
65 | + | ||
66 | + | Further documentation will focus on configuration and deployment in other environments. | |
67 | + | ||
68 | + | Please contact us at wwm@archive.org with additional questions. | |
69 | + | ||
70 | + | ||
71 | + | ### Basic Usage ### | |
72 | + | ||
73 | + | The simplest query, and the only required param for the CDX server, is the **url** param: | |
74 | + | ||
75 | + | * http://web.archive.org/cdx/search/cdx?url=archive.org | |
76 | + | ||
77 | + | The above query will return a portion of the index, one row for each 'capture' of the url "archive.org" | |
78 | + | that is available in the archive. | |
79 | + | ||
80 | + | The columns of each line are the fields of the cdx. | |
81 | + | At this time, the following cdx fields are publicly available: | |
82 | + | ||
83 | + | `["urlkey","timestamp","original","mimetype","statuscode","digest","length"]` | |
84 | + | ||
85 | + | It is possible to customize the [Field Order](#field-order) as well. | |
86 | + | ||
87 | + | The **url=** value should be [url encoded](http://en.wikipedia.org/wiki/Percent-encoding) if the url itself contains a query. | |
88 | + | ||
89 | + | All other params are optional and are explained below. | |
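As a minimal sketch of the encoding advice above, the query string can be built with Python's standard `urllib.parse.urlencode`, which percent-encodes the **url** value. The helper name `build_cdx_query` and the target `example.com/search?q=cdx&page=2` are illustrative assumptions, not part of the CDX server.

```python
from urllib.parse import urlencode

# Public endpoint used by the examples in this document.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL with `url` first and any other params after it.

    urlencode percent-encodes each value, so a query string inside `url`
    cannot be mistaken for CDX server params.
    """
    pairs = [("url", url)] + list(params.items())
    return CDX_ENDPOINT + "?" + urlencode(pairs)


# Hypothetical target url that itself contains a query string.
print(build_cdx_query("example.com/search?q=cdx&page=2", limit=5))
```

Only the construction of the request URL is shown; fetching it is left to whichever HTTP client is in use.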
90 | + | ||
91 | + | ||
92 | + | For doing large/bulk queries, the use of the [Pagination API](#pagination-api) is recommended. | |
93 | + | ||
94 | + | ||
95 | + | ### Url Match Scope ### | |
96 | + | ||
97 | + | The default behavior is to return matches for an exact url. However, the cdx server can also return results matching a certain | |
98 | + | prefix, a certain host or all subdomains by using the **matchType=** param. | |
99 | + | ||
100 | + | For example, if given the url: *archive.org/about/* and: | |
101 | + | ||
102 | + | * **matchType=exact** (default if omitted) will return results matching exactly *archive.org/about/* | |
103 | + | ||
104 | + | * **matchType=prefix** will return all results under the path *archive.org/about/* | |
105 | + | ||
106 | + | http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=prefix&limit=1000 | |
107 | + | ||
108 | + | * **matchType=host** will return results from host archive.org | |
109 | + | ||
110 | + | http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=host&limit=1000 | |
111 | + | ||
112 | + | * **matchType=domain** will return results from host archive.org and all subhosts *.archive.org | |
113 | + | ||
114 | + | http://web.archive.org/cdx/search/cdx?url=archive.org/about/&matchType=domain&limit=1000 | |
115 | + | ||
116 | + | ||
117 | + | The matchType may also be set implicitly by using wildcard '*' at end or beginning of the url: | |
118 | + | ||
119 | + | * If url ends in '/\*', e.g. **url=archive.org/\***, the query is equivalent to **url=archive.org/&matchType=prefix** | |
120 | + | * If url starts with '\*.', e.g. **url=\*.archive.org/**, the query is equivalent to **url=archive.org/&matchType=domain** | |
121 | + | ||
122 | + | (Note: The *domain* mode is only available if the CDX is in SURT-order format.) | |
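The wildcard equivalences above can be restated as a small client-side helper. This is only a sketch of the documented mapping, not the server's own parsing code; the function name `implicit_match_type` and the order in which the two wildcards are checked are assumptions.

```python
def implicit_match_type(url):
    """Map a wildcard url to an equivalent (url, matchType) pair."""
    if url.startswith("*."):
        # '*.archive.org/' is equivalent to url 'archive.org/' with matchType=domain
        return url[2:], "domain"
    if url.endswith("/*"):
        # 'archive.org/*' is equivalent to url 'archive.org/' with matchType=prefix
        return url[:-1], "prefix"
    # No wildcard: the default exact match applies.
    return url, "exact"


for query in ("archive.org/*", "*.archive.org/", "archive.org/about/"):
    print(query, "->", implicit_match_type(query))
```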
123 | + | ||
124 | + | ||
125 | + | ### Output Format (JSON) ### | |
126 | + | ||
127 | + | * Output: **output=json** can be added to return results as JSON array. The JSON output currently also includes a first line which indicates the cdx format. | |
128 | + | ||
129 | + | Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=3 | |
130 | + | ``` | |
131 | + | [["urlkey","timestamp","original","mimetype","statuscode","digest","length"], | |
132 | + | ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"], | |
133 | + | ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"], | |
134 | + | ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"]] | |
135 | + | ``` | |
136 | + | ||
137 | + | * By default, CDX server returns gzip encoded data for all queries. To turn this off, add the **gzip=false** param | |
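Because the first JSON row names the fields, each following row can be turned into a dictionary keyed by field name. The sketch below parses a two-capture excerpt of the sample response shown above; the helper name `rows_to_dicts` is an assumption, and the response body is assumed to have been fetched and decompressed already.

```python
import json

# Excerpt of the sample JSON response from this section.
SAMPLE = """[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
 ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"]]"""


def rows_to_dicts(body):
    """Pair every data row with the field names given in the first row."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


captures = rows_to_dicts(SAMPLE)
for capture in captures:
    print(capture["timestamp"], capture["statuscode"], capture["original"])
```

Reading the field names from the header row, rather than hard-coding them, keeps the code working when the set or order of returned fields changes.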
138 | + | ||
139 | + | ### Field Order ### | |
140 | + | ||
141 | + | It is possible to customize the fields returned from the cdx server using the **fl=** param. | |
142 | + | Simply pass in a comma separated list of fields and only those fields will be returned: | |
143 | + | ||
144 | + | * The following returns only the timestamp and mimetype fields with the header `["timestamp","mimetype"]` http://web.archive.org/cdx/search/cdx?url=archive.org&fl=timestamp,mimetype&output=json | |
145 | + | ||
146 | + | * If omitted, all the available fields are returned by default. | |
147 | + | ||
148 | + | ||
149 | + | ### Filtering ### | |
150 | + | ||
151 | + | * Date Range: Results may be filtered by timestamp using **from=** and **to=** params. | |
152 | + | The ranges are inclusive and are specified in the same 1 to 14 digit format used for `wayback` captures: *yyyyMMddhhmmss* | |
153 | + | ||
154 | + | Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&from=2010&to=2011 | |
155 | + | ||
156 | + | ||
157 | + | * Regex filtering: It is possible to filter on a specific field or the entire CDX line (which is space delimited). | |
158 | + | Filtering by specific field is often simpler. | |
159 | + | Any number of filter params of the following form may be specified: **filter=**[!]*field*:*regex* | |
160 | + | ||
161 | + | * *field* is one of the named cdx fields (listed in the header row of the JSON output) or an index of the field. It is often useful to filter by | |
162 | + | *mimetype* or *statuscode* | |
163 | + | ||
164 | + | * Optional: *!* before the query inverts the match, that is, will return results that do NOT match the regex. | |
165 | + | ||
166 | + | * *regex* is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html) | |
167 | + | ||
168 | + | ||
169 | + | * Ex: Query for 2 capture results with a non-200 status code: | |
170 | + | ||
171 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200 | |
172 | + | ||
173 | + | ||
174 | + | * Ex: Query for 10 capture results with a non-200 status code and non text/html mime type matching a specific digest: | |
175 | + | ||
176 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV | |
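Since **filter=** may appear more than once, a client needs to build the query from a list of pairs rather than a dictionary. The following sketch combines a date range with two inverted filters; the particular combination of params is chosen for illustration only.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# A list of pairs (not a dict) lets the same key appear more than once.
params = [
    ("url", "archive.org"),
    ("output", "json"),
    ("limit", "10"),
    ("from", "2010"),
    ("to", "2011"),
    ("filter", "!statuscode:200"),
    ("filter", "!mimetype:text/html"),
]

# safe="!:/" leaves the filter syntax readable; other reserved characters
# in a value (for example '&' or '+' inside a regex) are still percent-encoded.
query_url = CDX_ENDPOINT + "?" + urlencode(params, safe="!:/")
print(query_url)
```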
177 | + | ||
178 | + | ### Collapsing ### | |
179 | + | ||
180 | + | A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. | |
181 | + | Collapsing is done on adjacent cdx lines: all captures after the first one that duplicate it are filtered out. | |
182 | + | This is useful for filtering out captures that are 'too dense' or when looking for unique captures. | |
183 | + | ||
184 | + | To use collapsing, add one or more **collapse=field** or **collapse=field:N** where N is the first N characters of *field* to test. | |
185 | + | ||
186 | + | * Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since first 10 digits 2013022601 match, the 2nd capture will be filtered out. | |
187 | + | ||
188 | + | http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10 | |
189 | + | ||
190 | + | The calendar page at web.archive.org uses this filter by default: http://web.archive.org/web/*/archive.org | |
191 | + | ||
192 | + | ||
193 | + | * Ex: Only show unique captures by digest (note that only adjacent digests are collapsed, duplicates elsewhere in the cdx are not affected) | |
194 | + | ||
195 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=digest | |
196 | + | ||
197 | + | ||
198 | + | * Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment): | |
199 | + | ||
200 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=urlkey&matchType=prefix | |
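The adjacency rule is easiest to see on a small example. The sketch below emulates a single **collapse=** param on the client side with `itertools.groupby`, which also groups only adjacent equal keys. It illustrates the behavior described above and is not the server's implementation; the rows and digest values are made up.

```python
from itertools import groupby


def collapse(rows, field, n=None):
    """Keep the first row of each run of adjacent rows with an equal key.

    The key is the whole `field` value, or its first `n` characters,
    mirroring collapse=field and collapse=field:N.
    """
    def key(row):
        value = row[field]
        return value if n is None else value[:n]

    # groupby starts a new group whenever the key changes, so equal keys
    # that are not adjacent end up in separate groups and both survive.
    return [next(group) for _, group in groupby(rows, key=key)]


# Made-up rows, sorted by timestamp as in a cdx.
rows = [
    {"timestamp": "20130226010000", "digest": "AAA"},
    {"timestamp": "20130226010800", "digest": "AAA"},
    {"timestamp": "20130226020000", "digest": "BBB"},
    {"timestamp": "20130226023000", "digest": "AAA"},
]

hourly = collapse(rows, "timestamp", 10)  # like collapse=timestamp:10
by_digest = collapse(rows, "digest")      # like collapse=digest
print([row["timestamp"] for row in hourly])
print([row["timestamp"] for row in by_digest])
```

In the digest case the last row is kept even though its digest already appeared, because the matching rows are not adjacent.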
201 | + | ||
202 | + | ||
203 | + | ### Query Result Limits ### | |
204 | + | ||
205 | + | As the CDX server may return millions or billions of records, it is often necessary to set limits on a single query for practical reasons. | |
206 | + | The CDX server provides several mechanisms, including the ability to return the last N as well as the first N results. | |
207 | + | ||
208 | + | * The CDX server config provides a setting for absolute maximum length returned from a single query (currently set to 150000 by default). | |
209 | + | ||
210 | + | * Set **limit=** *N* to return the first N results. | |
211 | + | ||
212 | + | * Set **limit=** *-N* to return the last N results. The query may be slow as it begins reading from the beginning of the search space and skips all but last N results. | |
213 | + | ||
214 | + | Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=-1 | |
215 | + | ||
216 | + | * *Advanced Option:* **fastLatest=true** may be set to return *some number* of latest results for an exact match and is faster than the standard last results search. The number of results is at least 1 so **limit=-1** implies this setting. The number of results may be greater than 1 when a secondary index format (such as ZipNum) is used, but is not guaranteed to return any more than 1 result. Combining this setting with **limit=** *-N* will ensure that *no more* than the last N results are returned. | |
217 | + | ||
218 | + | Ex: This query will result in up to 5 of the latest (by date) query results: | |
219 | + | ||
220 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&fastLatest=true&limit=-5 | |
221 | + | ||
222 | + | * The **offset=** *M* param can be used in conjunction with limit to 'skip' the first M records. This allows for a simple way to scroll through the results. | |
223 | + | ||
224 | + | However, the offset/limit model does not scale well to large queries: the CDX server begins reading at the beginning every time | |
225 | + | and must read and skip through the number of results specified by **offset**. | |
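The cost of these options can be illustrated with a client-side emulation over an in-memory list. This is a sketch of the semantics described above, not the server's code; the function names are assumptions. Both functions walk the rows from the start, which is why a large **offset** or a negative **limit** can be slow.

```python
from collections import deque
from itertools import islice


def first_n(rows, n, offset=0):
    """Emulate offset=M with limit=N: skip M rows, then keep the next N."""
    # islice still has to step over the skipped rows one by one.
    return list(islice(rows, offset, offset + n))


def last_n(rows, n):
    """Emulate limit=-N: scan every row, remembering only the last N."""
    return list(deque(rows, maxlen=n))


rows = ["capture-%d" % i for i in range(10)]
print(first_n(rows, 3, offset=4))
print(last_n(rows, 2))
```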
226 | + | ||
227 | + | ||
228 | + | ## Advanced Usage | |
229 | + | ||
230 | + | The following features are for more specific/advanced usage of the CDX server. | |
231 | + | ||
232 | + | ||
233 | + | ### Resumption Key ### | |
234 | + | ||
235 | + | There is also a new method that allows the CDX server to specify a 'resumption key' that can be used to continue the query from where the previous one ended. | |
236 | + | This allows breaking up a large query into smaller queries more efficiently. | |
237 | + | This can be achieved by using the **showResumeKey=** and **resumeKey=** params. | |
238 | + | ||
239 | + | * To show the resumption key, add the **showResumeKey=true** param. When set, the resume key will be printed only if the query has more results that have not been printed because the **limit=** (or max query limit) number of results was reached. | |
240 | + | ||
241 | + | * After the end of the query, the *<resumption key>* will be printed on a separate line or, in JSON output, as a separate array entry. | |
242 | + | ||
243 | + | * Plain text example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true | |
244 | + | ||
245 | + | ``` | |
246 | + | org,archive)/ 19970126045828 http://www.archive.org:80/ text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415 | |
247 | + | org,archive)/ 19971011050034 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402 | |
248 | + | org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405 | |
249 | + | org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405 | |
250 | + | org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402 | |
251 | + | ||
252 | + | org%2Carchive%29%2F+19980109140106%21 | |
253 | + | ``` | |
254 | + | ||
255 | + | * JSON example: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&output=json | |
256 | + | ||
257 | + | ``` | |
258 | + | [["urlkey","timestamp","original","mimetype","statuscode","digest","length"], | |
259 | + | ["org,archive)/", "19970126045828", "http://www.archive.org:80/", "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"], | |
260 | + | ["org,archive)/", "19971011050034", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"], | |
261 | + | ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"], | |
262 | + | ["org,archive)/", "19971211122953", "http://www.archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1405"], | |
263 | + | ["org,archive)/", "19980109140106", "http://archive.org:80/", "text/html", "200", "XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3", "1402"], | |
264 | + | [], | |
265 | + | ["org%2Carchive%29%2F+19980109140106%21"]] | |
266 | + | ``` | |
267 | + | ||
268 | + | * In a subsequent query, adding **resumeKey=** *<resumption key>* will resume the search from the next result. | |
269 | + | No other params from the original query (such as *from=* or *url=*) need to be altered. | |
270 | + | To continue from the previous example, the subsequent query would be: | |
271 | + | ||
272 | + | Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true&resumeKey=org%2Carchive%29%2F+19980109140106%21 | |
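A client that follows resumption keys has to separate the key from the capture lines and then append it to the next query. The sketch below does this for the plain-text format shown above, using a shortened copy of that response. The helper names are assumptions, and the key is appended verbatim because the example query in this section passes it exactly as it was printed.

```python
def split_resume_key(body):
    """Split a plain-text response into (capture_lines, resume_key).

    A key is recognised as a final line preceded by a blank line;
    resume_key is None when the response carries no key.
    """
    lines = body.strip("\n").split("\n") if body.strip() else []
    if len(lines) >= 2 and lines[-2].strip() == "":
        return lines[:-2], lines[-1].strip()
    return lines, None


def next_query_url(query_url, resume_key):
    """Append the key as printed by the server, without re-encoding it."""
    return query_url + "&resumeKey=" + resume_key


# Shortened copy of the plain-text example response.
BODY = (
    "org,archive)/ 19971211122953 http://www.archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1405\n"
    "org,archive)/ 19980109140106 http://archive.org:80/ text/html 200 XAHDNHZ5P3GSSSNJ3DMEOJF7BMCCPZR3 1402\n"
    "\n"
    "org%2Carchive%29%2F+19980109140106%21\n"
)
QUERY = "http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true"

captures, key = split_resume_key(BODY)
print(len(captures), "captures, resume key:", key)
if key is not None:
    print(next_query_url(QUERY, key))
```

A full client would repeat the query with each new key until a response arrives without one.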
273 | + | ||
274 | + | ### Counters ### | |
275 | + | ||
276 | + | There is some work on custom counters to enhance the aggregation capabilities of the CDX server. | |
277 | + | These features are brand new and should be considered experimental. | |
278 | + | ||
279 | + | #### Duplicate Counter #### | |
280 | + | ||
281 | + | While collapsing allows for filtering out adjacent results that are duplicates, it is also possible to track duplicates throughout the cdx | |
282 | + | using a special new extension. | |
283 | + | By adding **showDupeCount=true**, a new `dupecount` column will be added to the results. | |
284 | + | ||
285 | + | * The duplicates are determined by tracking rows with the same `digest` field. | |
286 | + | ||
287 | + | * The `warc/revisit` mimetype in duplicates > 0 will automatically be resolved to the mimetype of the original, if found. | |
288 | + | ||
289 | + | * Using **showDupeCount=true** will only show unique captures: http://web.archive.org/cdx/search/cdx?url=archive.org&showDupeCount=true&output=json&limit=50 | |
290 | + | ||
291 | + | ||
292 | + | #### Skip Counter #### | |
293 | + | ||
294 | + | It is possible to track how many CDX lines were skipped due to [Filtering](#filtering) and [Collapsing](#collapsing) | |
295 | + | by adding the special `skipcount` counter with **showSkipCount=true**. | |
296 | + | An optional `endtimestamp` count can also be used to print the timestamp of the last capture by adding **lastSkipTimestamp=true** | |
297 | + | ||
298 | + | * Ex: Collapse results by year and print number of additional captures skipped and timestamp of last capture: | |
299 | + | ||
300 | + | http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=timestamp:4&output=json&showSkipCount=true&lastSkipTimestamp=true | |
301 | + | ||
302 | + | ||
303 | + | ### Pagination API ### | |
304 | + | ||
305 | + | The above resume key allows for sequential querying of CDX data. | |
306 | + | However, in some cases where very large queries are needed (for example, a domain query), it may be useful to perform queries | |
307 | + | in parallel and also estimate the total size of the query. | |
308 | + | ||
309 | + | `wayback` and `cdx-server` support a secondary loading from a 'zipnum' CDX index. | |
310 | + | This index contains CDX lines stored in concatenated GZIP blocks (usually 3,000 lines each) and a secondary index | |
311 | + | which provides binary search to the 'zipnum' blocks. | |
312 | + | By using the secondary index, it is possible to estimate the total size of a query and also break the query up into smaller parts. | |
313 | + | Using the zipnum format or other secondary index is needed to support pagination. | |
314 | + | ||
315 | + | However, pagination can only work on a single index at a time; merging input from multiple sources (plain cdx or zipnum) | |
316 | + | is not possible. As such, the results from a paginated query may be slightly less up-to-date than | |
317 | + | a default non-paginated query. | |
318 | + | ||
319 | + | * To use pagination, simply add the **page=i** param to the query to return the i-th page. If pagination is not supported, the CDX server will return a 400. | |
320 | + | ||
321 | + | * Pages are numbered from 0 to *num pages - 1*. If *i<0*, pages are not used. If *i>=num pages*, no results are returned. | |
322 | + | ||
323 | + | Ex: First page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0 | |
324 | + | ||
325 | + | Ex: Next Page: http://web.archive.org/cdx/search/cdx?url=archive.org&page=1 | |
326 | + | ||
327 | + | ||
328 | + | * To determine the number of pages, add the **showNumPages=true** param. This is a special query that will return a single number indicating the number of pages | |
329 | + | ||
330 | + | Ex: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true | |
331 | + | ||
332 | + | * Page size is the number of zipnum blocks scanned per page (so a page size of `1` will contain *up to* 3,000 results per page). This means the number of results on each page will vary, because each block may have a different number of CDX lines matching your query. Page size is configured to an optimal value on the CDX server, and may be similar to max query limit in non-paged mode. The CDX server on archive.org currently has a page size of 50. | |
333 | + | ||
334 | + | * It is possible to adjust the page size to a smaller value than the default by setting **pageSize=P**, where 1 <= P <= default page size. | |
335 | + | ||
336 | + | Ex: Get # of pages with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&showNumPages=true&pageSize=1 | |
337 | + | ||
338 | + | Ex: Get first page with smallest page size: http://web.archive.org/cdx/search/cdx?url=archive.org&page=0&pageSize=1 | |
339 | + | ||
340 | + | ||
341 | + | * If there is only one page, adding the **page=0** param will return the same results as without setting a page. | |
342 | + | ||
343 | + | * It is also possible to have the CDX server return the raw secondary index, by specifying **showPagedIndex=true**. This query returns the secondary index instead of the CDX results and may be subject to access restrictions. | |
344 | + | ||
345 | + | * All other params, including the resumeKey= should work in conjunction with pagination. | |
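Once the number of pages is known (from a **showNumPages=true** query), the per-page URLs can be generated up front and handed to parallel workers. The sketch below only builds the URLs and an upper bound on results per page; the function names are assumptions, and the bound uses the "usually 3,000 lines" per block figure quoted above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
LINES_PER_BLOCK = 3000  # typical zipnum block size quoted in this section


def page_urls(url, num_pages, page_size=None):
    """Return one query URL per page, numbered 0 .. num_pages - 1."""
    urls = []
    for page in range(num_pages):
        params = [("url", url), ("page", page)]
        if page_size is not None:
            params.append(("pageSize", page_size))
        urls.append(CDX_ENDPOINT + "?" + urlencode(params))
    return urls


def max_results_per_page(page_size):
    """Upper bound only: each block may hold fewer matching lines."""
    return page_size * LINES_PER_BLOCK


# num_pages=2 is a hypothetical value standing in for a showNumPages result.
for page_url in page_urls("archive.org", 2):
    print(page_url)
print(max_results_per_page(50))
```

With the archive.org page size of 50 the bound works out to 150,000, the same figure as the default maximum query length mentioned under Query Result Limits.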
346 | + | ||
347 | + | ||
348 | + | ||
349 | + | ### Access Control ### | |
350 | + | ||
351 | + | The cdx server is designed to improve access to archived data to a broad audience, but it may be necessary to restrict certain parts of the cdx. | |
352 | + | ||
353 | + | The cdx server supports granting permissions to restricted data via an API key that is passed in as a cookie. | |
354 | + | ||
355 | + | Currently two restrictions/permission types are supported: | |
356 | + | ||
357 | + | * Access to certain urls which are considered private. When restricted, only public urls are included in query results and access to the secondary index is restricted. | |
358 | + | ||
359 | + | * Access to certain fields, such as filename in the CDX. When restricted, the cdx results contain only public fields. | |
360 | + | ||
361 | + | ||
362 | + | To allow access, the API key cookie must be explicitly set on the client, e.g.: | |
363 | + | ||
364 | + | ``` | |
365 | + | curl -H "Cookie: cdx-auth-token=API-Key-Secret" "http://mycdxserver/search/cdx?url=..." | |
366 | + | ``` | |
367 | + | ||
368 | + | The *API-Key-Secret* can be set in the cdx server configuration. | |
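The same cookie can be attached from Python's standard library. This sketch only constructs the request object and does not send it; the host `mycdxserver` and the `API-Key-Secret` placeholder come from the curl example above, while `url=example.com` is a hypothetical query chosen for illustration.

```python
from urllib.request import Request

API_KEY = "API-Key-Secret"  # placeholder; use the value from the server config

# Build (but do not send) a request carrying the API key cookie.
request = Request(
    "http://mycdxserver/search/cdx?url=example.com",
    headers={"Cookie": "cdx-auth-token=" + API_KEY},
)
print(request.get_full_url())
print(request.get_header("Cookie"))

# Sending it would be: urllib.request.urlopen(request)
```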
369 | + | ||
370 | + | ||
371 | + | ## CDX Server Configuration ## | |
372 | + | ||
373 | + | ||
374 | + | TODO | |
375 | + | ||
376 | + | Start by editing the `wayback-cdx-server-servlet.xml` file in the `WEB-INF` directory, and put some valid CDX files in the `cdxUris` list (file names must end with `cdx` or `cdx.gz`). | |
377 | + | ||
378 | + |