Talking about the browser http caching mechanism

Analyses of the browser's HTTP cache are something of a cliché, and a good article on the topic appears every so often. The principle is practically a standard interview question at major companies.

I am writing this article because I have been busy with new technologies lately and want to "return" to the basics, and I hope to summarize the topic in more detail.

So do you still need to read this article? You can try to answer the following question:

When we visit the Baidu homepage, we find that no matter how often the page is refreshed, its static resources almost always return 200 (from cache):

Click on any one of those static resources:

Oops, there is response header data. The server seems to have returned an ETag and everything else, so shouldn't status 200 mean the resource was not served from cache? If it did come from cache, wouldn't 304 be the reasonable response?

Is Baidu's server misbehaving?

If you know the answer, you can ignore this article.

Cache-related header fields in HTTP messages

Let's take a look at the cache-related fields among the 47 HTTP header fields specified by RFC 2616. Knowing them in advance gives us a rough map:

1. General header fields (that is, fields that can be used in both request and response messages)

2. Request header field

3. Response header field

4. Entity header field

Each of them is introduced below.

Scene simulation

In order to facilitate the simulation of various cache effects, we build a very simple scene.

1. Page file

We build a very simple html page with only a local style file and image:

<!DOCTYPE html>
<html>
<head>
<title>Cache test</title>
<link rel="stylesheet" href="css/reset.css">
</head>
<body>
<h1>Brother is just a title</h1>
<img src="img/dog.jpg" />
</body>
</html>

2. Modification of the header field

Sometimes a browser adds fields to the request header on its own (for example, Chrome forcibly adds "Cache-Control: max-age=0" when you press F5), which overrides the effect of some other fields (such as Pragma). At other times, we want the server to return more or fewer response fields.

In this case, we hope that we can manually modify the content of the request or response message. So how to achieve it? Here we use Fiddler to complete the task.

In Fiddler, we can intercept specific requests with the "bpu XXX" command, then manually edit the request before it is sent to the server, and edit the response before it is sent to the client.

In our example, the page is served by nginx at http://localhost/, so we simply run "bpu localhost" to intercept every request whose address contains that word:

Click an intercepted request and you can edit the message directly in the right-hand pane (the upper half is the request, the lower half is the response). Click the yellow "Break on Response" button to perform the next step (send the request to the server), or click the green "Run to Completion" button to finish the whole request in one go:

This method lets us easily simulate all kinds of HTTP caching scenarios.

3. Browser-enforced policies

As mentioned above, most browsers add the "Cache-Control: max-age=0" request field when you click the refresh button or press F5. So let us agree on a convention first: in the rest of this article, "refresh" mostly means selecting the URL in the address bar and pressing Enter (which does not force Cache-Control to be added).

In fact, some browsers behave even more strangely, which we will come back to when answering the question from the beginning of the article.

Caching in the Stone Age

In the HTTP/1.0 era, client-side caching was controlled by two fields: "Pragma" and "Expires". Although both could have been retired long ago, many websites still send them for the sake of backward compatibility.

1. Pragma

When the value of this field is "no-cache" (in fact, this is now the only value the RFC defines), the client is told not to read the resource from the cache; it has to send a request to the server every time.

Pragma is a general header field. To use it on the client side, we are usually told to add this meta tag to the HTML (possibly with some hacks, such as placing it after the body):

<meta http-equiv="Pragma" content="no-cache">

It tells the browser not to read the page from the cache; every request for the page has to go to the server.

BUT!!! In practice, this way of disabling the cache is of limited use:

1. Only IE recognizes the meaning of this meta tag; other mainstream browsers only recognize the "Cache-Control: no-store" meta tag (see source).
2. Even when IE recognizes the meta tag, it does not necessarily add a Pragma field to the request, but it does make the current page send a new request every time (only the page itself; resources on the page are not affected).

My tests confirmed this. Defining Pragma on the client side in this way has very little effect.

However, adding this field to the response message is a different story:

As shown in the figure above, the red box marks the request generated when the page is refreshed again, which means the cache was indeed disabled. Presumably the browser marks the resource once it receives the Pragma field from the server and disables caching for it, so every later refresh re-sends the request instead of reading the cache.

2. Expires

With Pragma to disable caching, there naturally needs to be something to enable caching and define how long it lasts. In HTTP/1.0, Expires is the field that does this.

The value of Expires is a GMT (Greenwich Mean Time) timestamp, such as "Mon, 22 Jul 2002 11:12:01 GMT", which tells the browser when the cached resource expires. Until that moment has passed, no request is sent.

On the client side, we can also use a meta tag to tell IE (again, only IE recognizes it) how long to cache the page (again, this applies only to the page, not to the resources on it):

<meta http-equiv="expires" content="mon, 18 apr 2016 14:30:00 GMT">

If you want IE not to cache the page and to send a new request on every refresh, set the value of "content" to "-1" or "0".

Note that this method only serves as a marker that tells IE the cache time; you will not find an Expires field in the request or response message.

If the Expires field is returned in the server's response header, the cache time is set correctly in every browser:

In the figure above, the cache time is set to a moment that has already passed (see the red box), so refreshing the page re-sends the request (see the blue box).

So if Pragma and Expires appear together, which one wins? A quick experiment tells us:

We disable caching through Pragma and give Expires a time that has not yet passed (red box). When the page is refreshed, a new request is issued (blue box), which means Pragma has the higher priority.

BUT the expiration time defined by Expires in the response is relative to the server's clock. If the client's clock differs from the server's (especially if the user has changed the system time on their computer), the cache time may not mean much.
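The clock problem is easy to see in a few lines. The following is only a minimal sketch: the helper name isFreshByExpires and the explicit "now" argument are assumptions made for illustration, and real browsers perform this check internally.

```javascript
// Hypothetical helper: decide whether a cached response is still fresh
// according to its Expires header, given the client's current time.
function isFreshByExpires(expiresHeader, now) {
  const expiresAt = Date.parse(expiresHeader);
  // A value that cannot be parsed as a date is treated as already expired.
  if (Number.isNaN(expiresAt)) return false;
  return now.getTime() < expiresAt;
}

const expires = 'Mon, 18 Apr 2016 14:30:00 GMT';
// Client clock agrees with the server: the copy still counts as fresh.
const accurateClock = new Date('Mon, 18 Apr 2016 14:00:00 GMT');
// Client clock runs one hour fast: the very same copy looks expired.
const fastClock = new Date('Mon, 18 Apr 2016 15:00:00 GMT');

console.log(isFreshByExpires(expires, accurateClock));
console.log(isFreshByExpires(expires, fastClock));
```

The same Expires value gives two different answers depending only on the client's clock, which is exactly the weakness described above.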

Cache-Control

To address the problem that "the Expires time is relative to the server and cannot be guaranteed to match the client's clock", HTTP/1.1 added Cache-Control to define the cache expiration time. If Pragma, Expires, and Cache-Control appear in a message at the same time, Cache-Control takes precedence.

Cache-Control is also a general header field, which means it can be used in both request and response messages. The RFC specifies its format as:

"Cache-Control" ":" cache-directive

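When used in a request header, the possible values of cache-directive (as listed in RFC 2616) are: no-cache, no-store, max-age, max-stale, min-fresh, no-transform, only-if-cached, and cache-extension.

When used in a response header, the possible values of cache-directive (as listed in RFC 2616) are: public, private, no-cache, no-store, no-transform, must-revalidate, proxy-revalidate, max-age, s-maxage, and cache-extension.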

We can still add meta tags to the HTML page to add the Cache-Control field to the request header:

In addition, Cache-Control allows its values to be combined freely, for example:

Cache-Control: max-age=3600, must-revalidate

It means that the resource is obtained from the origin server and its cache lifetime (freshness) is one hour. During that hour, the client does not need to send a request to access the resource again; once the hour is over, must-revalidate requires the cache to check with the server before reusing the copy.
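To make the combined form concrete, here is a rough sketch of splitting a Cache-Control value into its directives. The function name parseCacheControl is made up for this example, and the parser is deliberately simple: it splits on commas, so it would mishandle quoted values that themselves contain commas.

```javascript
// Hypothetical helper: split a Cache-Control value into a directive map.
function parseCacheControl(headerValue) {
  const directives = {};
  for (const part of headerValue.split(',')) {
    const [rawName, rawValue] = part.trim().split('=');
    if (!rawName) continue; // skip empty segments such as a trailing comma
    const name = rawName.toLowerCase();
    // Directives without a value (e.g. must-revalidate) are stored as true.
    directives[name] = rawValue === undefined
      ? true
      : rawValue.replace(/^"|"$/g, '');
  }
  return directives;
}

const parsed = parseCacheControl('max-age=3600, must-revalidate');
console.log(parsed);
```

Values are kept as strings here, so a caller that needs the lifetime in seconds would convert parsed['max-age'] with Number().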

Of course, there are some restrictions on such combinations. For example, no-cache cannot be combined with max-age, min-fresh, or max-stale.

Combining values can also paper over inconsistent browser behavior. For example, in IE we can use no-cache to prevent page resources from being loaded from the cache when the "back" button is clicked, but in Firefox we need no-store to stop the browser from reading the cache when navigating back through history. We can therefore cover both by adding the following combination to the response header:

Cache-Control: no-cache, no-store

Cache validation fields

The header fields above let the client decide whether to send a request to the server at all. For example, if the configured cache time has not expired, the data is simply read from the local cache (Chrome shows this as 200 from cache). If the cache time has expired, or the resource must not be served straight from the cache, a request is sent to the server.
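The decision described so far can be sketched as a single function that follows the priority order observed in this article: Cache-Control first, then Pragma, then Expires. This is an illustration under stated assumptions, not how any particular browser is implemented: the name freshnessLifetime and the lower-case header map are invented here, and heuristic freshness (which real caches may apply when no explicit lifetime is given) is left out.

```javascript
// Hypothetical helper: how many seconds a response may be served from the
// cache without contacting the server (0 means "always ask the server").
function freshnessLifetime(headers) {
  const cacheControl = (headers['cache-control'] || '').toLowerCase();
  if (cacheControl) {
    if (/\bno-(cache|store)\b/.test(cacheControl)) return 0;
    const match = cacheControl.match(/\bmax-age=(\d+)/);
    if (match) return Number(match[1]);
  }
  // No usable Cache-Control directive: fall back to the HTTP/1.0 fields.
  if ((headers['pragma'] || '').toLowerCase().includes('no-cache')) return 0;
  if (headers['expires'] && headers['date']) {
    const seconds =
      (Date.parse(headers['expires']) - Date.parse(headers['date'])) / 1000;
    return Number.isFinite(seconds) && seconds > 0 ? seconds : 0;
  }
  return 0;
}

console.log(freshnessLifetime({
  'cache-control': 'max-age=60',
  'pragma': 'no-cache',
  'expires': 'Mon, 22 Jul 2002 11:12:01 GMT',
}));
```

Measuring Expires against the response's own Date field, rather than against the local clock, is a design choice of this sketch that sidesteps the clock-skew issue mentioned earlier.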

The question now is: if the client does send a request to the server, must the entire entity body of the resource be transferred again?

Think of it this way: the cache time of a resource has expired on the client, but the server has not updated the resource in the meantime. If the resource is large, isn't it a waste of bandwidth and time for the client to make the server send the whole thing again?

The answer is yes. So is there a way for the server to learn that the client's cached file is in fact identical to its own, and then simply tell the client: "Just use your cached copy. I haven't updated it, so I won't send it again."

To let the client and server verify whether a cached file has been updated, and to improve cache reuse, HTTP/1.1 added several header fields for this purpose.

1. Last-Modified

When the server sends a resource to the client, it adds the resource's last modification time to the entity headers in the form "Last-Modified: GMT".

The client stores this information with the resource and, the next time it requests the resource, attaches it to the request so the server can check it. If the time sent by the client matches the resource's last modification time on the server, the resource has not been modified and the server simply returns a 304 status code.

Two request header fields can carry the stored last modification time:

If-Modified-Since: Last-Modified-value

Example: If-Modified-Since: Thu, 31 Mar 2016 07:07:52 GMT

This request header tells the server that if the last modification time sent by the client matches the one on the server, it can simply send back 304 and the response headers.

Currently, all browsers use this request header to send the stored Last-Modified value to the server.

If-Unmodified-Since: Last-Modified-value

This tells the server that if Last-Modified does not match (the resource's last modification time on the server has changed), it should return a 412 (Precondition Failed) status code to the client.

The If-Unmodified-Since field is ignored in the following situations:

1. The Last-Modified value matches (the resource has not been modified on the server);

2. The server would return a status code other than 2XX or 412;

3. The date passed is invalid.

Last-Modified is good but not perfect: if a resource is touched on the server without its actual content changing at all, the entire entity is still returned to the client because the Last-Modified time no longer matches (even though the client cache holds an identical copy).
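The If-Modified-Since check described above might look roughly like this on the server side. It is a sketch only: the function name statusForIfModifiedSince is hypothetical, and it treats "not modified since the given time" as a match, which is slightly looser than the exact-equality wording used above.

```javascript
// Hypothetical helper: choose a status code for a GET request based on the
// If-Modified-Since header and the resource's last modification time.
function statusForIfModifiedSince(ifModifiedSince, lastModified) {
  const since = Date.parse(ifModifiedSince || '');
  // Header missing or not a valid date: send the full body.
  if (Number.isNaN(since)) return 200;
  // HTTP dates have one-second resolution, so drop the milliseconds.
  const modifiedSeconds = Math.floor(lastModified.getTime() / 1000);
  return modifiedSeconds <= since / 1000 ? 304 : 200;
}

const lastModified = new Date('Thu, 31 Mar 2016 07:07:52 GMT');
console.log(statusForIfModifiedSince('Thu, 31 Mar 2016 07:07:52 GMT', lastModified));
```

Note that touching the file changes lastModified even if the bytes are identical, so this check alone would return 200 in the situation the last paragraph complains about.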

2. ETag

To solve this inaccuracy problem with Last-Modified, HTTP/1.1 also introduced the ETag entity header field.

The server uses some algorithm to compute a unique identifier for the resource (such as an MD5 hash) and, when sending the resource to the client, adds "ETag: unique identifier" to the entity headers.

The client keeps the ETag and sends it back to the server with the next request. The server only needs to compare the client's ETag with the ETag of its own copy to tell whether the resource has changed relative to the client.

If the ETags do not match, the server sends the new resource (with its new ETag) in a regular 200 response to the GET request; if they match, it simply returns 304 to tell the client to use its local cache.

So how does the client send the ETag it stored for a resource to the server? Two request header fields can carry the ETag value:

If-None-Match: ETag-value

Example: If-None-Match: "56fcccc8-1699"

This tells the server to resend the resource data if the ETag does not match, and otherwise to send back just 304 and the response headers.

Currently, all browsers use this request header to send the stored ETag value to the server.
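As an illustration of the ETag round trip, the sketch below hashes the body with MD5 and compares the result against If-None-Match. The MD5 scheme and the names computeETag and respond are assumptions chosen for this example; real servers pick their own algorithm (the nginx-style value shown above is clearly not an MD5 digest), and weak validators are not handled here.

```javascript
const crypto = require('crypto');

// Hypothetical ETag scheme: a quoted MD5 hex digest of the body.
function computeETag(body) {
  return '"' + crypto.createHash('md5').update(body).digest('hex') + '"';
}

// Returns the status, ETag and body a server would send for a GET request.
function respond(body, ifNoneMatch) {
  const etag = computeETag(body);
  // If-None-Match may carry several comma-separated tags, or "*".
  const candidates = (ifNoneMatch || '').split(',').map((tag) => tag.trim());
  if (candidates.includes(etag) || candidates.includes('*')) {
    return { status: 304, etag, body: null };
  }
  return { status: 200, etag, body };
}

const first = respond('body { margin: 0 }');                     // no validator yet
const repeat = respond('body { margin: 0 }', first.etag);        // unchanged file
const changed = respond('body { margin: 0; padding: 0 }', first.etag);
console.log(first.status, repeat.status, changed.status);
```

Because the tag depends only on the content, re-saving a file without changing it still yields 304, which is the case Last-Modified gets wrong.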

If-Match: ETag-value

This tells the server that if the ETag does not match, or if the value is "*" and no such resource entity currently exists, it should return a 412 (Precondition Failed) status code to the client. Otherwise, the server simply ignores this field.

One use case for If-Match is a client using the PUT method to upload or replace a resource on the server; the resource's ETag can be passed through If-Match.
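The PUT scenario can be sketched with an in-memory store. Everything here is hypothetical (the Map-based store, the function name putWithIfMatch, and the choice of 201/204 as success codes); the point is only to show where the 412 comes from.

```javascript
// Hypothetical in-memory store: path -> { etag, body }.
function putWithIfMatch(store, path, ifMatch, newBody, newETag) {
  const current = store.get(path);
  const matches = current !== undefined &&
    (ifMatch === '*' || ifMatch === current.etag);
  if (ifMatch !== undefined && !matches) {
    return 412; // Precondition Failed: the stored copy is not the expected one
  }
  store.set(path, { etag: newETag, body: newBody });
  return current === undefined ? 201 : 204;
}

const store = new Map([['/reset.css', { etag: '"v1"', body: 'a{}' }]]);
// First writer holds the current ETag, so the update goes through.
const firstWrite = putWithIfMatch(store, '/reset.css', '"v1"', 'b{}', '"v2"');
// Second writer still holds the old ETag and is rejected.
const staleWrite = putWithIfMatch(store, '/reset.css', '"v1"', 'c{}', '"v3"');
console.log(firstWrite, staleWrite);
```

This is what makes If-Match useful for avoiding lost updates: a client that edited an outdated copy cannot silently overwrite someone else's change.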

Note that if resources are stored on distributed servers (such as a CDN), the algorithm for computing the ETag must be the same on all of them, so that the same file does not get one ETag on server A and a different one on server B.

If Last-Modified and ETag are used together, both must validate before 304 is returned. If either validation fails, the server returns the resource entity with a 200 status code as usual.
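The "both must pass" rule can be written down as follows. This sketch follows the behavior described in this article, which matches RFC 2616; the later HTTP/1.1 revision (RFC 7232) instead tells servers to ignore If-Modified-Since when If-None-Match is present. The function name and the shape of the request and resource objects are assumptions for illustration.

```javascript
// Hypothetical helper: true when every validator the client sent still
// matches the server's copy, i.e. the server may answer 304.
function isNotModified(request, resource) {
  const checks = [];
  if (request.ifNoneMatch !== undefined) {
    checks.push(request.ifNoneMatch === resource.etag);
  }
  if (request.ifModifiedSince !== undefined) {
    checks.push(
      Date.parse(request.ifModifiedSince) >= Date.parse(resource.lastModified)
    );
  }
  // With no validators at all there is nothing to confirm: send 200.
  return checks.length > 0 && checks.every(Boolean);
}

const resource = {
  etag: '"56fcccc8-1699"',
  lastModified: 'Thu, 31 Mar 2016 07:07:52 GMT',
};
console.log(isNotModified(
  { ifNoneMatch: '"56fcccc8-1699"', ifModifiedSince: 'Thu, 31 Mar 2016 07:07:52 GMT' },
  resource
));
```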

These two functions are enabled by default on the newer nginx:

In the figure above, the first three requests are the original ones, and the next three are new requests made after refreshing the page. Before the refresh we modified reset.css, so its Last-Modified and ETag changed and the server returned the new file to the client (status 200).

We did not modify dog.jpg, so its Last-Modified and ETag are unchanged on the server. The server therefore returns just a 304 status code so the client uses its cached dog.jpg, without sending the entity body (because it is not needed).

Cache practice

When we apply HTTP caching in a project, we still use most of the header fields mentioned above. For example, we use Expires for compatibility with old browsers and Cache-Control for more precise control, then turn on ETag and Last-Modified to reuse the cache further and reduce traffic.

That raises a small question: what values are appropriate for Expires and Cache-Control?

The answer is that there are no exact values; they have to be evaluated case by case.

For example, requests for pages generally should not be cached for long, so that the request is re-sent when the user returns to the page. The Baidu homepage uses Cache-Control: private, and the Tencent homepage is cached for 60 seconds, i.e. Cache-Control: max-age=60.

Static resources, images in particular, usually get a longer cache time, and ideally this time can be adjusted flexibly from the client side. Take one of Tencent's images as an example:

http://i.gtimg.cn/vipstyle/vipportal/v4/img/common/logo.png?max_age=2592000

The client can set the cache time returned by the server by adding the "max_age" parameter to the image URL:

Of course, this has a prerequisite: the static resource must be guaranteed not to change for a long time. If a script file is sent to the client and cached for a long time, and the server modifies the file shortly afterwards, clients that have cached the script will not get the new version in time.

The solution is also simple: bring the server-side ETag idea to the front end. The page's static resources are published in versioned form, commonly by putting an MD5 string or a timestamp in the file name or in a query parameter:

https://hm.baidu.com/hm.js?e23800c454aa573c0ccb16b52665ac26

http://tb1.bdstatic.com/tb/_/tbean_safe_ajax_94e7ca2.js

http://img1.gtimg.com/ninja/2/2016/04/ninja145972803357449.jpg

When a file is modified, its tag changes too, which ensures that the client receives the newly modified file from the server in time.
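A build step that produces such versioned names could be sketched like this. The function name versionedName, the underscore separator, and the 7-character truncation are assumptions loosely modeled on the look of the file names above, not a description of how those sites actually generate them.

```javascript
const crypto = require('crypto');

// Hypothetical helper: insert a short content hash before the extension.
function versionedName(fileName, content) {
  const hash = crypto.createHash('md5').update(content).digest('hex').slice(0, 7);
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return fileName + '_' + hash; // no extension
  return fileName.slice(0, dot) + '_' + hash + fileName.slice(dot);
}

const before = versionedName('reset.css', 'body { margin: 0 }');
const after = versionedName('reset.css', 'body { margin: 0; padding: 0 }');
console.log(before, after);
```

Because the URL itself changes whenever the content does, the file can be served with a very long max-age without the risk of clients being stuck on an old version.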

Back to the opening question

Looking back at the question from the beginning of the article, you may now find it easy to answer.

After the Baidu homepage was refreshed, no request was actually sent for its resources, because the cache lifetime defined by Cache-Control had not yet expired. Even though no request is sent, Chrome shows a pseudo request with status 200 and the annotation "from cache" in the Network panel whenever a resource is read from the local cache. The response shown is simply the data kept from the previous response.

However, this is not the full answer. As mentioned earlier, if you click the "Refresh" button in Chrome, Chrome forces the request header "Cache-Control: max-age=0" onto all resources and sends validation requests to the server. In the animation at the beginning of the article we did click the "Refresh" button, yet the browser did not send new requests (and receive 304).

I discussed this with friends in a chat group. Capturing packets with Fiddler, I found that if I close the Chrome developer panel and click "Refresh", the browser sends validation requests as expected and receives 304 responses. Moreover, how often this odd behavior occurs differs between websites and even between computers, so for now I attribute it to a browser quirk.

That leaves one more question: is there a way to stop the browser from sending new validation requests when the user clicks the "Refresh" button?

There is, but it is not very practical: add resources dynamically through a script after the page has loaded:

$(window).load(function () {
  var bg = 'http://img.infinitynewtab.com/wallpaper/100.jpg';
  setTimeout(function () { // setTimeout is required
    $('#bgOut').css('background-image', 'url(' + bg + ')');
  }, 0);
});

The approach comes from Zhihu, where you can find a more detailed explanation.

Other related header fields

We have now covered the most commonly used and important cache-related fields. Here are a few more response header fields that are related to caching but less central.

1. Vary

"Vary" literally means "change", but in HTTP messages it is closer to "vary from" (differ according to ...): it names the header fields the server uses to distinguish and select cached versions.

Consider this problem: for a given address, the server returns content built for IE to IE users, and content for other mainstream browsers otherwise. That is simple enough; the server can read the User-Agent field of the request. But suppose the user's request goes to a proxy server rather than the origin server. If the proxy hands its cached IE version of the resource to a non-IE client, we have a problem.

Vary is the field that deals with this problem. We can add the following to the response message:

Vary: User-Agent

It tells the proxy server to distinguish cached versions by the User-Agent request header field, so that the wrong cached copy is not sent to the client.

Vary also accepts a combination of fields:

Vary: User-Agent, Accept-Encoding

This means that cached versions should be distinguished by two request header fields, User-Agent and Accept-Encoding.
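One way to picture what a proxy does with Vary is to build the cache key from the URL plus the request header values that Vary names. The sketch below is an assumption-laden illustration (the cacheKey function, the "|" separator, and the lower-case header map are all invented here), not the algorithm of any specific proxy.

```javascript
// Hypothetical helper: derive a cache key from the URL and the request
// headers listed in the response's Vary field.
function cacheKey(url, varyHeader, requestHeaders) {
  const names = varyHeader
    .split(',')
    .map((name) => name.trim().toLowerCase())
    .filter(Boolean)
    .sort(); // make the key independent of the order used in Vary
  const parts = names.map((name) => name + '=' + (requestHeaders[name] || ''));
  return [url, ...parts].join('|');
}

const vary = 'User-Agent, Accept-Encoding';
const ieKey = cacheKey('/page', vary, { 'user-agent': 'IE', 'accept-encoding': 'gzip' });
const chromeKey = cacheKey('/page', vary, { 'user-agent': 'Chrome', 'accept-encoding': 'gzip' });
console.log(ieKey);
console.log(chromeKey);
```

Two clients that differ in any listed header get different keys, so the IE version is never handed to a non-IE client. The flip side is that varying on User-Agent splits the cache into many entries, since user agent strings differ widely.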

2. Date and Age

HTTP offers no direct way for users to tell whether a resource they received came from a proxy server's cache, but on the client side we can work it out from the Date and Age fields in the response message.

Date is, of course, the time (in GMT format) at which the origin server sent the response. If the Date value differs greatly from the "current time", or if it does not change when you keep refreshing with F5, your request is hitting the proxy server's cache.

The "current time" above is naturally the origin server's time, so how do we know the origin server's current time?

It can usually be read from the response to the request for the page itself. Take the Blog Garden homepage as an example:

Every time you refresh the page, the browser re-sends the request for this URL, and you will see its Date value keep changing, which means the request does not hit a cache and the data comes from the origin server.

We can therefore compare the Date in the responses for other static resources on the page against it. If a static resource's Date is earlier than the origin server's time, the proxy server's cache was hit.

The following condition usually holds as well:

Static resource Age + static resource Date = original server Date

Age is also a response header field. It gives the time (in seconds) that the file has been held by the proxy server. If the file is modified or replaced, Age starts counting from 0 again.

In the same scenario as the Blog Garden homepage screenshot above, let's check whether a particular file (jQuery.js) was served from the proxy server's cache:

You will find that it satisfies the rule above:

// returns true
new Date('Mon, 04 Apr 2016 07:03:17 GMT') / 1000 == new Date('Sat, 19 Dec 2015 01:29:14 GMT') / 1000 + 9264843

However, this rule is not always accurate, especially when the origin server's system time is changed frequently.
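Since the equality may be off by a little in practice, a tolerance makes the check more forgiving. This is a sketch only: the function name looksLikeProxyHit and the default tolerance of 2 seconds are arbitrary choices for illustration, and the sample values are the ones quoted above.

```javascript
// Hypothetical helper: does "resource Date + Age" land close enough to the
// origin server's Date to suggest the resource came from a proxy cache?
function looksLikeProxyHit(resourceDate, ageSeconds, originDate, toleranceSeconds = 2) {
  const expected = Date.parse(resourceDate) / 1000 + ageSeconds;
  const actual = Date.parse(originDate) / 1000;
  return Math.abs(expected - actual) <= toleranceSeconds;
}

console.log(looksLikeProxyHit(
  'Sat, 19 Dec 2015 01:29:14 GMT', // Date of the static resource
  9264843,                         // Age of the static resource, in seconds
  'Mon, 04 Apr 2016 07:03:17 GMT'  // Date of the page response
));
```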

That wraps up this summary of how HTTP caching works. I hope you have gained something from it. Let's keep encouraging each other~