# Defects of javascript language number operation

## Preface

Anyone who has gotten started with js knows that the mathematical operations of js are flawed. Of course, this problem does not only appear in Javascript. Almost all programming languages use IEEE-745 floating point number notation. Any use The programming language of binary floating-point numbers will have this problem, but in many other languages, methods have been encapsulated to avoid the problem of precision. JavaScript is a weakly typed language, and there is no such thing as floating-point numbers from the design concept. Strict data types, so the problem of accuracy errors is particularly prominent. But perhaps few people will wonder why there is such a problem. In this article, we will start with the basics of computer, tell you why there are such problems, and the ways to solve these defects.

## 1. How is the decimal system converted to binary?

Everyone knows that the data stored inside the computer is all binary, no matter what type of data, it will be converted to binary storage and calculated at the end. Decimal is no exception. In the code, we write an assignment:, `const a = 1`this code is converted into a binary number that the computer can recognize, so that the CPU can execute it. The numbers `1`will also be converted to binary, so what is the conversion law between the two?

For example, the integer 1, converted to binary is:

0000 0000 0000 0001

Next, we talk about how positive integers, negative integers, and floating-point numbers are converted to binary, so as to lay the foundation for the following introduction

### 1.2. Convert positive integers to binary

A simple summary of the method is: Divide by two and take the remainder, then arrange in reverse order, and fill in the high bits with zeros.

Take a chestnut, the number 13, according to the above method to divide by two to get the remainder:

13/2 = 6 --- 1 6/2 = 3 --- 0 3/2 = 1 --- 1 1/2 = 1 --- 0

Then discharge in reverse order: 1011

Then the high bit is zero-filled (assuming the memory is 32-bit): 0000 0000 0000 0000 0000 0000 0000 1011

### 1.3. Convert negative integers to binary

In computers, negative numbers are stored as the complement of their positive values.

So what is the complement? You can know all the children's shoes with a little computer foundation, right?

The computer has three major codes: original code, inverse code, and complement code.

The original code is the number converted from a positive integer just now

The inverse code is a new binary number obtained by inverting the original code bit by bit, such as the 13 just now:

Original code: 0000 0000 0000 0000 0000 0000 0000 1011

Inverse code: 1111 1111 1111 1111 1111 1111 1111 0100

The complement is the inverse code plus 1, that is:

Complement: 1111 1111 1111 1111 1111 1111 1111 0101

The obtained complement is the storage form of -13 in the memory, and the hexadecimal representation is: 0xFFFF FFF5

So the conversion of negative integers is three more steps than the positive integers just now:

1. Take the absolute value of the negative integer to get the result A
2. Convert a positive integer A to binary, the result is B
3. Reverse each bit of B to get the result C
4. Add 1 to result C

### 1.4. Convert floating-point numbers to binary

Converting decimals to binary uses the method of "multiplying by 2 and rounding in order". The specific method is to multiply the decimal fraction by 2 to get the product. Take out the integer part of the product, and then multiply the remaining fractional part by 2 to get another product. Take out the integer part of the product, and do so until the fractional part in the product is zero, at this time 0 or 1 is the last digit of the binary, or the required precision is reached.

For example: 0.125 to binary:

0.125 2 = 0.25 ---> 0 0.25 2 = 0.5 ---> 0 0.5 * 2 = 1.0 ---> 1

So its binary is (0.001)B

Let s reversely verify: 0.125 = 0 2^(-1) + 0 2^(-2) + 1 * 2^(-3) = 1/8 = 0.125

The result is valid.

The principle of the above conversion method can be referred to: decimal to binary

## 2. JS double-precision format storage

Js's digital storage standard is IEEE 754 , and the standard uses 64-bit double-precision floating-point numbers, where:

Bit 0: sign bit, s means, 0 means positive number, 1 means negative number;

1st to 11th: store the exponent part, e means;

The 12th to the 63rd: store the decimal part (that is, significant digits), f means

As shown below:

To finally understand the final storage format of numbers, we also need to know scientific notation

The expression is: `m b^n (1 m <b n )`

Give me :

1234 = 1.234 10^3

Of course there is also binary notation:

Give a chestnut:

Decimal floating-point number 7.25 is converted to binary representation: 111.01 (don t ask me how to convert it, I don t know how to continue reading the article in the first section above), expressed in binary scientific notation:

1.1101 2^2, we can verify it: (1 2^0 + 1 2^(-1) + 1 2^(-2) + 0 2^(-3) + 1 2^(-4)) 2 ^2 = (1+0.5+0.25+0.0625) 2^2 = 7.25

Now that we know the binary scientific notation conversion, we have to talk about some rules of IEEE 754:

1. The sign bit determines the sign of a number, the exponent part determines the magnitude of the value, and the decimal part determines the accuracy of the value.

2. According to IEEE 754, the first digit of the significant number M is always 1 by default, and is not stored in a 64-bit floating point number. In other words, the significant figures are always in the form of 1.xx...xx, where the xx..xx part is stored in a 64-bit floating point number, and the longest may be 52 digits, but it can represent 53 significant digits. For example, when saving 1.01, only 01 is saved, and when it is read, the first 1 is added. The purpose of this is to save 1 significant figure.

3. E is an unsigned integer (unsigned int). Because E is 11 bits, its value range is 0-2047. However, we know that E in scientific notation can appear negative numbers, so IEEE 754 stipulates that the true value of E must be added with an intermediate number before it can be stored in the memory. For an 11-bit E, the intermediate number is 1023. For example, in 7.25 just now, E is 2, and it should be 2+127=129 when stored in the memory, which is 000 10000 0001

4. In addition, E also needs to consider the following three situations:

(1) E is not all 0 or not all 1. At this time, the floating-point number is expressed by the above rule, that is, the calculated value of the exponent E is subtracted from 127 (or 1023) to obtain the true value, and then the first digit 1 is added to the effective number M.

(2) E is all 0. At this time, the exponent E of the floating-point number is equal to 0-127 (or 0-1023), and the effective digit M is no longer added to the first 1 but is reduced to a decimal of 0.xxxxxx. This is done to represent 0, and very small numbers close to zero.

(3) E is all 1. At this time, if the significant digits M are all 0, it means infinity (positive and negative depends on the sign bit s); if the significant digits M are not all 0, it means that the number is not a number (NaN)

Integrating all the above information, we use the following example:

The format of floating-point number 0.125 stored in memory is calculated according to the following steps:

1. 0.125 to binary: 0.001
2. Convert from binary to scientific notation: 1.0 * 2^-3
3. According to the above expression, its E=-3+1023, M=0, so the storage content is as follows:

The reverse direction can be directly converted to decimal:

1. 0 means positive
2. E meets the first situation mentioned above: the true value is 1020-1023=-3
3. The effective number M is 0

So the scientific notation of binary is: 1.0*2^-3 => converted to decimal to 0.125

## 3. Defects of js floating-point operations

With the foundation of the above two subsections consolidated, now we can explain why 0.1+0.2 in js is equal to a long string of numbers? 0.1*0.2 is also a long string of numbers?

The root cause is that the conversion of 0.1 to binary is an infinite loop:

0.1 => 0.0 0011 0011 0011...

Because the effective number in the double-precision format is 52 bits, the complete value after 0.1 conversion is 0.0 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011

Convert to scientific notation: 1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 * 2^-4

So the real storage structure is:

Similarly, the real storage structure of 0.2 is:

The actual stored bit pattern is used as an operand for floating-point number addition to get 0-01111111101-00110011001100110011001100110011001100110011001100110100

Converted to decimal is 0.300000000000000044408920985006

Reference: 0.300000000000000044408920985006

But why does javascript only print the first 17 significant digits?

The reason can be viewed in my question on stackoverflow

How does JavaScript determine the number of digits to produce when formatting floating-point values?

Another classic example is: 0.1 + 0.125 Why doesn't the result have many decimal points?

Because the two are added together to get: 0.225000000000000005551115123126

According to the above explanation, 0.22500000000000000 is obtained, and the last 0 is 0.225.

Reference: 0.225000000000000005551115123126

## 3. js super large integer overflow

Because all numbers need to be converted to binary, and the converted binary is then stored in the memory unit using scientific notation, so we know that the length of M determines the range of numbers that can be represented.

The fixed length of M is 52 bits, plus the omitted one, the maximum number that can be represented is 2^53-1 = 9007199254740991, so the range is -9007199254740991 ~ 9007199254740991

It can be confirmed by the constants defined by JS:

`Number.MIN_SAFE_INTEGER`with`Number.MAX_SAFE_INTEGER`

What if you store a value that exceeds this maximum integer? What will happen?

The answer is of course overflow. So now we not only need to know the overflow, but also want to know the law of overflow?

Because M is at most 53 digits, any excess digits will overflow and be truncated. Let's give an example :

The binary representation of 9007199254740992:

100000000000000000000000000000000000000000000000000000

Because only 52 bits can be stored, this value will be truncated:

1.0000000000000000000000000000000000000000000000000000 * 2^53

Save to memory is:

Take it out to restore, you can restore to the original value, because after adding 53 zeros, you get the original binary number, so you can get the original number after conversion: 9007199254740992.

The binary is: 100000000000000000000000000000000000000000000000000001

Because of the limitation of M, only 52 bits can be saved, so the last 1 will be truncated:

1.0000000000000000000000000000000000000000000000000000 * 2^53

Save to the memory will be the same as 9007199254740992, so after restoring back to decimal, because the last digit cannot be restored, which is 1, the decimal number obtained is still 9007199254740992,

Therefore, if you write in the console, it is judged to be true: 9007199254740992 === 9007199254740993

Do you see any pattern after giving such an example?

The overflowed number will be truncated, resulting in several corresponding numbers that will be equal.

Regarding this point, the cattle people have summed up the corresponding law:

1. The number between (2^53, 2^54) will choose one of two, which can only accurately represent an even number. 2. The number between (2^54, 2^55) will choose one of four, which can only be accurate Represents 4 multiples 3,.. Skip more multiples of 2 in turn

The picture below is a good explanation of the severity of the lack of precision caused by integer overflow:

## 4. Solution

1. Use mathjs
2. Use number-precision
3. Use bigjs
4. Write a few simple and usable arithmetic functions by yourself, which can be found on the Internet.

The trade-offs and choices of these methods are determined according to project conditions.

This is the end of the introduction. If there are any omissions in the article, you can leave a message to point out, thank you.