I. Overview
MD5 stands for Message-Digest Algorithm 5. It was developed in the early 1990s by the MIT Laboratory for Computer Science and Ronald L. Rivest of RSA Data Security Inc., as the successor to MD2, MD3 and MD4. Its purpose is to "compress" a large amount of information into a secure fixed-size form (that is, to convert a byte string of any length into a large integer of fixed length) before digital signature software signs it with a private key. MD2, MD4 and MD5 all take a message of arbitrary length and produce a 128-bit message digest. Although the structures of these algorithms are broadly similar, the design of MD2 is quite different from that of MD4 and MD5, because MD2 is optimized for 8-bit machines while MD4 and MD5 are designed for 32-bit machines. Internet RFC 1321 (http://www.ietf.org/RFC/RFC1321.txt), written by Ronald L. Rivest, describes MD5 in detail together with source code in C; MD2 and MD4 are described in RFC 1319 and RFC 1320.
Rivest developed the MD2 algorithm in 1989. The algorithm first pads the message so that its length in bytes is a multiple of 16. A 16-byte checksum is then appended to the end of the message, and the hash value is computed over the result. Later, Rogier and Chauvaud found that collisions can be produced for MD2 if the checksum is omitted. With the checksum in place, the MD2 result was considered unique, with no known duplicates.
To strengthen the security of the algorithm, Rivest developed MD4 in 1990. MD4 also pads the message, this time so that its length in bits is congruent to 448 modulo 512 (bit length mod 512 = 448). A 64-bit binary representation of the original message length is then appended. The message is processed in 512-bit blocks using the Damgård/Merkle iterative structure, and each block passes through three distinct rounds. Den Boer, Bosselaers and others quickly found attacks on versions of MD4 in which the first or the third round is omitted. Dobbertin showed how to find a collision for the full MD4 on an ordinary personal computer in a few minutes (a collision is a weakness in which different inputs produce the same hash result). MD4 has therefore been abandoned.
Despite this serious weakness, MD4 strongly influenced the design of several hash algorithms developed later. In addition to MD5, these include SHA-1, RIPEMD and HAVAL.
A year later, in 1991, Rivest developed the more mature MD5 algorithm. It adds "safety belts" to MD4. MD5 is a little slower than MD4 but more secure. The algorithm consists of four rounds, whose design differs slightly from that of MD4. The digest size and the padding requirements are exactly the same as in MD4. Den Boer and Bosselaers found pseudo-collisions in MD5, but no other cryptanalytic results were known.
Van Oorschot and Wiener considered a brute-force search for collisions in hash functions. They estimated that a machine designed specifically to search for MD5 collisions (costing about one million dollars to build in 1994) could find a collision every 24 days on average. However, in the years from 1991 to 2001 no new algorithm such as an MD6 appeared to replace MD5, which suggests that this weakness had little effect on the security of MD5 in practice. None of the results above was enough to cause problems in practical applications of MD5. Moreover, MD5 can be used without paying any licence fees, so for general use (outside top-secret applications, and even there MD5 served as a useful intermediate technique) MD5 was regarded as very safe.
II. Applications of the algorithm
A typical application of MD5 is to generate a message digest for a piece of data so that tampering can be detected. For example, on Unix many software downloads come with a file that has the same name plus the extension .md5. This file usually contains a single line of text with the following general structure:
MD5(tanajiya.tar.gz) = 0ca175b9c0f726a831d895e269332461
This is the digital fingerprint of the file tanajiya.tar.gz. MD5 treats the whole file as one large message and produces this MD5 message digest through an irreversible transformation. If the content of the file is later changed in any way while it is being distributed (whether by deliberate modification or by transmission errors caused by an unstable line during download), recomputing the MD5 of the file gives a different digest, so you can tell that the file you received is not the correct one. With a third-party certification authority, MD5 can also prevent the author of a document from "repudiating" it, which is the so-called digital signature application.
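As a minimal sketch of this check, the Python snippet below recomputes a digest with the standard library's hashlib.md5 and compares it with a published value. The helper names md5_hex and matches_published_digest are illustrative assumptions, and the data is an in-memory byte string standing in for a downloaded file.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 lowercase hex characters."""
    return hashlib.md5(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """Compare a freshly computed digest with the value from a .md5 file.

    Hypothetical helper: it ignores case and surrounding whitespace in the
    published value.
    """
    return md5_hex(data) == published_hex.strip().lower()


original = b"abc"
published = md5_hex(original)  # what the author would publish

print(matches_published_digest(original, published))  # unchanged data
print(matches_published_digest(b"abd", published))    # one byte altered
```

For a large file, the same idea applies if the file is read in chunks and each chunk is passed to the hash object's update() method.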
MD5 is also widely used for storing passwords. For example, in a Unix system the user's password is hashed with MD5 (or a similar algorithm) and the result is stored in the file system. When the user logs in, the system computes the MD5 value of the password entered and compares it with the stored MD5 value to decide whether the password is correct. In this way the system can verify a login without knowing the user's clear-text password. This not only keeps passwords from being read by users with system administrator rights, but also makes password cracking somewhat harder.
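The login check described above can be sketched in Python as follows. This only mirrors the plain scheme in the paragraph; real Unix password schemes also add a salt and many iterations, which are left out here. The function names and the accounts table are hypothetical.

```python
import hashlib
import hmac


def digest_password(password: str) -> str:
    """Digest stored in place of the clear-text password (unsalted sketch)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_login(entered: str, stored_digest: str) -> bool:
    """Hash the entered password and compare it with the stored digest."""
    return hmac.compare_digest(digest_password(entered), stored_digest)


# Hypothetical account table: user name -> stored digest.
accounts = {"alice": digest_password("s3cret")}

print(check_login("s3cret", accounts["alice"]))  # correct password
print(check_login("guess", accounts["alice"]))   # wrong password
```

Only the digest is stored, so the system never needs to keep the clear-text password.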
For this reason, one of the methods most commonly used by attackers to recover passwords is the dictionary attack (known as "running a dictionary"). A dictionary can be obtained in two ways: as a list of strings collected from passwords in everyday use, or by generating every permutation and combination of characters. The MD5 value of each dictionary entry is computed first, and the target MD5 value is then looked up in the resulting table. Assume that the maximum password length is 8 bytes and that a password may contain only letters and digits, a total of 26+26+10 = 62 characters. Because characters may repeat, an exhaustive dictionary has 62^1 + 62^2 + ... + 62^8 entries, roughly 2.2 × 10^14. This way of storing passwords is widely used in Unix systems, and it is one important reason why Unix systems are more robust than ordinary operating systems.
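To make the size of such a dictionary concrete, the sketch below counts every string of length 1 to 8 over 62 characters, assuming characters may repeat (so length k contributes 62^k entries), and then shows the lookup step on a tiny word list. The word list and the names build_lookup and crack are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional

ALPHABET_SIZE = 26 + 26 + 10  # letters in both cases plus digits
MAX_LENGTH = 8

# Every string of length 1..8 over 62 characters, repetition allowed.
dictionary_size = sum(ALPHABET_SIZE ** k for k in range(1, MAX_LENGTH + 1))
print(dictionary_size)  # 221919451578090


def build_lookup(words: List[str]) -> Dict[str, str]:
    """Map each candidate's MD5 hex digest back to the candidate."""
    return {hashlib.md5(w.encode("utf-8")).hexdigest(): w for w in words}


def crack(target_digest: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the dictionary word with this digest, or None if absent."""
    return lookup.get(target_digest)


# Tiny hypothetical word list standing in for a real dictionary.
lookup = build_lookup(["123456", "password", "abc", "letmein"])
print(crack("900150983cd24fb0d6963f7d28e17f72", lookup))  # abc
```

The attack only works if the attacker has already obtained the target's stored MD5 value, and the full table is far too large to build casually.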
III. Algorithm description
In brief, MD5 processes the input in 512-bit blocks, and each block is divided into sixteen 32-bit sub-blocks. After a series of operations, the output of the algorithm consists of four 32-bit words, which are concatenated to form a 128-bit hash value.
In MD5 the message is first padded so that its length in bits, taken modulo 512, equals 448. The bit length is therefore extended to n*512+448, that is, n*64+56 bytes, where n is a non-negative integer. Padding is done as follows: a single 1 bit is appended to the message, followed by as many 0 bits as are needed to satisfy the condition above. Then the length of the message before padding, expressed as a 64-bit binary number, is appended to the result. After these two steps the length of the message is n*512+448+64 = (n+1)*512 bits, that is, an exact multiple of 512. This satisfies the length requirement of the processing that follows.
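The padding rule can be sketched in a few lines of Python for messages made of whole bytes. The function name md5_pad is an assumption for illustration; storing the 64-bit length with the low-order byte first follows RFC 1321 and is not spelled out in the text above.

```python
import struct


def md5_pad(message: bytes) -> bytes:
    """Pad message to a multiple of 64 bytes (512 bits)."""
    bit_length = (8 * len(message)) % (1 << 64)    # length is kept modulo 2**64
    padded = message + b"\x80"                     # a single 1 bit, then seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # zeros until length = 56 mod 64
    padded += struct.pack("<Q", bit_length)        # 64-bit length, low-order byte first
    return padded


for size in (0, 3, 55, 56, 64):
    print(size, len(md5_pad(b"x" * size)))
```

A 55-byte message still fits in one 64-byte block, while a 56-byte message needs a second block because the 1 bit and the length field no longer fit.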
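MD5 uses four 32-bit integer parameters called chaining variables: A=0x01234567, B=0x89abcdef, C=0xfedcba98 and D=0x76543210. These values list the bytes of each word with the low-order byte first, so as 32-bit integers they are 0x67452301, 0xefcdab89, 0x98badcfe and 0x10325476.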
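Implementations often trip over the byte order of these values. As a small sketch, assuming the sixteen initial bytes are 01 23 45 67 89 ab cd ef fe dc ba 98 76 54 32 10 as given in RFC 1321, Python's struct module shows the integers that result when each word is read with the low-order byte first:

```python
import struct

# The 16 initial bytes in the order they are listed.
initial_bytes = bytes.fromhex("0123456789abcdeffedcba9876543210")

# Read four 32-bit words, low-order byte first (little-endian).
A, B, C, D = struct.unpack("<4I", initial_bytes)
print(hex(A), hex(B), hex(C), hex(D))
# 0x67452301 0xefcdab89 0x98badcfe 0x10325476
```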
Once these four chaining variables are set, the algorithm enters its main loop, which consists of four rounds. The loop runs once for each 512-bit block of the message.
Copy the four chaining variables into four other variables: A to a, B to b, C to c, D to d.
The main loop has four rounds (MD4 has only three), and the rounds are very similar. Each round performs 16 operations. Each operation applies a nonlinear function to three of a, b, c and d, then adds the fourth variable, one sub-block of the message and a constant to the result. The result is then rotated left by a variable number of bits and added to one of a, b, c or d, and finally replaces one of a, b, c or d.
Let's look at the four nonlinear functions used in each operation (one for each round).
F(X,Y,Z) = (X & Y) | ((~X) & Z)
G(X,Y,Z) = (X & Z) | (Y & (~Z))
H(X,Y,Z) = X ^ Y ^ Z
I(X,Y,Z) = Y ^ (X | (~Z))
(& is AND, | is OR, ~ is NOT, ^ is XOR)
About these four functions: if the corresponding bits of X, Y and Z are independent and unbiased, then each bit of the result is also independent and unbiased.
F works bit by bit as a conditional: if X then Y, otherwise Z. The function H is the bitwise parity operator.
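The four functions translate almost directly into Python. The sketch below masks every result to 32 bits, because Python integers are unbounded and ~x would otherwise be negative; MASK32 is a name introduced here for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep every result within 32 bits


def F(x: int, y: int, z: int) -> int:
    """Bitwise conditional: where x has a 1 bit take y, otherwise z."""
    return ((x & y) | (~x & z)) & MASK32


def G(x: int, y: int, z: int) -> int:
    """Bitwise conditional on z: where z has a 1 bit take x, otherwise y."""
    return ((x & z) | (y & ~z)) & MASK32


def H(x: int, y: int, z: int) -> int:
    """Bitwise parity of the three inputs."""
    return (x ^ y ^ z) & MASK32


def I(x: int, y: int, z: int) -> int:
    return (y ^ (x | ~z)) & MASK32


y, z = 0x12345678, 0x9ABCDEF0
print(F(MASK32, y, z) == y)  # all-ones x selects y
print(F(0, y, z) == z)       # all-zeros x selects z
print(H(1, 1, 1), H(1, 1, 0))  # parity of three bits, then of two set bits
```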
Let Mj denote the j-th sub-block of the message block (j from 0 to 15), and let <<< s denote a left rotation by s bits. Then:
FF(a,b,c,d,Mj,s,ti) means a = b + ((a + F(b,c,d) + Mj + ti) <<< s)
GG(a,b,c,d,Mj,s,ti) means a = b + ((a + G(b,c,d) + Mj + ti) <<< s)
HH(a,b,c,d,Mj,s,ti) means a = b + ((a + H(b,c,d) + Mj + ti) <<< s)
II(a,b,c,d,Mj,s,ti) means a = b + ((a + I(b,c,d) + Mj + ti) <<< s)
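A single step can be sketched in Python as below, taking the round-1 operation as the example. All additions are modulo 2^32 and the shift is a rotation, not a plain shift. The names rotl32 and MASK32 are illustrative assumptions; the other three operations differ only in the nonlinear function they call.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value: int, s: int) -> int:
    """Rotate a 32-bit value left by s bits (0 < s < 32)."""
    value &= MASK32
    return ((value << s) | (value >> (32 - s))) & MASK32


def F(x: int, y: int, z: int) -> int:
    return ((x & y) | (~x & z)) & MASK32


def FF(a: int, b: int, c: int, d: int, mj: int, s: int, ti: int) -> int:
    """One round-1 step: returns the new value of a."""
    return (b + rotl32(a + F(b, c, d) + mj + ti, s)) & MASK32


print(hex(rotl32(0x80000000, 1)))      # the top bit wraps around to bit 0
print(FF(0, 0, 0, 0, 1, 7, 0))         # 1 rotated left by 7 bits
```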
These four rounds (64 steps) are:
First round
FF(a,b,c,d,M0,7,0xd76aa478)
FF(d,a,b,c,M1,12,0xe8c7b756)
FF(c,d,a,b,M2,17,0x242070db)
FF(b,c,d,a,M3,22,0xc1bdceee)
FF(a,b,c,d,M4,7,0xf57c0faf)
FF(d,a,b,c,M5,12,0x4787c62a)
FF(c,d,a,b,M6,17,0xa8304613)
FF(b,c,d,a,M7,22,0xfd469501)
FF(a,b,c,d,M8,7,0x698098d8)
FF(d,a,b,c,M9,12,0x8b44f7af)
FF(c,d,a,b,M10,17,0xffff5bb1)
FF(b,c,d,a,M11,22,0x895cd7be)
FF(a,b,c,d,M12,7,0x6b901122)
FF(d,a,b,c,M13,12,0xfd987193)
FF(c,d,a,b,M14,17,0xa679438e)
FF(b,c,d,a,M15,22,0x49b40821)
Second round
GG(a,b,c,d,M1,5,0xf61e2562)
GG(d,a,b,c,M6,9,0xc040b340)
GG(c,d,a,b,M11,14,0x265e5a51)
GG(b,c,d,a,M0,20,0xe9b6c7aa)
GG(a,b,c,d,M5,5,0xd62f105d)
GG(d,a,b,c,M10,9,0x02441453)
GG(c,d,a,b,M15,14,0xd8a1e681)
GG(b,c,d,a,M4,20,0xe7d3fbc8)
GG(a,b,c,d,M9,5,0x21e1cde6)
GG(d,a,b,c,M14,9,0xc33707d6)
GG(c,d,a,b,M3,14,0xf4d50d87)
GG(b,c,d,a,M8,20,0x455a14ed)
GG(a,b,c,d,M13,5,0xa9e3e905)
GG(d,a,b,c,M2,9,0xfcefa3f8)
GG(c,d,a,b,M7,14,0x676f02d9)
GG(b,c,d,a,M12,20,0x8d2a4c8a)
Third round
HH(a,b,c,d,M5,4,0xfffa3942)
HH(d,a,b,c,M8,11,0x8771f681)
HH(c,d,a,b,M11,16,0x6d9d6122)
HH(b,c,d,a,M14,23,0xfde5380c)
HH(a,b,c,d,M1,4,0xa4beea44)
HH(d,a,b,c,M4,11,0x4bdecfa9)
HH(c,d,a,b,M7,16,0xf6bb4b60)
HH(b,c,d,a,M10,23,0xbebfbc70)
HH(a,b,c,d,M13,4,0x289b7ec6)
HH(d,a,b,c,M0,11,0xeaa127fa)
HH(c,d,a,b,M3,16,0xd4ef3085)
HH(b,c,d,a,M6,23,0x04881d05)
HH(a,b,c,d,M9,4,0xd9d4d039)
HH(d,a,b,c,M12,11,0xe6db99e5)
HH(c,d,a,b,M15,16,0x1fa27cf8)
HH(b,c,d,a,M2,23,0xc4ac5665)
Fourth round
II(a,b,c,d,M0,6,0xf4292244)
II(d,a,b,c,M7,10,0x432aff97)
II(c,d,a,b,M14,15,0xab9423a7)
II(b,c,d,a,M5,21,0xfc93a039)
II(a,b,c,d,M12,6,0x655b59c3)
II(d,a,b,c,M3,10,0x8f0ccc92)
II(c,d,a,b,M10,15,0xffeff47d)
II(b,c,d,a,M1,21,0x85845dd1)
II(a,b,c,d,M8,6,0x6fa87e4f)
II(d,a,b,c,M15,10,0xfe2ce6e0)
II(c,d,a,b,M6,15,0xa3014314)
II(b,c,d,a,M13,21,0x4e0811a1)
II(a,b,c,d,M4,6,0xf7537e82)
II(d,a,b,c,M11,10,0xbd3af235)
II(c,d,a,b,M2,15,0x2ad7d2bb)
II(b,c,d,a,M9,21,0xeb86d391)
The constants ti are chosen as follows:
In step i, ti is the integer part of 4294967296*abs(sin(i)), where i is in radians. (4294967296 equals 2 to the 32nd power.)
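The 64 constants do not have to be typed in by hand. As a short sketch using Python's math module, the list below is built from the sine formula; note that the list T is 0-indexed here, so T[0] holds the constant for step 1.

```python
import math

# ti = integer part of 2**32 * abs(sin(i)) for i = 1..64, with i in radians.
T = [int(4294967296 * abs(math.sin(i))) for i in range(1, 65)]

print(hex(T[0]), hex(T[63]))  # 0xd76aa478 0xeb86d391
```

Comparing a few entries with the constants in the round tables is a quick way to catch a mistyped digit.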
After all 64 steps are done, a, b, c and d are added to A, B, C and D respectively. The algorithm then continues with the next block of data, and the final output is the concatenation of A, B, C and D.
Once you have implemented MD5 as described above, you can use the following values as a simple test to check your program for errors.
MD5("") = d41d8cd98f00b204e9800998ecf8427e
MD5("a") = 0cc175b9c0f1b6a831c399e269772661
MD5("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b
MD5("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") = d174ab98d277d9f5a5611c2c9f419d9f
MD5("12345678901234567890123456789012345678901234567890123456789012345678901234567890") = 57edf4a22be3c955ac49da2e2107b67a
If your MD5 implementation gives exactly these results for the test strings above, congratulations. My own program did not produce the same results the first time it compiled successfully.
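Putting the pieces together, the following is a compact teaching sketch of the whole algorithm in Python, not an optimized or authoritative implementation. It assumes byte-oriented input, the little-endian word and length encoding from RFC 1321, and the integer initial values 0x67452301, 0xefcdab89, 0x98badcfe and 0x10325476. Instead of writing out 64 calls, it rotates the roles of a, b, c and d in a loop, which is equivalent to the FF(a,b,c,d), FF(d,a,b,c), ... pattern in the tables; the index formulas reproduce the message order of each round. The result is compared with the standard library's hashlib.

```python
import hashlib
import math
import struct

MASK32 = 0xFFFFFFFF
T = [int(4294967296 * abs(math.sin(i))) for i in range(1, 65)]
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4
          + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)


def rotl32(x: int, s: int) -> int:
    return ((x << s) | (x >> (32 - s))) & MASK32


def md5_hex(message: bytes) -> str:
    # Padding: a 1 bit, zeros up to 56 mod 64 bytes, then the 64-bit bit length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", (8 * len(message)) & 0xFFFFFFFFFFFFFFFF)

    A, B, C, D = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    for offset in range(0, len(padded), 64):
        M = struct.unpack("<16I", padded[offset:offset + 64])
        a, b, c, d = A, B, C, D
        for i in range(64):
            if i < 16:    # round 1, function F
                f, j = (b & c) | (~b & d), i
            elif i < 32:  # round 2, function G
                f, j = (b & d) | (c & ~d), (5 * i + 1) % 16
            elif i < 48:  # round 3, function H
                f, j = b ^ c ^ d, (3 * i + 5) % 16
            else:         # round 4, function I
                f, j = c ^ (b | ~d), (7 * i) % 16
            new_b = (b + rotl32((a + f + M[j] + T[i]) & MASK32, SHIFTS[i])) & MASK32
            a, b, c, d = d, new_b, b, c
        # Add this block's result to the chaining variables.
        A, B, C, D = ((A + a) & MASK32, (B + b) & MASK32,
                      (C + c) & MASK32, (D + d) & MASK32)
    return struct.pack("<4I", A, B, C, D).hex()


for text in ("", "a", "abc", "message digest"):
    data = text.encode("ascii")
    print(repr(text), md5_hex(data), md5_hex(data) == hashlib.md5(data).hexdigest())
```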
IV. Security of MD5
Improvements of MD5 over MD4:
1. A fourth round is added.
2. Each step has a unique additive constant.
3. To make function G in the second round less symmetric, G is changed from (X & Y) | (X & Z) | (Y & Z) to (X & Z) | (Y & (~Z)).
4. Each step adds in the result of the previous step, which gives a faster avalanche effect.
5. The order in which the message sub-blocks are accessed in the second and third rounds is changed, so that the two patterns are less alike.
6. The left-rotation amounts in each round are approximately optimized to give a faster avalanche effect, and the amounts used in different rounds are distinct.
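The change to function G in item 3 can be illustrated with a small Python sketch. The MD4 version is a bitwise majority, so swapping its arguments never changes the result, whereas the MD5 version treats its arguments differently. The names g_md4 and g_md5 and the test values are chosen here for illustration; the three values together cover all eight combinations of input bits.

```python
MASK32 = 0xFFFFFFFF


def g_md4(x: int, y: int, z: int) -> int:
    """MD4 round-2 function: bitwise majority of x, y and z."""
    return ((x & y) | (x & z) | (y & z)) & MASK32


def g_md5(x: int, y: int, z: int) -> int:
    """MD5 round-2 function: z selects between x and y bit by bit."""
    return ((x & z) | (y & ~z)) & MASK32


x, y, z = 0x0F0F0F0F, 0x33333333, 0x55555555

# Majority is symmetric: any ordering of the arguments gives the same value.
print(g_md4(x, y, z) == g_md4(z, x, y) == g_md4(y, z, x))  # True

# The MD5 function is not: reordering the arguments changes the result.
print(g_md5(x, y, z) == g_md5(z, x, y))  # False
```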