patents.google.com

CN114048180B - A cloud storage file deduplication method based on link technology - Google Patents

  • ️Tue Jan 21 2025


Publication number
CN114048180B
Authority
CN
China
Prior art keywords
file
link
cloud storage
content
key value
Prior art date
2021-11-10
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111323766.XA
Other languages
Chinese (zh)
Other versions
CN114048180A (en
Inventor
冯晓军
贺晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-11-10
Filing date
2021-11-10
Publication date
2025-01-21
2021-11-10 Application filed by Focus Technology Co Ltd
2021-11-10 Priority to CN202111323766.XA
2022-02-15 Publication of CN114048180A
2025-01-21 Application granted
2025-01-21 Publication of CN114048180B
Status: Active
2041-11-10 Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cloud storage file deduplication method based on link technology. For each duplicate file, a link file is created and stored in cloud storage. Each link file stores link information that records the cloud storage position of the preceding duplicate file, so that the real file and its duplicate files are chained together in sequence; a user who accesses any duplicate file reaches the real file by traversing the link files. Because link files are small, and because the link information is kept in cloud storage rather than in local storage, the method suits the pattern of single-point calculation and multi-point independent access that file deduplication requires in a distributed environment. It effectively mitigates the excessive cloud storage usage and high bandwidth cost caused by large numbers of duplicate files, and it can be applied to a distributed system simply and conveniently.

Description

Cloud storage file deduplication method based on link technology

Technical Field

The invention belongs to the technical field of cloud storage and file systems, and particularly relates to a cloud storage file deduplication method based on a link technology.

Background

Cloud storage technology is now applied in many scenarios. For example, an operator subscribes to a cloud storage service to provide storage space for user pictures and videos; a user stores and shares files in cloud storage for file archiving and data disaster recovery; and a data service provider uses cloud storage as a distributed cache for static files. These scenarios share common characteristics: they occupy large amounts of storage space, access frequency and bandwidth consumption grow quickly, and the same file often exists in multiple copies.

Storing the largest number of files at the lowest cost reduces both storage and transmission costs overall. How to reduce file storage cost has therefore become an important consideration in cloud storage schemes.

Some technologies on the market already perform data deduplication for cloud storage or local storage, but most of them depend entirely on the results of local offline calculation: a hash value is calculated for each file, the hash values are compared, and the association information of duplicate files is recorded locally. When a file is requested, the access is redirected using the locally computed association information. For a distributed system with demanding requirements, this means that a database must be deployed at multiple points and the association data kept synchronized, which makes implementation difficult and costly.

Therefore, a cloud storage file deduplication method with better performance is needed.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a cloud storage file deduplication method with better performance.

To solve this technical problem, the invention provides a cloud storage file deduplication method based on link technology. For each duplicate file, a link file is created and stored in cloud storage. Each link file stores link information that records the cloud storage position of the preceding duplicate file, so that the real file and its duplicate files are chained together in sequence; a user who accesses any duplicate file reaches the real file by traversing the link files. Because link files are small, and because the link information is kept in cloud storage rather than in local storage, the method suits the pattern of single-point calculation and multi-point independent access that file deduplication requires in a distributed environment, effectively mitigates the excessive cloud storage usage and high bandwidth cost caused by large numbers of duplicate files, and can be applied to a distributed system simply and conveniently. The method comprises the following steps:

Step 1, configuring a local file information table, wherein the file information table is used for recording the file information of existing files in the cloud storage system, and the file information comprises the file content and the storage path;

Step 2, obtaining the file content of a file to be uploaded, parsing the file content and calculating a hash value of the file content; obtaining the local file information table and searching it for file information whose hash value is the same as the hash value of the file content; if none exists, judging that the file to be uploaded is a real file and executing step 3; if one exists, judging that the file to be uploaded is a duplicate file and executing step 4;

Step 3, calling the file uploading interface of the cloud storage system, uploading the file information of the real file to the cloud storage system, and recording the hash value of the file content and the file storage path in the file information table;

Step 4, creating the duplicate file to be uploaded as a link file, storing the file information of the link file in the cloud storage system, and recording the hash value of the file content and the file storage path of the link file in the file information table, wherein the hash value recorded for the link file is the hash value of the file content calculated in step 2;

Step 5, traversing the file information of existing files in the cloud storage system and performing deduplicated file access to obtain the file information of the real file in the cloud storage system.
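The upload side of the steps above (steps 1 to 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class `UploadSketch`, the in-memory maps that stand in for the file information table and the cloud storage system, and the simplified link content (the text "link" followed directly by the target path, with no version byte) are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UploadSketch {
    // Hypothetical stand-in for the local file information table:
    // content hash -> storage paths, in upload order.
    final Map<String, List<String>> fileInfoTable = new HashMap<>();
    // Hypothetical stand-in for the cloud storage system: storage path -> stored bytes.
    final Map<String, byte[]> cloud = new HashMap<>();

    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if a link file was stored, false if the real file was stored. */
    boolean upload(String path, byte[] content) {
        String hash = md5Hex(content); // step 2: hash the file content
        List<String> known = fileInfoTable.computeIfAbsent(hash, h -> new ArrayList<>());
        boolean duplicate = !known.isEmpty();
        if (duplicate) {
            // Step 4: store a small link file that points at the most recent upload.
            String target = known.get(known.size() - 1);
            cloud.put(path, ("link" + target).getBytes(StandardCharsets.UTF_8));
        } else {
            cloud.put(path, content); // step 3: store the real file
        }
        known.add(path); // steps 3 and 4 both record hash and path in the table
        return duplicate;
    }

    public static void main(String[] args) {
        UploadSketch s = new UploadSketch();
        byte[] picture = "same bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(s.upload("img/a.png", picture)); // stored as the real file
        System.out.println(s.upload("img/b.png", picture)); // stored as a link to img/a.png
    }
}
```

Because every upload is appended to the table, a third copy would link to `img/b.png` rather than to `img/a.png`, which produces the chain of link files that step 5 later traverses.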

The algorithm used to calculate the hash value in step 2 is MD5, and the cloud storage system in step 2 is Amazon S3.

Step 4 includes: marking the newly created link file with a preset text key-value pair; writing the magic value of the link file, the version number of the link file and the link position into the file content of the newly created link file, wherein the magic value and the version number are preset identifiers with a fixed number of bytes; obtaining from the file information table the file information of the existing files whose hash value equals the hash value of the file content calculated in step 2; sorting these existing files by upload time and taking the storage path of the most recently uploaded one as the link position; and storing the file information of the link file in the cloud storage system, with the text key-value pair stored as file metadata.

In step 4, the magic value, the version number and the link position of the link file are written in that order, starting from the head of the file content.

Step 5 includes: extracting the storage path from the file information access request, and obtaining the file content and the text key-value pair from the cloud storage system according to the storage path; comparing the obtained text key-value pair with the text key-value pair of the link file defined in step 4; if they are inconsistent, judging that the currently obtained file is a real file and returning the file information of the real file to the browser; if they are consistent, judging that the currently obtained file is a link file, extracting the link position from the file content, obtaining the file content and the text key-value pair from the cloud storage system again according to the storage path that the link position points to, and repeating the comparison until the file is judged to be a real file.

In step 4, the key of the text key-value pair is Content-Type and the value is a custom MIME type, so the text key-value pair of a link file is written as <Content-Type, custom MIME type>. The text key-value pair is stored through the file metadata function of Amazon S3, and the text key-value pair in the file metadata is obtained together with the file information when the file is fetched from Amazon S3.

In step 5, comparing the text key-value pair specifically includes:

Step 501, checking whether the text key-value pair has the key Content-Type and whether its value is consistent with the custom MIME type of step 4; if both conditions are satisfied, the file is judged to be a link file, and otherwise it is judged to be a real file;

Step 502, if the file is judged to be a link file, counting from the head of the link file content the fixed number of bytes occupied by the magic value and the version number of the link file, taking the byte position immediately after them (the fixed number of bytes plus 1) as the start position, and extracting all content from the start position onward as the link position.
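Steps 501 and 502 amount to one string comparison and one fixed-offset slice. The sketch below illustrates this under assumed values: the MIME type string, the 4-byte magic value and the 1-byte version are placeholders chosen for the example, since the source only requires that they be preset and of fixed length.

```java
import java.nio.charset.StandardCharsets;

final class LinkParseSketch {
    // Assumed values for illustration only.
    static final String LINK_MIME = "application/vnd.mic-link";
    static final int MAGIC_LEN = 4;   // e.g. "link" encoded as UTF-8
    static final int VERSION_LEN = 1; // a single version byte

    /** Step 501: a file is a link file only if its Content-Type equals the custom MIME type. */
    static boolean isLinkFile(String contentType) {
        return LINK_MIME.equals(contentType);
    }

    /** Step 502: everything after the fixed-length header is the link position. */
    static String linkPosition(byte[] content) {
        // Zero-based index of the byte at position (fixed byte count + 1).
        int start = MAGIC_LEN + VERSION_LEN;
        if (content.length < start) {
            throw new IllegalArgumentException("content is shorter than the fixed header");
        }
        return new String(content, start, content.length - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "link", then version byte 0x01, then the link position.
        byte[] content = "link\u0001images/a.png".getBytes(StandardCharsets.UTF_8);
        System.out.println(isLinkFile("image/png"));
        System.out.println(isLinkFile(LINK_MIME));
        System.out.println(linkPosition(content));
    }
}
```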

In steps 1 to 5, if the cloud storage system does not support storing a MIME type, whether a file is a link file is determined from the magic value of the link file instead. Specifically, a number of bytes equal to the length of the preset magic value is extracted from the head of the file content; if the extracted content matches the magic value preset in step 4, the file is judged to be a link file, and otherwise it is judged to be a real file.
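The magic-value fallback can be sketched as a prefix comparison on the file content. The class name and the magic value "link" are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MagicCheckSketch {
    // Assumed preset magic value; any fixed byte sequence would work the same way.
    static final byte[] MAGIC = "link".getBytes(StandardCharsets.UTF_8);

    /** Returns true if the content starts with the preset magic value. */
    static boolean hasLinkMagic(byte[] content) {
        if (content.length < MAGIC.length) {
            return false; // too short to contain the magic value
        }
        byte[] head = Arrays.copyOfRange(content, 0, MAGIC.length);
        return Arrays.equals(head, MAGIC);
    }

    public static void main(String[] args) {
        System.out.println(hasLinkMagic("link\u0001img/a.png".getBytes(StandardCharsets.UTF_8)));
        System.out.println(hasLinkMagic(new byte[] {(byte) 0x89, 'P', 'N', 'G'}));
    }
}
```

One consequence of this fallback is that a real file whose content happens to begin with the same bytes would be classified as a link file, so a longer or less common magic value lowers that risk.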

The beneficial effects achieved by the invention are as follows:

The method calculates the hash values of cloud storage files offline and creates a link file with very small content for each duplicate file, which reduces the cloud storage capacity used and the bandwidth consumed during file upload. Compared with the prior art, it reduces the link-relationship storage that a distributed system needs at access time and lowers the cost of local storage. Although the additional accesses needed to follow links might appear to reduce file access performance, link files are very small, so each extra access is fast and the performance loss is almost negligible.

Drawings

FIG. 1 is a schematic flow chart of a file deduplication method based on a link technology in an embodiment of the invention;

FIG. 2 is a schematic diagram of the structure of the link file content in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a file deduplication access process in accordance with an embodiment of the present invention;

FIG. 4 is a schematic diagram of a system structure for removing duplicate files in cloud storage based on a link technology in an embodiment of the present invention;

FIG. 5 is a system deployment diagram of cloud storage file deduplication based on a link technology in an embodiment of the present invention;

Detailed Description

The invention is further described below with reference to the drawings and exemplary embodiments.

In the embodiment of the invention, Amazon S3 is used as the cloud storage system. Amazon S3, in full Amazon Simple Storage Service, is a service provided by Amazon Web Services (AWS) that offers object storage through several communication interfaces. Amazon S3 can store objects of any type and is used for internet application storage, backup and restore, disaster recovery, data archiving, data lakes for analytics, and hybrid cloud storage. For example, when a user uploads a picture to the Amazon S3 cloud storage system so that it can be displayed quickly to target clients around the world, the Amazon S3 cloud storage system synchronizes the uploaded picture to each of its edge nodes to accelerate access by local clients.

In the embodiment of the invention shown in fig. 1, a method for implementing file deduplication based on link technology specifically includes:

Step 101, configuring a local file information table. The file information table records the file information of existing files in the Amazon S3 cloud storage system; the file information comprises the file content and the storage path, and each entry of the file information table comprises the hash value of the file content and the storage path of the file;

Step 102, calculating a hash value and judging whether the file to be uploaded is a duplicate file. Specifically, the file to be uploaded to the Amazon S3 cloud storage system is obtained, its file content is parsed, a hash value of the file content is calculated with the MD5 algorithm, and the file information table is searched for file information with the same hash value; if such file information exists, the file to be uploaded is judged to be a duplicate file and step 103 is executed;

Step 103, creating the duplicate file to be uploaded as a link file, storing the file information of the link file in the cloud storage system, and recording the hash value of the file content and the file storage path of the link file in the file information table, wherein the hash value recorded for the link file is the hash value of the file content calculated in step 102;

Specifically: first, the newly created link file is marked with a preset text key-value pair; second, the magic value of the link file, the version number of the link file and the link position are written into the file content of the newly created link file, wherein the magic value and the version number are preset identifiers with a fixed number of bytes, the file information of the existing files whose hash value equals the hash value calculated in step 102 is obtained from the file information table, and, with these files sorted by upload time, the storage path of the most recently uploaded one is taken as the link position; finally, the file information of the link file is stored in the cloud storage system and the text key-value pair is stored as file metadata;

The specific flow of creating the link file comprises the following steps:

Step 103-1, writing the content of the link file, which comprises writing the link file magic value, the link file version number and the link position in sequence;

Writing the magic value: the preset link file magic value is written at the head of the file, for example "link";

Writing the version number: the preset link file version number is written into the file content immediately after the magic value, for example "V1.0" after "link";

Writing the link position: the file information records with the same hash value as the duplicate file are obtained from the file information table and sorted by upload time from earliest to latest; the cloud storage position in the last record is taken as the link position and written after the link file version number;
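Selecting the link position is a "latest record with the same hash" lookup. The sketch below shows one way to express it; the `FileRecord` shape (hash, storage path, upload time in milliseconds) is a hypothetical row layout, since the source does not specify how upload time is stored in the file information table.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class LinkTargetSketch {
    /** Hypothetical shape of one row in the file information table. */
    static final class FileRecord {
        final String hash;
        final String storagePath;
        final long uploadedAtMillis;

        FileRecord(String hash, String storagePath, long uploadedAtMillis) {
            this.hash = hash;
            this.storagePath = storagePath;
            this.uploadedAtMillis = uploadedAtMillis;
        }
    }

    /** Returns the storage path of the most recently uploaded record with the given hash. */
    static Optional<String> linkPositionFor(String hash, List<FileRecord> table) {
        return table.stream()
                .filter(r -> r.hash.equals(hash))
                .max(Comparator.comparingLong((FileRecord r) -> r.uploadedAtMillis))
                .map(r -> r.storagePath);
    }

    public static void main(String[] args) {
        List<FileRecord> table = Arrays.asList(
                new FileRecord("h1", "img/a.png", 100L),
                new FileRecord("h1", "img/b.png", 200L),
                new FileRecord("h2", "img/x.png", 300L));
        // The new duplicate of "h1" would link to the latest "h1" record.
        System.out.println(linkPositionFor("h1", table).orElse("(no earlier upload)"));
    }
}
```

Taking the maximum upload time gives the same result as sorting from earliest to latest and reading the last record, without sorting the whole list.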

The code that writes the link file content is configured in the file uploading program, specifically:

// Requires: import java.nio.charset.StandardCharsets;
private byte[] createLinkFile(String fileLocation, byte version) {
    // Magic value definition
    final String MAGIC_NUMBER = "link";
    // Concatenate three groups of bytes in order: magic value bytes, version byte, file position bytes.
    // combineBytes is a helper whose definition is not shown in the source.
    return combineBytes(
        MAGIC_NUMBER.getBytes(StandardCharsets.UTF_8),
        new byte[]{version},
        fileLocation.getBytes(StandardCharsets.UTF_8));
}
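The listing above calls a helper, `combineBytes`, whose definition does not appear in the source. A minimal sketch of what such a helper might look like is given below; the varargs signature and the wrapper class name are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class CombineBytesSketch {
    /** Concatenates the given byte arrays in order (hypothetical helper). */
    static byte[] combineBytes(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] linkContent = combineBytes(
                "link".getBytes(StandardCharsets.UTF_8),
                new byte[] {1},
                "img/a.png".getBytes(StandardCharsets.UTF_8));
        // 4 magic bytes + 1 version byte + 9 position bytes
        System.out.println(linkContent.length);
    }
}
```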

FIG. 2 shows the structure of the link file content in the exemplary embodiment of the invention. The content begins with the link file magic value "LINK", immediately followed by the version number "V0.1"; all bytes after that are the position in cloud storage of the previous link file of the duplicate file, shown as "POSITION" in the figure. Because the magic value and the version number of the link file are preset values with a fixed number of bytes, all bytes after that fixed number of bytes are the link position;

Step 103-2, marking the newly created link file with a preset text key-value pair. While uploading a file to the Amazon S3 cloud storage system, the file uploading program adds a preset text key-value pair to the file, in which the key is Content-Type and the value is a MIME type; the MIME type is either a conventional MIME type value or a custom MIME type value;

For a non-link file, the MIME type is the public, conventional MIME type value that matches the file type, such as image/jpg or text/html. For a link file, the MIME type is a custom MIME type, such as "application/vnd.mic-link";

The code for marking the link file is configured in the file uploading program. In that code, the text key-value pair is written into the file uploading program, which adds a key-value pair to each file as it is uploaded; for a link file it writes <Content-Type, application/vnd.mic-link>. This key-value pair is used to judge whether a file is a link file when the file is fetched;
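The marking step can be illustrated as follows. This is a hypothetical sketch, not the source's own marking code: it deliberately avoids any cloud SDK and only builds the metadata key-value pair that an upload program would attach to the object; the class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class ContentTypeSketch {
    // Custom MIME type used to mark link files, as given in the description.
    static final String LINK_MIME = "application/vnd.mic-link";

    /**
     * Builds the metadata key-value pairs for one upload.
     * A link file gets the custom MIME type; any other file keeps its regular MIME type.
     */
    static Map<String, String> metadataFor(boolean isLinkFile, String regularMimeType) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", isLinkFile ? LINK_MIME : regularMimeType);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(metadataFor(false, "image/png"));
        System.out.println(metadataFor(true, "image/png"));
    }
}
```

With a real object store, this map would be passed to whatever upload call the store provides for setting the object's Content-Type metadata.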

A MIME type (Multipurpose Internet Mail Extensions, a public definition of internet file transfer types) has the form type/subtype: two character strings, the type and the subtype, separated by a '/'. Spaces are not allowed. The type represents an independent category that can be divided into sub-categories, and the subtype represents one of those sub-categories. Anyone can define a custom MIME type; to avoid duplication and misunderstanding, an individual or company defining one should prefix the subtype with vnd, followed by an identifier of the entity and then the file type. For example, in application/vnd.ms-excel, vnd identifies the MIME type as custom, ms stands for Microsoft, and excel is the actual file format.
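The type/subtype structure described above can be checked with a few string operations. The sketch below is a simplified illustration, not a full media-type parser: it enforces only the rules stated in the paragraph (exactly one '/', no spaces) and treats a subtype starting with "vnd." as a custom vendor type.

```java
final class MimeTypeSketch {
    /** Splits "type/subtype" into its two parts, rejecting spaces and malformed input. */
    static String[] split(String mime) {
        int slash = mime.indexOf('/');
        boolean malformed = slash <= 0                     // no '/' or empty type
                || slash == mime.length() - 1              // empty subtype
                || mime.indexOf('/', slash + 1) >= 0       // more than one '/'
                || mime.contains(" ");                     // spaces are not allowed
        if (malformed) {
            throw new IllegalArgumentException("not a type/subtype string: " + mime);
        }
        return new String[] {mime.substring(0, slash), mime.substring(slash + 1)};
    }

    /** Returns true if the subtype carries the vnd prefix used for custom types. */
    static boolean isVendorType(String mime) {
        return split(mime)[1].startsWith("vnd.");
    }

    public static void main(String[] args) {
        System.out.println(isVendorType("application/vnd.ms-excel"));
        System.out.println(isVendorType("image/png"));
    }
}
```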

Step 104, traversing the file information of existing files in the cloud storage system and performing deduplicated file access.

FIG. 3 shows the flow of deduplicated file access in the exemplary embodiment of the invention. The specific flow includes:

Step 104-1, a user requests, through a browser, access to the file information of an existing file in the Amazon S3 cloud storage system; the server of the cloud storage system responds to the request, obtains the storage path from the file information request, and fetches the file information and the file metadata from the Amazon S3 cloud storage system according to the storage path;

Step 104-2, extracting from the file metadata the text key-value pair whose key is Content-Type, and checking whether its value is a conventional MIME type value or the custom MIME type value, the custom MIME type value being, for example, application/vnd.mic-link as in step 103;

If the value in the text key-value pair is a conventional MIME type value, the currently fetched file is judged to be a real file;

If the value in the text key-value pair is application/vnd.mic-link, the fetched file is judged to be a link file; the file information of the link file is extracted, and the link position in the file content is parsed to obtain the storage path, in the Amazon S3 cloud storage system, of the file information of the previous duplicate file;

Step 104-3, fetching file information from the Amazon S3 cloud storage system according to the storage path obtained in step 104-2 and judging the file as in step 104-2, repeating until the file is judged to be a real file, and then returning the file content to the browser;
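The access loop of steps 104-1 to 104-3 can be sketched as follows. The in-memory `cloud` map, the `StoredObject` type, the 5-byte header (4-byte magic value plus 1-byte version) and the `maxHops` guard are assumptions added for illustration; the hop limit in particular is not part of the source and is included only so that a corrupted chain cannot loop forever.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class AccessSketch {
    static final String LINK_MIME = "application/vnd.mic-link";
    static final int HEADER_LEN = 5; // assumed: 4-byte magic value + 1-byte version

    /** Hypothetical stored object: Content-Type metadata plus content bytes. */
    static final class StoredObject {
        final String contentType;
        final byte[] content;

        StoredObject(String contentType, byte[] content) {
            this.contentType = contentType;
            this.content = content;
        }
    }

    // Hypothetical stand-in for the cloud storage system: storage path -> object.
    final Map<String, StoredObject> cloud = new HashMap<>();

    static byte[] linkContent(String target) {
        return ("link\u0001" + target).getBytes(StandardCharsets.UTF_8);
    }

    /** Follows link files from the requested path until a real file is reached. */
    byte[] resolve(String path, int maxHops) {
        String current = path;
        for (int hop = 0; hop <= maxHops; hop++) {
            StoredObject obj = cloud.get(current);
            if (obj == null) {
                throw new IllegalStateException("missing object: " + current);
            }
            if (!LINK_MIME.equals(obj.contentType)) {
                return obj.content; // conventional MIME type: this is the real file
            }
            // Link file: the bytes after the fixed header are the next storage path.
            current = new String(obj.content, HEADER_LEN,
                    obj.content.length - HEADER_LEN, StandardCharsets.UTF_8);
        }
        throw new IllegalStateException("too many link hops from " + path);
    }

    public static void main(String[] args) {
        AccessSketch s = new AccessSketch();
        s.cloud.put("a.png", new StoredObject("image/png", new byte[] {1, 2, 3}));
        s.cloud.put("b.png", new StoredObject(LINK_MIME, linkContent("a.png")));
        s.cloud.put("c.png", new StoredObject(LINK_MIME, linkContent("b.png")));
        System.out.println(s.resolve("c.png", 10).length); // content of a.png has 3 bytes
    }
}
```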

In the embodiment of the invention, the preset magic value and version number of the link file occupy a fixed number of bytes. For example, if the agreed magic value of the link file is "MIC-LINK", it occupies 8 bytes; the version number 1.0 then occupies 1 byte, for 9 bytes in total, so the link position runs from the 10th byte to the end of the content.

In the embodiment of the invention, when access workstations deployed in different locations access the Amazon S3 cloud storage system, each workstation checks the MIME type of the fetched file to confirm whether it is a link file. Because the Amazon S3 service itself can record a file type for each file, a custom MIME type, application/vnd.xx-link, is set for link files only. The type of a file is therefore known immediately when a picture is accessed. A normal picture keeps its original MIME type, e.g. image/png or image/jpg.

FIG. 4 shows the system structure for cloud storage file deduplication based on link technology in the exemplary embodiment of the invention. The system comprises a local file server, an application server, a cloud storage system, and a file information data table configured in the local machine room;

The local file server initiates upload requests for file information;

The application server contains a duplicate calculation module, a file deduplication module, a file uploading module and a file access module;

The duplicate calculation module calculates the hash value of the content of each file to be uploaded and consults the file information data table to compare hash values, in order to judge whether the file to be uploaded duplicates a file already in cloud storage;

The file deduplication module performs different operations depending on whether the file is a duplicate, and passes the result to the file uploading module;

The file uploading module calls the file uploading interface of the cloud storage system to upload real files and link files to the cloud storage system;

The file access module fetches files from the cloud storage system according to the file storage position;

The cloud storage system is the Amazon S3 cloud storage system and stores the real files and the link files corresponding to duplicate files;

At least one uploading server is deployed in the local machine room, and multiple uploading servers may upload files concurrently;

The content of each link file stores the cloud storage position of the link file of the previous duplicate file, so the link information connecting the storage positions of the real file and its duplicate files is kept in the cloud. File access deduplication therefore no longer depends on the local file information data table, the association data synchronization step in a distributed environment is eliminated, and deduplicated access in the cloud storage system is achieved in a simpler, lower-cost way.

FIG. 5 shows the system deployment for cloud storage file deduplication based on link technology in the exemplary embodiment of the invention. At the deployment level, the system comprises file uploading servers, a file record table, a cloud storage system and access agents. Multiple uploading servers may upload picture files concurrently, and they share one local file record table. Using this local file record table, an uploading server compares MD5 values to determine whether a particular picture has already been uploaded. The access agent, that is, the access service in front of cloud storage, does not depend on the local file record table, because the link information has already been uploaded to the cloud in the form of link files. Multiple access agents may operate concurrently.
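When several uploading servers share one file record table, the "look up the previous upload, then register this one" step has to be atomic, or two concurrent uploads of the same picture could both be treated as the real file. The source does not say how the shared table is implemented; the sketch below is one possible in-process illustration that uses `ConcurrentHashMap.put`, which atomically stores the new path and returns the one it replaced. A table shared across machines would need an equivalent atomic operation in its database.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class RecordTableSketch {
    // Hypothetical shared record table: MD5 of content -> most recently uploaded path.
    private final ConcurrentMap<String, String> latestPathByHash = new ConcurrentHashMap<>();

    /**
     * Atomically registers this upload as the latest one for the hash.
     * Returns null if no earlier upload exists (store the real file),
     * otherwise the path of the previous upload (store a link file pointing to it).
     */
    String registerAndGetPrevious(String md5, String path) {
        return latestPathByHash.put(md5, path);
    }

    public static void main(String[] args) {
        RecordTableSketch table = new RecordTableSketch();
        System.out.println(table.registerAndGetPrevious("h", "img/a.png")); // no earlier upload
        System.out.println(table.registerAndGetPrevious("h", "img/b.png")); // previous path
    }
}
```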

The invention provides a cloud storage file deduplication method based on a link technology, which has the beneficial effects that:

By calculating the hash values of cloud storage files offline, the method creates a link file with very small content for each duplicate file, which reduces the cloud storage capacity used and the bandwidth consumed during file upload. Compared with the prior art, it reduces the link-relationship storage that a distributed system needs at access time and lowers the cost of local storage. Although the additional accesses needed to follow links might appear to reduce file access performance, link files are very small, so each extra access is fast and the performance loss is almost negligible.

The above embodiments are not intended to limit the present invention in any way, and all other modifications and applications of the above embodiments which are equivalent to the above embodiments fall within the scope of the present invention.

Claims (6)

1. A cloud storage file deduplication method based on a link technology is characterized by comprising the following steps of 1, configuring a local file information table, wherein the file information table is used for recording file information of existing files in a cloud storage system, and the file information comprises file contents and storage paths;

Step 2, acquiring file content of a file to be uploaded, analyzing the file content, and calculating a hash value of the file content, acquiring a local file information table, and searching whether file information consistent with the hash value of the file content exists in the file information table, if not, judging that the file to be uploaded is a real file, executing step 3, and if so, judging that the file to be uploaded is a repeated file, and executing step 4;

Step 3, calling a file uploading interface of the cloud storage system, uploading file information of the real file into the cloud storage system, and recording a hash value of file content and a file storage path in a file information table;

step 4, newly creating a repeated file to be uploaded at the time as a link file, storing file information of the link file into a cloud storage system, and recording a hash value of file content of the link file and a file storage path in a file information table, wherein the hash value of the file content of the link file is the hash value of the file content obtained by calculation in the step 2;

The step 4 comprises marking a newly built link file with a preset text key value pair, writing a magic value, a version number and a link position of the link file into the file content of the newly built link file, wherein the magic value and the version number of the link file are preset identifiers with fixed byte numbers, acquiring file information of an existing file with the same hash value of the file content in the step 2 from a file information table, and taking a storage path of the last uploaded existing file as the link position according to the uploading time sequence of the existing file;

Step 5, traversing the file information of the existing files in the cloud storage system and executing deduplicated file access to obtain the file information of the real files in the cloud storage system;

Step 5 comprises: extracting the storage path from a file information access request; acquiring the file content and the text key-value pair from the cloud storage system according to the storage path; comparing whether the text key-value pair is consistent with the text key-value pair of the link file in step 4; if they are inconsistent, determining that the currently acquired file is a real file and returning the file information of the real file to the browser; if they are consistent, determining that the currently acquired file is a link file, extracting the link position from the file content, acquiring the file content and the text key-value pair from the cloud storage system according to the storage path pointed to by the link position, and repeating the comparison until the file is determined to be a real file.
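The flow of steps 1 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `InMemoryStore` is a hypothetical stand-in for the cloud storage system, and `LINK_CONTENT_TYPE`, `MAGIC`, and `VERSION` are placeholder values that the claim leaves unspecified.

```python
import hashlib

# All three constants are hypothetical; the claim only says they are preset.
LINK_CONTENT_TYPE = "application/x-dedup-link"  # custom MIME type marking link files
MAGIC = b"LNK1"                                 # magic value, fixed at 4 bytes here
VERSION = b"\x01"                               # version number, fixed at 1 byte here


class InMemoryStore:
    """Stand-in for a cloud storage system: path -> (content, metadata)."""

    def __init__(self):
        self.objects = {}

    def put(self, path, content, metadata=None):
        self.objects[path] = (content, dict(metadata or {}))

    def get(self, path):
        return self.objects[path]


class Deduplicator:
    def __init__(self, store):
        self.store = store
        # Step 1: local file information table, hash -> storage paths in upload order.
        self.table = {}

    def upload(self, path, content):
        # Step 2: hash the content and look it up in the table.
        digest = hashlib.md5(content).hexdigest()
        existing = self.table.get(digest)
        if not existing:
            # Step 3: no match, so store the real file as-is.
            self.store.put(path, content)
            kind = "real"
        else:
            # Step 4: store a small link file that points at the most
            # recently uploaded file with the same hash.
            body = MAGIC + VERSION + existing[-1].encode("utf-8")
            self.store.put(path, body, {"Content-Type": LINK_CONTENT_TYPE})
            kind = "link"
        self.table.setdefault(digest, []).append(path)
        return kind

    def read(self, path):
        # Step 5: follow link positions until a real file is reached.
        content, metadata = self.store.get(path)
        while metadata.get("Content-Type") == LINK_CONTENT_TYPE:
            path = content[len(MAGIC) + len(VERSION):].decode("utf-8")
            content, metadata = self.store.get(path)
        return content
```

Because the link position is the path of the most recently uploaded file with the same hash, that target may itself be a link file, which is why `read` loops rather than dereferencing once.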

2. The cloud storage file deduplication method based on link technology according to claim 1, wherein the algorithm used to calculate the hash value in step 2 is MD5, and the cloud storage system in step 2 is Amazon S3.
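As an illustration of the MD5 calculation named in claim 2, the sketch below hashes file content in chunks so that large uploads need not be held in memory at once. The function name `md5_of_stream` and the chunk size are assumptions made for this example.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    # iter() with a sentinel stops when read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage: two uploads with identical content produce the same table key.
first = md5_of_stream(io.BytesIO(b"same content"))
second = md5_of_stream(io.BytesIO(b"same content"))
```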

3. The cloud storage file deduplication method based on link technology according to claim 1, wherein in step 4, the magic value, the version number, and the link position of the link file are written into the file content of the newly created link file in that order, starting from the head of the file content.
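The byte layout described in claim 3 might look like the sketch below. The concrete magic value, the version bytes, and their sizes are hypothetical; the claim only requires that they be preset identifiers with a fixed number of bytes.

```python
MAGIC = b"DLNK"        # hypothetical magic value, fixed at 4 bytes
VERSION = b"\x00\x01"  # hypothetical version number, fixed at 2 bytes


def build_link_content(link_position: str) -> bytes:
    """Lay out a link file as: magic value, version number, link position."""
    return MAGIC + VERSION + link_position.encode("utf-8")
```

Keeping the two fixed-width fields at the head means a reader can find the link position without a length prefix: everything after the header is the path.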

4. The cloud storage file deduplication method based on link technology according to claim 1, wherein in step 4, the key of the text key-value pair is Content-Type and the value is a custom MIME type, so that the text key-value pair of the link file is expressed as <Content-Type, custom MIME type>; the text key-value pair is stored through the file metadata function of Amazon S3, and the text key-value pair in the file metadata is acquired at the same time as the file information is acquired from Amazon S3.
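One way to store and read back the Content-Type marker on Amazon S3 is sketched below. It assumes a boto3-style S3 client is passed in (for example one created with `boto3.client("s3")`) and uses its `put_object` and `get_object` calls. The MIME type string and both function names are hypothetical.

```python
LINK_MIME_TYPE = "application/x-cloud-link"  # hypothetical custom MIME type


def put_link_object(s3_client, bucket, key, link_content):
    """Upload a link file and mark it through the Content-Type metadata."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=link_content,
        ContentType=LINK_MIME_TYPE,
    )


def get_object_with_type(s3_client, bucket, key):
    """Fetch content and Content-Type together in a single request."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read(), response.get("ContentType")
```

Because `get_object` returns the metadata alongside the body, checking whether an object is a link file costs no extra round trip.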

5. The cloud storage file deduplication method based on link technology according to claim 1, wherein in step 5, comparing the text key-value pairs specifically comprises:

Step 501, checking whether the key of the text key-value pair is Content-Type and whether its value is consistent with the custom MIME type in step 4; if both conditions are satisfied, determining that the file is a link file; otherwise, determining that the file is a real file;

Step 502, if the file is determined to be a link file, counting, from the head of the link file content, the fixed number of bytes occupied by the magic value and the version number of the link file, taking the position at that fixed number of bytes plus 1 as the start position, and extracting all file content from the start position onward as the link position.
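Steps 501 and 502 can be sketched as two small functions. The MIME type and the field sizes are hypothetical placeholders for the preset values. Note that the claim counts positions from 1, so "fixed number of bytes plus 1" corresponds to the zero-based slice offset `HEADER_SIZE` in Python.

```python
LINK_MIME_TYPE = "application/x-cloud-link"  # hypothetical custom MIME type
MAGIC_SIZE = 4                               # hypothetical size of the magic value
VERSION_SIZE = 2                             # hypothetical size of the version number
HEADER_SIZE = MAGIC_SIZE + VERSION_SIZE


def is_link_file(metadata: dict) -> bool:
    """Step 501: a file is a link file only if Content-Type matches the custom type."""
    return metadata.get("Content-Type") == LINK_MIME_TYPE


def extract_link_position(content: bytes) -> str:
    """Step 502: everything after the fixed-size header is the link position."""
    return content[HEADER_SIZE:].decode("utf-8")
```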

6. The cloud storage file deduplication method based on link technology according to claim 1, wherein in steps 1 to 5, if the cloud storage system does not support MIME type storage, whether a file is a link file is determined from the magic value of the link file; specifically, according to the number of bytes occupied by the preset magic value of the link file, content of the same number of bytes is extracted from the head of the file content; if the extracted content is consistent with the magic value of the link file preset in step 4, the file is determined to be a link file; otherwise, the file is determined to be a real file.
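The fallback check in claim 6 reduces to a prefix comparison, as in the sketch below. The magic value shown is a hypothetical placeholder for the preset one.

```python
MAGIC = b"DLNK"  # hypothetical preset magic value


def is_link_by_magic(content: bytes) -> bool:
    """Compare the first len(MAGIC) bytes of the content with the magic value."""
    return content[:len(MAGIC)] == MAGIC
```

A real file whose content happens to begin with the same bytes would be classified as a link file under this check, so the magic value should be chosen to make such a collision unlikely.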

CN202111323766.XA 2021-11-10 2021-11-10 A cloud storage file deduplication method based on link technology Active CN114048180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111323766.XA CN114048180B (en) 2021-11-10 2021-11-10 A cloud storage file deduplication method based on link technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111323766.XA CN114048180B (en) 2021-11-10 2021-11-10 A cloud storage file deduplication method based on link technology

Publications (2)

Publication Number Publication Date
CN114048180A (en) 2022-02-15
CN114048180B (en) 2025-01-21

Family

ID=80207867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111323766.XA Active CN114048180B (en) 2021-11-10 2021-11-10 A cloud storage file deduplication method based on link technology

Country Status (1)

Country Link
CN (1) CN114048180B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136243A (en) * 2011-11-29 2013-06-05 中国电信股份有限公司 File system duplicate removal method and device based on cloud storage
CN103714123A (en) * 2013-12-06 2014-04-09 西安工程大学 Methods for deleting duplicated data and controlling reassembly versions of cloud storage segmented objects of enterprise

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633022B2 (en) * 2012-12-28 2017-04-25 Commvault Systems, Inc. Backup and restoration for a deduplicated file system
US10691643B2 (en) * 2017-11-20 2020-06-23 International Business Machines Corporation Deduplication for files in cloud computing storage and communication tools
CN108400970B (en) * 2018-01-20 2020-10-02 西安电子科技大学 Similar data message locking, encrypting and de-duplicating method in cloud environment and cloud storage system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136243A (en) * 2011-11-29 2013-06-05 中国电信股份有限公司 File system duplicate removal method and device based on cloud storage
CN103714123A (en) * 2013-12-06 2014-04-09 西安工程大学 Methods for deleting duplicated data and controlling reassembly versions of cloud storage segmented objects of enterprise

Also Published As

Publication number Publication date
CN114048180A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN106250270B (en) 2019-05-21 A kind of data back up method under cloud computing platform
US8843454B2 (en) 2014-09-23 Elimination of duplicate objects in storage clusters
US8380673B2 (en) 2013-02-19 Storage system
US10430443B2 (en) 2019-10-01 Method for data maintenance
US10210191B2 (en) 2019-02-19 Accelerated access to objects in an object store implemented utilizing a file storage system
CN108255647B (en) 2021-03-23 High-speed data backup method under samba server cluster
CN106294585A (en) 2017-01-04 A kind of storage method under cloud computing platform
CN109522283B (en) 2021-09-21 Method and system for deleting repeated data
US9367569B1 (en) 2016-06-14 Recovery of directory information
US9547706B2 (en) 2017-01-17 Using colocation hints to facilitate accessing a distributed data storage system
US8095678B2 (en) 2012-01-10 Data processing
US11221921B2 (en) 2022-01-11 Method, electronic device and computer readable storage medium for data backup and recovery
CN104951474A (en) 2015-09-30 Method and device for acquiring MySQL binlog incremental logs
CN102495772B (en) 2013-10-30 Characteristic-based terminal program cloud backup and recovery methods
CN107590019B (en) 2021-03-16 A method and device for data storage
CN114629921B (en) 2023-11-17 Cloud platform and bucket management method for object storage service provided by cloud platform
CN102708165A (en) 2012-10-03 Method and device for processing files in distributed file system
US10684920B2 (en) 2020-06-16 Optimized and consistent replication of file overwrites
CA2710754C (en) 2016-10-04 Systems and methods for platform-independent data file transfers
CN114048180B (en) 2025-01-21 A cloud storage file deduplication method based on link technology
CN111414239B (en) 2023-01-31 Virtual machine mirror image management method, system and medium based on kylin cloud computing platform
JP2005063374A (en) 2005-03-10 Data management method, data management device, program for the same, and recording medium
CN103902577A (en) 2014-07-02 Method and system for searching and locating resources
CN103714089A (en) 2014-04-09 Method and system of rolling back cloud database
CN111770158B (en) 2023-09-19 Cloud platform recovery method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
2022-02-15 PB01 Publication
2022-03-04 SE01 Entry into force of request for substantive examination
2025-01-21 GR01 Patent grant