Starfish helps tame large amounts of unstructured data

"What data do you have? Can I access it?" These questions may seem simple to any data-driven business. However, when billions of files are spread across the storage of a parallel file system, they become very difficult to answer. This is where Starfish Storage shines: its unique data discovery tools are already used by top HPC sites in many countries and by a growing number of GenAI shops.

There are some contradictions in the world of high-end unstructured data management: the larger the file system, the less you know about it. The more bytes you have, the more useless those bytes become. The closer we get to using unstructured data to achieve brilliant, amazing things, the greater the challenge of file access.

This is what Starfish Storage founder Jacob Farmer has encountered time and time again since he started the company 10 years ago.

"Everyone wants to mine their files, but they run into the brutal truth: they don't know what they have, most of what they have is garbage, and they can't even get at those files to do anything with them," he said.

Many big data challenges have been solved over the years. The physical limitations of data storage have been largely eliminated, enabling organizations to store petabytes and even exabytes of data across distributed file systems and object storage. A lot of processing power and network bandwidth is available. Advancements in machine learning and artificial intelligence have lowered the barrier to entry for HPC (high-performance computing) workloads. The generative artificial intelligence (GenAI) revolution is in full swing, and well-respected AI researchers are talking about creating artificial general intelligence (AGI) within a decade.

We've benefited from all these advancements, but we still don't know what's in the data or who can access it.

"The hardest part for me was explaining that these issues hadn't been resolved," Farmer continued. "People think it's a fact of life, so they don't even try to do anything about it. They won't venture into their unstructured data because it's widely considered uncharted territory."

Farmer elaborated on the nature of the unstructured data problem and Starfish's solution.

"The question we're trying to answer is, 'What the hell are these files?'" he said. "When it comes to file management, you can't handle billions of files unless you have powerful tools. You can't do anything."

Run a search on your desktop file system and it may take a few minutes to find a particular file. Try the same thing on a parallel file system made up of billions of individual files that take up petabytes of storage space, and you may have to wait quite a long time.

Most Starfish customers are actively using large amounts of data stored in parallel file systems, such as Lustre, GPFS/Spectrum Scale, HDFS, XFS, and ZFS, as well as the file systems offered by storage vendors such as VAST Data, Weka, and Hammerspace.

Many of Starfish's customers work in high-performance computing or artificial intelligence. They include U.S. national laboratories such as Lawrence Livermore and Sandia, research universities such as Harvard, Yale, and Brown, government organizations such as the Centers for Disease Control and Prevention (CDC) and the National Institutes of Health (NIH), research hospitals such as Cedars-Sinai Children's Hospital and Duke Health, animation companies such as Disney and DreamWorks, and most of the top pharmaceutical research companies. Ten years on, Starfish's customers manage more than 1 exabyte of data.

These organizations need access to data for HPC and AI workloads, but in many cases that data is spread across billions of individual files. The file system itself often doesn't provide tools to tell you what's in a file, when it was created, and who controls access to it. Files may have timestamps, but those can be easily changed.

The problem is that this metadata is critical to deciding whether the file should be retained, moved to an archive running on low-cost storage, or deleted entirely.
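To illustrate the point about timestamps, here is a minimal Python sketch using only the standard library. It reads the metadata a POSIX-style file system records for a file and then overwrites the modification time. The helper names `describe` and `backdate` are hypothetical and chosen for this example; they are not part of any Starfish tool.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe(path):
    """Return the basic metadata the file system records for one file."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "owner_uid": st.st_uid,
        "mode": stat.filemode(st.st_mode),
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    }


def backdate(path, when):
    """Overwrite the access and modification times with an arbitrary moment."""
    ts = when.timestamp()
    os.utime(path, (ts, ts))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"simulation output")
    print("before:", describe(handle.name)["modified"])
    backdate(handle.name, datetime(2001, 1, 1, tzinfo=timezone.utc))
    print("after: ", describe(handle.name)["modified"])
    os.unlink(handle.name)
```

Any user who owns a file can do this, which is why a modification time alone is a weak basis for retention or deletion decisions.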

Starfish Method

Starfish takes a metadata-driven approach to tracking the origin date of each file, the type of data contained in the file, and who the owner is. The product uses the Postgres database to maintain an index of all files in the file system and how they have changed over time. When an action needs to be taken on a set of files (e.g., deleting all files older than a year), Starfish's tagging system makes it easy for administrators with the appropriate permissions to do so.
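The idea of a database-backed file index with tags can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's SQLite module as a stand-in for Postgres, and the schema, the tag name, and every function name below are hypothetical rather than Starfish's actual design.

```python
import os
import sqlite3
import time

# Hypothetical schema: one row per file, plus free-form tags attached to paths.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,
    size_bytes INTEGER NOT NULL,
    mtime REAL NOT NULL,
    owner_uid INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS tags (
    path TEXT NOT NULL REFERENCES files(path),
    tag TEXT NOT NULL,
    PRIMARY KEY (path, tag)
);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the index database. SQLite stands in for Postgres here."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def index_tree(conn, root):
    """Record one row per non-directory entry under root; return the row count."""
    rows = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so dangling symlinks do not raise
            rows.append((path, st.st_size, st.st_mtime, st.st_uid))
    conn.executemany("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def tag_older_than(conn, max_age_days, tag, now=None):
    """Attach tag to every indexed file last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    conn.execute(
        "INSERT OR IGNORE INTO tags (path, tag) "
        "SELECT path, ? FROM files WHERE mtime < ?",
        (tag, cutoff),
    )
    conn.commit()


def paths_with_tag(conn, tag):
    """Return the sorted list of paths carrying the given tag."""
    rows = conn.execute("SELECT path FROM tags WHERE tag = ? ORDER BY path", (tag,))
    return [path for (path,) in rows]
```

With an index like this, "all files older than a year" becomes one query, for example `tag_older_than(conn, 365, "delete-candidate")` followed by `paths_with_tag(conn, "delete-candidate")`, and an administrator can review the tagged set before anything is actually removed.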

There is another paradox that arises when it comes to tracking unstructured data. "You have to know what a file is to know what a file is," Farmer said. "Usually you have to open the file and look at it, or you need user input, or you need some other API to tell you what the file is. As a result, our entire metadata system allows us to understand what a file is at a deeper level."

Starfish isn't the only player in the field. There are competing unstructured data management companies, as well as data catalog vendors that focus primarily on structured data. However, the biggest competitors are HPC sites that think they can build file catalogs out of scripts. Some of these script-based methods work for a while, but when they hit the upper echelons of file management, they become unmanageable.

"Customers with 20 ZFS servers may have their own way of doing what we do. No single file system is that big, and they probably know where to look, so they might be able to do it with traditional tools," he said. "But when the file system gets big enough, the environment becomes diverse enough, or people start spreading files over a wide enough area, we become the map of where the files are, along with the tools to do whatever you need to do."

There are also many edge cases that make the job harder. For example, researchers can move data and rename directories, leaving broken links behind. Some applications may generate 10,000 empty directories or create more directories than actual files.
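Two of these edge cases, broken links and empty directories, can be detected in a single walk of the tree. The following Python sketch uses only the standard library; the function name `find_anomalies` is hypothetical, and a crawler for billions of files would need far more than this (parallelism, error handling, incremental updates).

```python
import os


def find_anomalies(root):
    """Walk root once and report dangling symlinks and empty directories."""
    broken_links, empty_dirs = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory with no entries at all counts as empty.
        if not dirnames and not filenames:
            empty_dirs.append(dirpath)
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # Dangling symlink: the link exists but its target does not.
            if os.path.islink(path) and not os.path.exists(path):
                broken_links.append(path)
    return sorted(broken_links), sorted(empty_dirs)


if __name__ == "__main__":
    links, empties = find_anomalies(".")
    print(f"{len(links)} broken links, {len(empties)} empty directories")
```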

"If you hit the market with a traditional product that's built for the business, it's going to crash," Farmer said. ”

Unstructured file management projects

Farmer saw this challenge as an engineering problem, and he and his team devised a solution to it.

Postgres-based indexing allows Starfish to maintain a complete history of the file system, so customers can see exactly how it has changed over time. The only way to do that, Farmer says, is to repeatedly scan the file system and compare the results to the previous state. At Lawrence Livermore National Laboratory, the Starfish catalog is about 30 seconds behind the production file system. "So we're doing a very, very close sync," he said.
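The scan-and-compare idea can be sketched in Python as follows. This is a simplified illustration, not Starfish's crawler: it holds both snapshots in memory and compares only size and modification time, and the names `snapshot` and `diff` are hypothetical.

```python
import os


def snapshot(root):
    """Map each non-directory path under root to its (size, mtime_ns) right now."""
    state = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                continue  # the file vanished between listing and stat
            state[path] = (st.st_size, st.st_mtime_ns)
    return state


def diff(previous, current):
    """Compare two snapshots and list added, removed, and changed paths."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(p for p in shared if previous[p] != current[p]),
    }
```

Calling `snapshot` on a schedule and storing each `diff` result yields a change history. The lag between the catalog and the live file system is then bounded by how long one full scan takes, which is why scan speed matters so much at this scale.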

Some file systems are more difficult to handle than others. For example, Starfish leverages the internal policy engine exposed by IBM's GPFS/Spectrum Scale file system to gain insights and inform the Starfish crawlers. However, it turned out to be very difficult to obtain this data from Lustre.

"Lustre doesn't give up its metadata easily. It is not a high metadata performance system," Farmer said. "Lustre is the hardest of all file systems to crawl, and yet we get our best results on it, because we were able to use some other Lustre mechanisms to build a super powerful crawler."

Some commercial products make it easier to track data. For example, Weka makes it easier to expose metadata, and VAST has its own data catalog that somewhat replicates the work done by Starfish. In that case, Starfish integrates with the service VAST provides to help customers get what they need. "We deal with everything, but in many cases, we do specific engineering to take advantage of the nuances of a particular file system," Farmer said.

Get the data

Accessing structured data is usually straightforward. Someone in the line of business typically owns the data in Snowflake or Teradata and grants or denies access to it based on the company's policies.

In the world of unstructured data, that's not how it usually works. The file system is considered part of the IT infrastructure, so the person who controls access to the files is the storage administrator or system administrator. This creates problems for researchers and data scientists who want to access this data, Farmer said.

"The only way to access all your files, or to help yourself to analyzing files that don't belong to you, is to have root access to the file system, which is not possible in most organizations," he said. "I have to sell the product to the people who run the infrastructure because they have root privileges, so they decide who has access to which files."

To some extent, Farmer says, it's puzzling that organizations still rely on outdated, 50-year-old processes to access what may be their most important data, but that's just the way it is. "It's kind of funny that everybody is stuck in an outdated pattern," he said. "It's both their strength and their weakness."

On the surface, Starfish is a data discovery tool and data catalog for unstructured data, but it can also serve as an interface between the data scientists who want to access the data and the administrators with root access who can provide it. Without an intermediary like Starfish, requests to access, move, archive, and delete data are handled much less efficiently.

"The POSIX file system is a very limited tool. It's 50 years old," Farmer said. "We've figured out ways to work within those constraints to make it easy for people to do things that would otherwise require a request by email or phone call or whatever. We can seamlessly use file system metadata to drive processes."

We may be on the cusp of developing artificial intelligence with superhuman cognitive abilities, one that lets IT evolve faster than it does now and changes the fate of the world forever. Just don't forget to be friendly when you ask the storage administrator for access to the data.