Data Is the Secret Sauce in an Effective Machine Learning Recipe

Importance of data in Machine Learning for effective cyber-defense

In the modern world of automation, the replacement of humans with machines is discussed daily in the media. Some of the most talked-about use cases include driverless cars and automated checkout at retail outlets. There are also less obvious use cases, such as Google, Facebook and Amazon recommending products that you are most likely to be interested in purchasing, based on data they collect while you shop or browse online, or Netflix putting the shows or movies that you are most likely to watch at the top of your list in its app. The primary element all these companies use to accomplish this feat is data.

Data, when aligned to real business needs, can provide insight that is not otherwise possible. Let’s discuss why and how data plays such an important role in building automated solutions suited to customer needs, specifically effective defense techniques in the world of cybersecurity.

A Thief in the Night

Imagine a scenario where a thief is breaking into your house, much as an attacker enters your network. The thief works under many constraints, including the limited time on hand, the need to find where the valuables are kept, and the weight and size of the valuables he or she plans to take. The thief is likely to use abnormal means to enter the house, such as breaking the door or climbing through an open window, and will search for valuables at random once inside. All of this has to happen quickly before the thief exits.

The behavior of the thief is quite different from that of the normal resident of the house. The normal inhabitant is likely to repeat most of his or her activities to a certain degree. These regular activities can be learned and mathematically defined as a set of expected behaviors of the normal occupant of the premises. Similarly, some of the characteristics of a thief can be learned from experience and modeled so that a mitigation strategy can be put in place. One real-life mitigation we are all familiar with is the alarm system in the house, which provides a response to an unauthorized entry within some fixed amount of time. However, more sophisticated thieves diligently account for this mitigation and evolve their methods over time to bypass the alarm, allowing them to commit crimes more freely.

Like the sophisticated thieves in the example above, cyber-attacks are growing in sophistication and in number. The resulting damage is felt in multiple circles: the political arena, financial institutions and certainly individual citizens. The effects of some of this damage can last a lifetime and can bring the victims to the brink of disaster. Emerging data science techniques, when combined with the myriad types of available data, can help mitigate much of this damage.

The Importance of Data

How can data help? We now have access to many forms of data that can be gathered and compared to identify an attack. Consider the data we know about the thief: the method of entry into the house, the search for items in the house, the time of entry, the number of items leaving the house and the speed at which different items are collected. These are only a small sample of the available data points. When such characteristics are learned and compared against normal behavior, an attack can be identified with a high degree of accuracy. Yes, the data is quite complex, but it is also the key to accurate attack detection.
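To make the idea of learning and comparing such characteristics concrete, here is a minimal sketch in Python. It assumes three hypothetical features (hour of entry, rooms searched per minute, items leaving the house) and made-up observations; the function names `fit_baseline` and `anomaly_score` are illustrative only, and a simple z-score stands in for whatever model a real product would use.

```python
from statistics import mean, pstdev

# Hypothetical observations of the normal resident:
# (hour of entry, rooms searched per minute, items leaving the house)
NORMAL_VISITS = [
    (18.0, 0.2, 1),
    (17.5, 0.3, 0),
    (18.5, 0.2, 2),
    (19.0, 0.1, 1),
    (18.0, 0.2, 1),
]


def fit_baseline(rows):
    """Return a (mean, standard deviation) pair for each feature column."""
    columns = list(zip(*rows))
    return [(mean(col), pstdev(col)) for col in columns]


def anomaly_score(baseline, event):
    """Largest absolute z-score of the event across all features."""
    scores = []
    for (mu, sigma), value in zip(baseline, event):
        sigma = sigma or 1.0  # guard against a constant feature
        scores.append(abs(value - mu) / sigma)
    return max(scores)


baseline = fit_baseline(NORMAL_VISITS)
resident = (18.2, 0.2, 1)   # looks like every other evening
intruder = (3.0, 2.5, 12)   # 3 a.m. entry, frantic search, many items gone

print(anomaly_score(baseline, resident))
print(anomaly_score(baseline, intruder))
```

The resident's score stays close to zero while the intruder's is far above any reasonable threshold. Treating the hour of day as a plain number is a simplification; a real feature would account for the clock wrapping around at midnight.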

Enterprise Network Complexity

When the simple situation of robbing a house is extrapolated to an attack on a complex enterprise network, the defense also becomes quite complex. In the not-so-distant past, a cyber-attacker had to search a network extensively to identify critical and valuable information, such as credit card numbers, customer identifications and private health data, for a successful breach. In the current era of ransomware-style attacks and readily available cryptocurrency, the attacker does not have to search for or take away anything. Instead, the attacker becomes a blackmailer and makes everything unusable unless the ransom is paid. Cryptocurrency makes the ransom transaction harder to trace and helps keep the blackmailer hidden. Such blackmail is not very realistic in a typical house robbery; in the enterprise network environment, however, data gathering has to be much smarter to counter such complex attacks.

Traditional tools such as perimeter devices leverage known historical information to thwart attacks. Devices like firewalls periodically receive threat intelligence about known blacklisted attacker indicators, such as IP addresses, URLs and domains, and mitigate attacks by blocking or rejecting the matching connections. These devices also receive signatures of attack artifacts, such as files, and use them to identify and defend against attacks. IP addresses, URLs, domains and signatures thus make up some of the dimensions of the data used for security defense.
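The blacklist matching described above can be sketched in a few lines of Python. This is an illustration only, not how any particular firewall works: the feed contents are hypothetical (the addresses come from ranges reserved for documentation), and `is_blocked` is a made-up helper name.

```python
import ipaddress

# Hypothetical threat-intelligence feed entries
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]
BLOCKED_DOMAINS = {"malicious.example"}


def is_blocked(ip=None, domain=None):
    """Return True if the IP or the domain matches a blacklisted entry."""
    if ip is not None:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in BLOCKED_NETWORKS):
            return True
    if domain is not None:
        labels = domain.lower().rstrip(".").split(".")
        # Match the domain itself or any parent domain on the list
        for i in range(len(labels)):
            if ".".join(labels[i:]) in BLOCKED_DOMAINS:
                return True
    return False


print(is_blocked(ip="203.0.113.45"))             # inside a blocked network
print(is_blocked(ip="192.0.2.10"))               # not on the list
print(is_blocked(domain="cdn.malicious.example"))  # subdomain of a blocked domain
```

A lookup like this only catches what is already known, which is why the list has to be refreshed periodically and why it cannot detect a new attacker on its own.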

In the world of BYOD (bring your own device) and cloud-based infrastructure, the perimeter has become blurred, and the required number of data dimensions has grown significantly. These data dimensions are typically proportional to the attack vectors, because each dimension represents a method used to attack. Thus, the growth in data dimensions reflects a larger attack surface and, with it, a more complex cyber defense. This increase in attack vectors must be countered by more sophisticated defense strategies.

Feature Definition Process

To structure enterprise-level cyberattack defenses, the attack surface and the corresponding attack vectors should be identified. Today’s Advanced Persistent Threats (APTs) progress over time, making it difficult to detect them early in their progression. The attack vectors have to account for slow-moving threats that use sophisticated methods to hide or stay under the radar. To account for such threats, data has to be collected over time and then translated into attack vectors. This translation of data collected over time into attack vectors is referred to by some groups as feature definition or feature engineering. The attack vectors and the corresponding features for such APTs usually include mathematical time series functions. These functions play a key role in detecting and containing APTs as early as possible to minimize or eliminate the damage.
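As a rough illustration of a time series feature, the Python sketch below turns daily observations into a rolling seven-day total. The traffic figures and both thresholds are invented for the example, and `rolling_sums` is a hypothetical helper; the point is only that a slow-moving threat can stay under a per-day limit while still standing out once the data is aggregated over time.

```python
from collections import deque


def rolling_sums(values, window):
    """Sum of the most recent `window` values at each time step."""
    recent = deque(maxlen=window)
    sums = []
    for value in values:
        recent.append(value)
        sums.append(sum(recent))
    return sums


# Hypothetical megabytes sent per day by one host to an external address
daily_mb = [5, 4, 6, 40, 45, 42, 44, 41, 43, 45]
DAILY_LIMIT = 50    # hypothetical per-day alert threshold
WEEKLY_LIMIT = 200  # hypothetical 7-day alert threshold

# No single day crosses the daily limit...
daily_alerts = [day for day, mb in enumerate(daily_mb) if mb > DAILY_LIMIT]

# ...but the 7-day rolling total does
weekly = rolling_sums(daily_mb, window=7)
weekly_alerts = [day for day, total in enumerate(weekly) if total > WEEKLY_LIMIT]

print(daily_alerts)   # []
print(weekly_alerts)  # [7, 8, 9]
```

The per-day check never fires, while the rolling feature raises an alert from day 7 onward. Real feature sets would combine many such windows and statistics rather than a single sum.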

One method of countering complex APTs is to define features that evolve at a faster pace than the attack surface. Such a feature set can, however, start out with minimal information to track the progression and surface threats early in their inception. Modern feature definitions are far more complicated and involved, which makes it difficult to mount an effective cyber defense against today’s attacks. In practice, feature identification and tuning are the most important aspects of an effective cyber defense strategy and, correspondingly, take up the largest portion of a machine learning team’s effort. Many teams incorporate AI to make their machine-learning-based cyber defense techniques more effective. Even though sophisticated algorithms are required for good cyber defense, they will fall far short of their capabilities in the absence of good feature engineering practices. Well-conceived and properly executed feature engineering can make even average algorithms work quite well.

You may have heard of the streetlight effect, a term sometimes used to describe a form of observational bias. It describes a scenario where a drunkard searching under a streetlight is asked, “Did you lose the keys here?” and answers, “No, but the light is much better here.” The lesson carries over: the data fueling the feature definition process is one of the most important aspects of an effective cyber defense strategy. An incoherent approach to addressing weaknesses in the attack surface, akin to “looking for keys under a light,” is likely to put the cyber defense team off track and result in dangerous exposure, putting a company at considerable business risk of a breach. It is important to know where you are likely to be attacked from, along with the corresponding attack vectors and attack surface. Without this information, you are more likely to become a victim of a cyberattack. The takeaway? Data is the secret sauce for effective detection and containment of threats.