Yachay is an interconnected network of microservices and databases used to process natural language data. Initially just a collection of scripts for tokenization, categorization, and sentiment extraction, it slowly grew into a full-fledged natural language processing system that sifts through gigabytes of real-time data every day. Those data feeds include traditional media (e.g. New York Times articles), social media (e.g. Twitter and Reddit), messenger channels, tech blogs, GitHub profiles and issues, the dark web, and legal proceedings, as well as the decisions and publications of government regulators and legislators around the world.
Yachay has collected decades' worth of useful natural language data from those sources. While we have manually subscribed to a number of data feeds, such as official press release channels, most of our information comes from relevant content discovered through analysis of social media, chat channels, news aggregators, and similar sources. This means that even if we missed an original post, it will still enter our system once someone in the communities we monitor references it. Yachay components will not only grab that message, but will also follow any links to the original content to fully process and integrate it.
Yachay is already used by our clients in fintech, insurance, blockchain, regtech, and national security, and is rapidly expanding to other industries. Here are just some of the features used by Inca clients today:
Extracting meaningful content from media-rich resources is not a trivial task. The web pages we parse are often dynamic and contain information unrelated to the main article, such as ads, widgets, and legal announcements. To handle these cases, we employ a combination of machine learning and rule-based techniques similar to those that power “reader view” modes in modern browsers. They allow us to clean an article’s content and remove any unwanted elements to focus on what really matters.
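To give a feel for the rule-based side of this process, here is a minimal sketch of a reader-view-style extractor. It is an illustration only, not Yachay's actual implementation: it keeps paragraphs with enough text and low link density, and drops common boilerplate containers. All class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Containers that rarely hold article content in typical page layouts.
BOILERPLATE_TAGS = {"script", "style", "nav", "aside", "footer", "header"}

class BlockExtractor(HTMLParser):
    """Collect <p> blocks outside boilerplate containers, tracking how
    much of each block's text sits inside links."""
    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, chars_inside_links)
        self.in_p = False
        self.in_link = False
        self.skip_depth = 0     # nesting depth inside boilerplate tags
        self.text = []
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_p, self.text, self.link_chars = True, [], 0
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            self.blocks.append(("".join(self.text).strip(), self.link_chars))
            self.in_p = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_p:
            self.text.append(data)
            if self.in_link:
                self.link_chars += len(data)

def extract_article(html, min_len=40, max_link_density=0.5):
    """Keep only paragraphs that look like article prose: long enough,
    and not dominated by link text (menus, ad units, widgets)."""
    parser = BlockExtractor()
    parser.feed(html)
    kept = [t for t, links in parser.blocks
            if len(t) >= min_len and links / max(len(t), 1) <= max_link_density]
    return "\n\n".join(kept)
```

A production extractor would combine many more signals (tag depth, text density per subtree, learned weights), but the keep-or-drop structure is the same.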
Applying Optical Character Recognition allows Yachay to convert images into text before handing it over to other modules for processing. OCR is extremely useful when we come across non-text-based data sources, such as scanned PDF documents. In other cases, social media users might post screenshots or images. Before processing, we run extracted images through our media-to-text components and external services. Our media-to-text tools work not only with text-based images, but also with photos, drawings, video, and audio files.
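The media-to-text step can be pictured as a dispatch table that routes each file type to the appropriate converter before the text reaches the NLP modules. The sketch below is purely structural: the handlers are stubs standing in for real OCR and speech-to-text services, and all names are assumptions.

```python
from pathlib import Path

# Stand-ins for real OCR / speech-to-text calls; they only show the shape
# of the pipeline, not an actual conversion.
def ocr_image(path):
    return f"[ocr text from {path.name}]"

def transcribe_audio(path):
    return f"[transcript of {path.name}]"

def read_plain_text(path):
    return path.read_text(encoding="utf-8", errors="replace") if path.exists() else ""

# Hypothetical registry mapping media types to text-extraction handlers.
HANDLERS = {
    ".png": ocr_image, ".jpg": ocr_image, ".pdf": ocr_image,
    ".mp3": transcribe_audio, ".wav": transcribe_audio,
    ".txt": read_plain_text, ".html": read_plain_text,
}

def media_to_text(filename):
    """Convert any supported media file to text for downstream NLP modules."""
    path = Path(filename)
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported media type: {path.suffix}")
    return handler(path)
```

In a real deployment each handler would call an internal component or an external service; the dispatch structure makes it easy to add new media types without touching the rest of the pipeline.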
In addition to transcribing speech, we leverage neural network-based machine vision technologies to generate descriptions for media files to see if they contain any keywords our clients may be interested in and to recover the associated metadata. This approach is also effective in detecting phishing and defamation campaigns, which often use images to get around text-based spam filters.
Language-Agnostic Sentiment Analysis
Yachay provides sentiment scores for each triggered keyword by leveraging neural network services from Amazon, Google, and IBM. We focus on specific individual mentions of keywords and data structures (named entities, bitcoin addresses, etc.) that trigger an event. Our sentiment analysis runs on every relevant part of the text surrounding the keyword: the phrase, the context, and the whole document. The context is 200 characters to the left and right of the keyword. If the keyword is found within an article, you’ll get the article contents; if it is found in an external resource, such as a link or a PDF document, you’ll get the contents of that external resource. We also produce term-based sentiment scores using a more advanced NLP service from IBM Watson, which gives you the sentiment related to the specific keyword or data structure that we found.
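The context extraction described above (200 characters on each side of a matched keyword) can be sketched in a few lines. This is an illustrative helper, not Yachay's actual code; in the real pipeline each extracted context would then be sent to the sentiment services.

```python
def keyword_contexts(document, keyword, window=200):
    """Return the context for each occurrence of `keyword`:
    up to `window` characters on each side of the match, mirroring
    the 200-character context described above. Matching is
    case-insensitive; whole-document sentiment would simply run on
    `document` itself."""
    contexts = []
    lower_doc, lower_kw = document.lower(), keyword.lower()
    start = 0
    while True:
        i = lower_doc.find(lower_kw, start)
        if i == -1:
            break
        left = max(0, i - window)
        right = min(len(document), i + len(keyword) + window)
        contexts.append(document[left:right])
        start = i + len(keyword)
    return contexts
```

Each returned snippet is what a sentiment model would score at the "context" level, alongside the phrase-level and document-level scores.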
Compared to machine translation, which is often inaccurate and requires human analysts to extract intelligence, sentiment models have a much easier job. Just like humans who might not fully understand a foreign language but can easily pick up the general mood of a phrase, neural networks are capable of accurately quantifying sentiment. This sentiment score becomes an immediate, actionable intelligence signal that can produce alerts, be used for deeper analysis, or be fed into other systems, such as smart order routers or risk management tools. For example, traders need to see real-time decisions made by the Securities and Exchange Commission, Commodity Futures Trading Commission, or the People’s Bank of China the second they are published. While it may take days for a 200-page document in Mandarin to be translated and make its way from social media to the front page of the New York Times, Yachay users can be alerted about relevant term sentiment changes in a matter of seconds.
Finding relevant content requires us to sift through gigabytes of natural language data every day. Topic modeling assists us in this task and allows us to categorize relevant content before passing it to other NLP components. Some topics are easy to detect because Google and other data companies already trained corresponding neural network models. Other topics, such as those relevant to digital assets, require us to train new models from scratch by feeding hundreds of thousands of relevant articles into the system and continuously improving them with ever-changing jargon, technical terms, and named entities.
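As a toy illustration of the train-then-categorize flow, the sketch below implements a multinomial Naive Bayes classifier over word counts. The production system trains neural models on hundreds of thousands of articles; this stdlib-only version only shows how labeled examples turn into a model that routes new content to a topic. All names are illustrative.

```python
import math
from collections import Counter, defaultdict

class TopicModel:
    """Tiny trainable topic classifier (multinomial Naive Bayes with
    add-one smoothing) standing in for the neural topic models."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word counts
        self.doc_counts = Counter()              # topic -> number of docs
        self.vocab = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic, counts in self.word_counts.items():
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.doc_counts[topic] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic
```

The continuous-improvement loop mentioned above corresponds to calling `train` with newly labeled articles as jargon and named entities evolve.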
We also work with use cases that include content classification far beyond general topic modeling. Not being able to rely on dictionaries with keywords, we often build custom categorization models that tell us whether a specific event is being described. Among many others, we’ve built models for security events, such as money custodian attacks, internal fraud, customer support problems, and critical infrastructure uptime. Our customers often quantify extracted intelligence by looking at the rate of change in the number of categorized messages over time. This allows them to discover events as they happen, rather than reading conclusions after the fact. Our historical datasets also provide additional context for investigations, showing how the information spread through communities and where it originated.
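Watching the rate of change in categorized message counts can be sketched as a simple spike detector: bucket message timestamps, then flag buckets that sit far above the mean. This is a minimal illustration, assuming hourly buckets and a z-score threshold; the function name and parameters are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def detect_spikes(timestamps, bucket=3600, threshold=3.0):
    """Flag time buckets whose categorized-message count exceeds the
    mean by `threshold` standard deviations.

    `timestamps` are epoch seconds of messages already matched by a
    categorization model; returns the start times of spiking buckets.
    """
    counts = Counter(int(ts // bucket) for ts in timestamps)
    if len(counts) < 2:
        return []
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly flat rate, nothing to flag
    return sorted(b * bucket for b, c in counts.items()
                  if (c - mu) / sigma >= threshold)
```

A flagged bucket is the kind of real-time signal a customer would act on, with the underlying messages available for the follow-up investigation.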
Content categorization can often include additional logic that is important for extracting intelligence. For example, Yachay can classify regulator decisions not only as positive or negative, but also as a decision date extension, which in itself might have a serious impact on businesses involved in the ruling.
Named Entity Extraction
Yachay modules enrich processed natural language events with named entities they contain, including person names, companies, government actors, locations, etc. They attach named entity labels to corresponding events, allowing for easy search and analysis even in situations where the name is written differently or is not a part of specific keyword occurrences. For example, regulator litigation releases and enforcement action events will include normalized names of companies and persons associated with each document. This allows event correlation across different datasets and improves our aggregated agent database, which contains entity names, titles, affiliations, user names across social media, etc.
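The normalization step can be pictured as mapping the many surface forms of a name to one canonical label before it is attached to an event. The sketch below is a deliberately simplified, alias-table version, assuming a hand-picked table; real entity linking handles far more variation.

```python
import re

# Hypothetical alias table: several surface forms map to one
# normalized entity label. Illustrative entries only.
ALIASES = {
    "sec": "U.S. Securities and Exchange Commission",
    "securities and exchange commission": "U.S. Securities and Exchange Commission",
    "u.s. securities and exchange commission": "U.S. Securities and Exchange Commission",
    "cftc": "Commodity Futures Trading Commission",
}

def normalize_entities(text):
    """Return the normalized named-entity labels found in `text`, so
    searches work even when the name is written differently in the
    source document."""
    found = set()
    for alias, canonical in ALIASES.items():
        # \b keeps "sec" from matching inside words like "section"
        if re.search(r"\b" + re.escape(alias) + r"\b", text, re.IGNORECASE):
            found.add(canonical)
    return sorted(found)
```

Because "SEC" and "Securities and Exchange Commission" normalize to the same label, events from different datasets correlate on one entity rather than several spellings.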
Our aggregated agent database is also what fuels our content discovery module and is the main reason we do not need to hardcode a list of resources to monitor. Starting from a modest number of manually compiled agent lists, Yachay is capable of discovering a network of interconnected influencers, including software developers, security experts, finance professionals, etc. These are not simply the accounts with the biggest number of followers: we focus on those who produce original content before their message gets amplified by others. The profile information of identified influencers helps us discover their user accounts on other platforms. While someone’s Twitter account might be useful to establish the initial importance of an agent, it is often their GitHub account that helps us discover the original content and hidden affiliations.
Coupled with content categorization, our named entity processing modules can identify natural language data generated by a particular type of agent. For example, Twitter periodically shares datasets containing messages used in state-sponsored attacks. These have made it significantly easier for our prediction models to identify such content in the future, as well as to appropriately tag all our historical datasets. We have also had success building models capable of identifying phishing attempts, unregistered securities scams, illegal trading, and market manipulation attempts, to name a few. Real-time mapping of agent relationships allows us to follow the pools of agents where relevant information originates and to filter out bot-generated content at the data intake stage.
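One simple intake-stage heuristic for bot-generated content is coordination: the same text posted verbatim by many distinct accounts within a short window. The sketch below implements only that one signal and is not Yachay's actual filter; names and thresholds are assumptions.

```python
from collections import defaultdict

def flag_bot_content(posts, min_accounts=3, window=60):
    """Flag content posted verbatim by at least `min_accounts` distinct
    accounts within `window` seconds — a coordination signal typical
    of bot campaigns. Returns the set of flagged content strings.

    `posts` is an iterable of (author, content, timestamp) tuples.
    """
    seen = defaultdict(list)  # content -> [(timestamp, author), ...]
    for author, content, ts in posts:
        seen[content].append((ts, author))
    flagged = set()
    for content, hits in seen.items():
        hits.sort()
        for t0, _ in hits:
            accounts = {a for t, a in hits if t0 <= t <= t0 + window}
            if len(accounts) >= min_accounts:
                flagged.add(content)
                break
    return flagged
```

A production filter would combine this with account-level features (age, posting cadence, relationship graph position), but coordination alone already removes a lot of noise before the expensive NLP stages run.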
Yachay has grown from a stand-alone tool into a system with comprehensive real-time integration options. It can now consume real-time data from any web page, RSS feed, most social networks and chat applications, the dark web, REST APIs, SFTP servers, S3 buckets, etc. We partnered with Splunk to provide our customers with a comprehensive big data analysis suite, and we leverage interconnectivity with other information processors, such as the Google, Amazon, and IBM speech APIs, as well as our own AI module, which finds patterns, simulates agent behavior, and predicts future outcomes. The resulting data can be consumed via a dedicated web platform, email alerts with JSON/CSV/PDF attachments, periodic data dumps, or a well-documented and flexible REST API.
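To make the delivery side concrete, here is what a single enriched event might look like when serialized for an email attachment or API response. The field names are purely illustrative assumptions, not the actual Yachay REST API schema.

```python
import json

def build_event(keyword, sentiment, entities, source_url):
    """Serialize one enriched event as JSON. Field names are
    hypothetical, chosen only to mirror the enrichment steps
    described in this document."""
    event = {
        "keyword": keyword,
        "sentiment": sentiment,   # e.g. a score in [-1, 1]
        "entities": entities,     # normalized named-entity labels
        "source": source_url,     # where the content was discovered
    }
    return json.dumps(event, sort_keys=True)
```

Delivering events as plain JSON keeps them easy to feed into downstream systems such as Splunk dashboards, alerting rules, or a customer's own risk tools.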