> Keep all of your content behind a walled garden. Carefully curate and nurture the kind of content that exists on your platform.

By resisting the urge to make all content public, and thereby exposing it to being scraped into an LLM's training set, you can build **data moats**. Large Language Models are being commoditised, which means data is the new moat, and social networks that rank high on user-generated content will do really well in the AI era.

### AI systems can improve performance by **five big levers**

##### 1. Data Quality

This is by far the most important lever for AI model performance. Ask anyone working on applied machine learning problems for a living and they will concur that access to high-quality, dense and clean data moves performance metrics more than anything else. The saying "garbage in, garbage out" is particularly relevant in AI systems: even with optimal architecture and hyperparameters, poor data quality will severely limit model capabilities.

![[Pasted image 20241226155957.png]]

My first role at IBM involved building low-latency, [context-agnostic convolutional neural networks](https://betterprogramming.pub/building-a-general-classification-system-for-image-quality-defects-beadbe026a19) to eliminate garbage data collected during image sourcing, so I have witnessed first-hand how important high-quality data is.

> What does it mean for your social network?

The day you decide to go down the path of personalization, the most important thing will be access to the user attributes that let you personalize user journeys, content recommendation, notifications and so on. This means having a strong data strategy from day one is crucial. It allows you to do both inferential statistics to figure out what to build and predictive modelling to build consumer AI features that genuinely retain users.

##### 2. Model Architecture

Frontier AI research will keep pushing LLMs to get better along the three major constraints of AI:

1. Latency
2. Benchmark performance
3. Energy efficiency

While model architecture is important, it's becoming increasingly standardized across the industry. The key differentiator isn't the architecture itself but how well it can leverage the data on your platform.

> What does it mean for your social network?

A walled garden approach ensures that when you eventually do use advanced machine learning techniques across RecSys, NLP, multi-armed bandits, hypothesis testing etc., you will be uniquely positioned to perform better on vertically integrated AI use cases than any large company. It also allows unique optimizations across latency and internal performance benchmarks for LLMs, RecSys, content moderation and other value offerings.

##### 3. Data Scale

Data scale during pre-training has been very important. Model performance improves significantly as more data provides broader coverage of the population distribution. It's one of the levers you can optimize for by ensuring that users create more content. It generally becomes less important at peak scale, where any improvement yields only marginal gains, but when you are starting out it is the air you need to breathe.

##### 4. Post-Training

Post-training is an excellent way to ensure that models behave in the intended manner in production.
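As a toy illustration of what such a guardrail can look like, here is a minimal output post-processing sketch. Everything in it, the redaction patterns, the length cap and the function name, is hypothetical and chosen purely for illustration; it is a sketch of the idea, not a production moderation pipeline.

```python
import re

# Hypothetical policy: patterns that should never appear in model output
# (e.g. contact details lifted from private posts).
BLOCKED_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email-address-like strings
    re.compile(r"\+?\d[\d\s().-]{8,}\d"),     # phone-number-like strings
]

def post_process(model_output: str, max_chars: int = 2000) -> str:
    """Apply simple output guardrails before a response reaches production."""
    text = model_output[:max_chars]           # hard length cap
    for pattern in BLOCKED_PATTERNS:
        text = pattern.sub("[redacted]", text)  # redact sensitive spans
    return text

if __name__ == "__main__":
    raw = "Sure! You can reach the user at jane.doe@example.com or +1 (555) 010-2334."
    print(post_process(raw))
    # -> "Sure! You can reach the user at [redacted] or [redacted]."
```

In a real system a layer like this would sit alongside the access controls discussed next, rather than replace them.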
As your network scales, the models are constantly being trained on larger volumes of data. Access control becomes important at that point: you don't want models to reveal data to unintended parties. Here RLHF, prompt augmentation and data access controls matter, but they are generally outside the scope of this topic. Just note that the context of the problem statement trumps any technical advantage here, since it's usually a matter of taste and philosophy.

##### 5. Hyperparameters

> A **hyperparameter** is a parameter that can be set in order to define any configurable part of a model's learning process.

Hyperparameter selection and tuning generally translates into the least meaningful gains of all the levers mentioned above. Think of it as unlocking a few basis points worth of improvement, but it is worth doing once you have settled on a model architecture that is performant. Hyperparameter tuning is an active area of research. There are several key approaches here, but [Optuna](https://optuna.org/) works really well and uses a Tree-structured Parzen Estimator for fast convergence on hyperparameters (a minimal sketch of an Optuna tuning loop is included at the end of this note).

---

Having a walled garden approach really helps defend your applied AI features while staying deeply vertically integrated in your application layer. It also means that users unlock greater value, while telemetry and user data give you better insights into what users want from your platform. In the end, all companies will need a definitive plan on AI; social companies have the highest throughput of data streams, and a walled garden approach is a great way to curate user experiences while being so deeply entrenched that others can't replicate your offerings.
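For reference, here is the minimal Optuna sketch mentioned above. The dataset, classifier and search space are placeholders chosen purely for illustration; only the Optuna calls (`create_study`, `TPESampler`, the `suggest_*` functions) reflect the library's actual API.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in dataset; in practice this would be features derived from your platform's data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def objective(trial: optuna.Trial) -> float:
    # Search space over a few classifier hyperparameters.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "max_features": trial.suggest_float("max_features", 0.2, 1.0),
    }
    clf = RandomForestClassifier(random_state=0, **params)
    # Cross-validated accuracy is the quantity the sampler tries to maximize.
    return cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

# TPESampler is Optuna's Tree-structured Parzen Estimator sampler.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=30)

print("Best params:", study.best_params)
print("Best CV accuracy:", study.best_value)
```

The TPE sampler builds a probabilistic model of which hyperparameter regions have produced good trials so far and draws new trials from those regions, which is why it tends to converge faster than random or grid search.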