Hello :) Today is Day 271!
A quick summary of today:
- conversations with industry experts from the DE course by Joe Reis
- starting Graph Algorithms for Data Science by Tomaz Bratanic
At the end of the DE specialisation course, there are 3 videos (about an hour in total) of Joe Reis interviewing industry experts, which I had not realised were there.
Conversation with Zach Wilson
Zach Wilson, a seasoned data engineer turned entrepreneur, shared his thoughts on the evolving landscape of data engineering, the challenges faced by newcomers, and how the role is expected to change in the coming years. Zach, who has worked with tech giants like Facebook, Netflix, and Airbnb, offered valuable insights from his decade-long career.
From Engineer to Entrepreneur
Zach’s career began in big tech, where he developed his skills as a data engineer. After nearly a decade working on cutting-edge projects, he transitioned from engineering roles to entrepreneurship. He now runs a bootcamp called DataExpert.io, aimed at helping aspiring data engineers break into the field. His career shift highlights the importance of continuous learning and adapting to new challenges in the tech industry.
Breaking into Data Engineering
One of the key points Zach emphasized is the difficulty of landing a junior data engineering role. Unlike in software engineering or data analysis, entry-level positions in data engineering are relatively rare. This is primarily because companies often prefer experienced professionals to manage critical data pipelines. Zach recommends starting in adjacent roles, such as software engineering or data analysis, to build relevant skills. He shared his own experience of starting in a software engineering role focused on automation, which later helped him transition into data engineering.
The Skill Set of a Data Engineer
Zach highlighted the importance of mastering several key skills to succeed as a data engineer: proficiency in Python, Bash, SQL, and cloud technologies like BigQuery, Airflow, and Spark. He also stressed the importance of understanding Linux commands and being able to integrate various services effectively. However, beyond technical skills, Zach believes that knowing when not to build a pipeline is equally important—a sign of a mature and experienced data engineer.
Knowing When You’ve Made It
When asked how one knows they’ve become a good data engineer, Zach provided an interesting perspective. It’s not just about building pipelines efficiently; it’s also about understanding the bigger picture. A good data engineer thinks critically about why a pipeline is needed and its impact on the business. Zach recounted his own shift in mindset at Airbnb, where he learned to say no to unnecessary tasks, allowing him to focus on more valuable work and avoid burnout.
Career Advice for Aspiring Data Engineers
Zach offered valuable advice for those just starting out. He advised against the common belief that “your work will speak for itself.” Instead, he stressed the importance of marketing yourself and seeking feedback to improve. He also encouraged learning in-demand technologies like Apache Spark early in one’s career, as this can open up more opportunities and make a data engineer more competitive in the job market.
Keeping Skills Sharp and Growing in Your Career
To stay relevant in data engineering, Zach suggested continuously pushing the boundaries of the three V’s—volume, velocity, and variety. He shared his experience at Netflix, where he took on projects outside his job responsibilities to work on larger, more complex pipelines. This kind of initiative not only enhances one’s skills but also positions a data engineer as a valuable asset within a company.
The Future of Data Engineering
Looking ahead, Zach sees data engineering evolving in two significant ways. First, he believes that the roles of data engineers and backend engineers will increasingly merge, particularly in environments where service owners are responsible for their data. This hybrid role, which he refers to as “Software Engineer, Data,” is likely to become more common. Second, Zach predicts a rise in the importance of unstructured data, driven by advancements in AI and machine learning. However, he cautions that this will require new techniques to ensure data quality, especially in the face of challenges like LLM hallucination.
Conversation with Carly Taylor
Carly Taylor, a machine learning engineer at Activision working on Call of Duty, shared her remarkable journey into data science, her experiences in the field, and valuable career advice for those aspiring to enter the tech world. Carly’s story is one of determination, continuous learning, and adapting to new challenges—a journey that took her from a background in chemistry to the cutting-edge world of machine learning.
From Chemistry to Data Science
Carly’s entry into data science is a testament to the diverse paths one can take into this field. With a graduate degree in chemistry, Carly initially found it challenging to secure a job in her domain. However, as many of her peers began moving into the emerging field of data science, particularly in FinTech, she saw an opportunity to apply her quantitative skills in a new way. Intrigued by the potential and the exciting challenges data science offered, Carly made the leap, beginning her career as a marketing analyst despite having no prior knowledge of marketing.
This unconventional start highlights a key point Carly emphasized: sometimes, breaking into data science means leveraging the skills you have, even if you’re not a domain expert. Her technical background helped her land the role, and she learned the necessary marketing knowledge on the job. If she could do it over again, Carly noted she might have taken some business classes to ease the transition, but ultimately, her story illustrates the power of on-the-job learning and adaptability.
Life as a Machine Learning Engineer at Activision
Today, Carly works on the machine learning team at Activision, focusing on security for Call of Duty. Her role involves a mix of tasks, but much of her time, like many in similar positions, is spent on exploratory data analysis (EDA). Carly estimates that about 70% of her team’s time is dedicated to EDA, with the remaining 30% split between modeling and production tasks. The work is hands-on, data-driven, and requires constant adaptation to new challenges, especially in a fast-paced environment like gaming.
Interestingly, Carly explained that at her company, the lines between different roles—like data engineering, software engineering, and machine learning—are often blurred. This fluidity allows her team to be flexible, with engineers frequently stepping into roles that might not be strictly defined by their job titles. This versatility is a crucial skill in modern tech environments, where the ability to wear multiple hats can significantly enhance a team’s productivity and innovation.
Understanding the Difference: Data Engineering vs. Machine Learning Engineering
Carly offered her perspective on the distinction between data engineering and machine learning engineering, a topic often debated in tech circles. She explained that while data engineers focus on ensuring data integrity, availability, and pipeline reliability, machine learning engineers are tasked with understanding the data in a business context and building predictive models. The two roles complement each other, with data engineers ensuring that the data is clean and usable, while machine learning engineers apply that data to solve complex problems.
Climbing the Ladder: From Individual Contributor to Leader
Carly’s transition from individual contributor to team leader was not a planned move, but rather a response to a gap in leadership that needed filling. Her advice to those considering a move into management is to be open to stepping up when the situation calls for it, rather than feeling pressured to follow a strict career plan. Leadership, in her experience, often comes from recognizing a need and having the courage to fill it.
Advice to Her Younger Self and to Learners
Reflecting on her career, Carly shared the advice she would give her younger self: give yourself grace. Early in her career, she felt immense pressure to keep up with every new development in data science, leading to burnout. Over time, she realized the importance of focusing on a specialty, mastering it, and letting go of the need to know everything about everything. Carly introduced the concept of a T-shaped skill set—a broad understanding of many areas with deep expertise in one. This, she believes, is key to success in data science.
Standing Out in a Competitive Job Market
In today’s competitive job market, Carly emphasized the importance of going the extra mile, especially when applying for jobs. One emerging trend she discussed is the use of video submissions during the application process. While many candidates might resist this, Carly sees it as an opportunity to stand out. By putting in a bit more effort—whether it’s dressing well, speaking naturally, or ensuring good lighting—you can differentiate yourself from the majority who might approach these tasks with minimal effort.
Conversation with Ben Rogojan
In this interview, Ben Rogojan, widely known as the Seattle Data Guy, sat down to discuss his journey and insights into the world of data engineering. Ben, who has spent nearly a decade in the data industry, shared how he initially started with a strong focus on data science, influenced by the popularity of the field around 2012 to 2015. He adjusted his college courses to align more with bioinformatics and statistics, preparing for a career in data science. However, over time, Ben discovered a preference for the software and programming aspects of data work, which led him to stumble into data engineering almost by accident. His first official role as a data engineer came at a healthcare analytics startup, where he honed his skills in SQL, Python, automation, and data warehousing. This role set the foundation for his career, eventually leading him to work at Facebook before transitioning into consulting within the data engineering and data infrastructure space.
Early Career and Healthcare Analytics
During the interview, Ben discussed the nature of his work at the healthcare analytics company, where he was responsible for managing and standardizing large datasets from various insurance providers. The data, primarily healthcare claims, required meticulous standardization due to the differing formats in which it was received, even though the content was generally similar. Ben explained how this experience taught him about the various ways data can be formatted and the challenges associated with managing such complex information. His work at the company also involved building products focused on fraud detection and monitoring opioid distribution, all while working with tools like PowerShell and SQL Server, which he described as unglamorous but essential for the job.
Transition to Facebook and Mature Data Infrastructure
Ben’s move to Facebook marked a significant shift in his career, where he encountered a more mature data infrastructure. He noted that while Facebook’s data systems were advanced and often seen as unchallenging by some, the real challenge lay in identifying valuable problems to solve and effectively communicating within the organization. Ben emphasized the importance of not just being technically proficient but also understanding and addressing the business’s needs. He shared a valuable lesson from his time at Facebook: instead of merely completing tasks, data engineers should focus on identifying recurring issues and finding ways to eliminate them, thereby adding more value to the business.
Advice for Aspiring Data Engineers
When asked about advice for aspiring data engineers, Ben recommended starting with a solid foundation in technical skills such as SQL, Python, and data warehousing. However, he also stressed the importance of understanding the business context of these technologies. Ben encouraged engineers to ask critical questions about the purpose of the systems they are building and the real impact on the business. He acknowledged the challenge of learning about the business side of things, especially early in one’s career, but emphasized that it becomes increasingly important as engineers advance in their roles.
Future of Data Engineering and Industry Insights
Looking to the future, Ben expressed hope that the industry would continue to focus on the fundamentals of data management and quality, especially as interest in AI continues to grow. He pointed out the challenges companies face in managing their diverse data infrastructures and the importance of implementing reliable and usable datasets over the long term. Ben remains optimistic that with a renewed focus on these basics, the industry can better leverage AI and other advanced technologies.
Actually, I listened to these talks in the morning, but afterwards I saw that one of my courses - Research Ethics and Methodology - had a lot of assignments waiting, because I joined a bit late and wanted to catch up on the recorded lectures and mini-assignments. So from lunch until the evening I watched lectures and worked through the mini-assignments on research writing and methodology basics. One of the mini-assignments was ‘Why did I choose the title for my thesis?’. My title at the moment is ‘Using Video Generation Models over Graph Neural Networks for Taxi OD Demand Matrix Prediction’, and I wrote something like: This title highlights key topics like video generation models, graph neural networks, and taxi OD demand matrices, making it easier for the paper to be found in search engines and giving readers a clear idea of its content right away. A well-crafted title can also capture a reader’s interest, helping them decide whether the paper is worth their time. Additionally, this title reflects an innovative approach to tackling real-world traffic prediction problems by comparing video generation techniques with advanced graph neural networks.
Graph Algorithms for Data Science by Tomaz Bratanic
The first thing in the book is a mention of the graph data scientist role - so there is a Graph DS role now ~ cool
This book has been on my to-read list for a bit so I’m excited to read it in the coming days (maybe even on stream)
Chapter 1 Graphs and network science: An introduction
If you have ever done any analysis, you have probably used a table representation of data, like an Excel spreadsheet or a SQL database. And if you deal with large numbers of documents, you might have used the Parquet format or JSON-like objects to represent the data.
Sometimes the relationships between data points are explicitly specified, like a person and their friendships; in other cases they are indirect or implicit, like a correlation between data points that can be calculated. Whether explicit or implicit, relationships add context that can significantly improve the analysis output. The book shows a small example of a graph consisting of five nodes connected by five relationships. Nodes (or vertices) can represent various real-world entities, such as persons, companies, or cities, but they can also depict concepts, ideas, and words.
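Out of curiosity, here is a quick Python sketch of such a five-node, five-relationship graph using networkx - the node names and relationship meanings are my own invented stand-ins, not the book's figure:

```python
# Toy five-node, five-relationship graph; names are invented placeholders.
import networkx as nx

G = nx.Graph()

# Nodes can be real-world entities: here, four people and a city.
G.add_nodes_from(["Alice", "Bob", "Carol", "Dan", "London"])

# Relationships add context between the nodes.
G.add_edges_from([
    ("Alice", "Bob"),     # friends
    ("Alice", "Carol"),   # friends
    ("Bob", "Dan"),       # colleagues
    ("Carol", "London"),  # lives in
    ("Dan", "London"),    # lives in
])

print(G.number_of_nodes(), G.number_of_edges())  # 5 5
```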
Understanding data through relationships
Graph visualizations are valuable for gaining insights but can lose effectiveness with large datasets. When faced with vast amounts of data, graph algorithms can be utilized to derive insights or identify significant parts of the graph for further exploration through visualization. For example, in a large social network like Facebook, visualizing the entire structure is challenging, but graph algorithms can reveal key user groups or important nodes that influence information flow.
Graph algorithms, such as community detection, help identify groups of closely connected nodes, similar to friend groups in social networks. These communities can be used for content recommendations, marketing strategies, or understanding the graph’s structure. Centrality algorithms identify the most important nodes, like social media influencers, who control information flow between different communities.
These insights can enhance machine learning models by incorporating information about node positions and roles, improving predictive accuracy. Graph-based machine learning leverages these connections, using features that describe a node’s network role, aiding in tasks like node classification, regression, and link prediction.
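As a rough illustration of those two algorithm families (my own toy example with networkx, not code from the book):

```python
# Community detection and centrality on a small stand-in for a social network.
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Community detection: find groups of closely connected nodes ("friend groups").
for i, members in enumerate(community.greedy_modularity_communities(G)):
    print(f"Community {i}: {sorted(members)}")

# Centrality: rank nodes that sit on many shortest paths and so control
# information flow between communities.
betweenness = nx.betweenness_centrality(G)
top5 = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
print("Most central nodes:", top5)
```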
How to spot a graph-shaped problem
Self-referencing relationships
The first scenario deals with self-referencing relationships between entities of the same type. In relational databases, a self-referencing relationship occurs between rows of the same table. In a graph, these can be modeled with a single node type and a single relationship type.
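As a rough sketch (my own toy example, not the book's), here is how a self-referencing ‘employees’ table might be turned into a graph with one node type and one MANAGES relationship type:

```python
# Self-referencing relational table -> graph with a single node type
# (Employee) and a single relationship type (MANAGES). Data is invented.
import networkx as nx

# (id, name, manager_id) rows; manager_id references the same table.
employees = [
    (1, "Ava", None),
    (2, "Ben", 1),
    (3, "Cleo", 1),
    (4, "Dev", 2),
]

G = nx.DiGraph()
names = {emp_id: name for emp_id, name, _ in employees}
for emp_id, name, manager_id in employees:
    G.add_node(name)                                  # node type: Employee
    if manager_id is not None:
        G.add_edge(names[manager_id], name)           # relationship: MANAGES

print(list(G.edges()))  # [('Ava', 'Ben'), ('Ava', 'Cleo'), ('Ben', 'Dev')]
```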
Graph algorithms are domain agnostic, meaning they can be applied across various fields regardless of what the nodes represent—be it people, services, or other entities. This versatility allows the same algorithms to be used in diverse scenarios, such as:
- Bank transaction networks for fraud detection
- Service dependency graphs to assess how vulnerabilities spread and to defend against cyberattacks
- Supply chain optimization
- Social media analysis
- Managing complex telecommunication networks
- Cybersecurity
These applications highlight the broad utility of graph algorithms in optimizing and securing complex systems.
Pathfinding networks
Another fairly common graph scenario is discovering paths or routes between entities or locations - like finding the optimal route when travelling.
Traditional databases can struggle with queries that traverse an unknown number of relationships between nodes, leading to complex and computationally expensive operations. Treating the data as a graph makes these traversals far more natural and efficient.
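To make that concrete, here is a minimal pathfinding sketch with networkx and invented distances; a relational version of the same query would need recursive self-joins:

```python
# Shortest path on a small weighted road network (distances are invented).
import networkx as nx

roads = nx.Graph()
roads.add_weighted_edges_from([
    ("A", "B", 4), ("A", "C", 2),
    ("B", "D", 5), ("C", "D", 8),
    ("C", "E", 10), ("D", "E", 2),
])

# Optimal (lowest total distance) route from A to E via Dijkstra.
path = nx.shortest_path(roads, "A", "E", weight="weight")
cost = nx.shortest_path_length(roads, "A", "E", weight="weight")
print(path, cost)  # ['A', 'B', 'D', 'E'] 11
```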
Finding optimal routes using graph algorithms is beneficial in several scenarios, including:
- Logistics and routing
- Infrastructure management
- Identifying optimal paths for making new contacts
- Payment routing
Using a graph-based approach can simplify and optimize these processes.
Bipartite graphs
Another compelling use case for graphs is examining indirect or hidden relationships. The graphs in the previous two scenarios had only one node type; a bipartite graph, on the other hand, contains two node types and a single relationship type - for example, customers and the products they buy.
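A tiny sketch of that idea (invented customers-and-products data): projecting the bipartite graph onto one node type surfaces the hidden customer-customer relationships.

```python
# Bipartite graph with two node types (customers, products) and one
# relationship type (BUYS), then a projection onto customers.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
customers = ["Ana", "Bo", "Cy"]
products = ["laptop", "mouse", "desk"]
B.add_nodes_from(customers, bipartite=0)
B.add_nodes_from(products, bipartite=1)
B.add_edges_from([
    ("Ana", "laptop"), ("Ana", "mouse"),
    ("Bo", "mouse"), ("Bo", "desk"),
    ("Cy", "laptop"),
])

# Two customers become linked if they bought at least one common product;
# the edge weight counts the shared products.
projection = bipartite.weighted_projected_graph(B, customers)
print(projection.edges(data=True))  # Ana-Bo share "mouse", Ana-Cy share "laptop"
```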
Complex networks
The last example is a complex network with many connections between various entities; a biomedical graph is one such scenario.
Explaining all the terminology behind biomedical entities and their relationships could fill a whole book. However, we can observe that biomedical concepts are highly interconnected and are therefore an excellent fit for a graph model. Biomedical graphs are popular for representing existing knowledge and making predictions about new treatments, side effects, and more.
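Those predictions are often framed as link prediction. A minimal sketch of one classic approach - the Jaccard coefficient in networkx - on an invented toy graph rather than real biomedical data:

```python
# Link prediction via the Jaccard coefficient: score non-adjacent node pairs
# by neighbourhood overlap; higher scores suggest plausible missing links
# (e.g. candidate drug-disease associations). Toy data, invented names.
import networkx as nx

G = nx.Graph([
    ("drug_A", "protein_1"), ("drug_A", "protein_2"),
    ("drug_B", "protein_2"), ("drug_B", "protein_3"),
    ("disease_X", "protein_1"), ("disease_X", "protein_3"),
])

pairs = [("drug_A", "disease_X"), ("drug_B", "disease_X")]
for u, v, score in nx.jaccard_coefficient(G, pairs):
    print(f"{u} - {v}: {score:.2f}")  # both pairs score 0.33 here
```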
That is all for today!
See you tomorrow :)