Top Technical Skills: Python, R, SQL, Cypher (Neo4j)

Greetings! I'm Maria Aroca, a data scientist with an academic background in Political Science. I’m currently a lead researcher at CIDEACC (Center for Development, Innovation, and Artificial Intelligence at Clinica de La Costa). I’m passionate about exploring the intersection between computational methods and the social sciences to create real-world applications for the exploration of clinical, social, and political data.

About me

Areas of Interest

  • Used Neo4j and Cypher to construct and query a graph database on political power at 20 Moves, and to model government procurement activity at Secretaría de Transparencia.

    Used graph databases in my doctoral dissertation to construct a graph of legislator voting behavior and legislative activity.

    Familiar with the Neo4j Python library for graph database integration and manipulation, and Neo4j’s APOC and Graph Data Science libraries.

  • Used Python’s NetworkX, igraph, graphistry, and graph-tool, and R’s igraph, sna, statnet, visNetwork, and r2d3, to visualize, analyze, and model networks in various research projects and professional roles, alongside more widely known libraries such as scikit-learn for machine learning and TensorFlow for deep learning.

    Acquired a theoretical foundation in Network Science through a dedicated course as part of my MS degree at Indiana University Bloomington.

  • Applied natural language processing (NLP) techniques using NLTK, spaCy, Gensim, CoreNLP, and Hugging Face Transformers in various projects.

    Took dedicated courses in NLP for Data Science and Social Media Mining at Indiana University Bloomington.

    Worked on projects using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), with Python libraries such as LangChain.

  • Developed ETL pipelines using Python’s pandas and NumPy, web scraping libraries like BeautifulSoup, Scrapy, and Selenium, and large-scale data processing tools like Apache Spark and PySpark.

    Took dedicated courses as part of my MS degree at Indiana University Bloomington on "Applied Database Technologies" and "Management Access and Use of Big Data", working with various SQL and NoSQL databases, including MySQL, SQLite, PostgreSQL, MongoDB, Apache Cassandra, Redis, and Neo4j.

    Experienced in building ETL pipelines on cloud-based platforms, including Google Cloud Platform (GCP) BigQuery, Amazon Web Services (AWS) S3, Jetstream2, and Microsoft Azure Databricks.

  • Created data visualizations using Matplotlib, Seaborn, Altair, Bokeh, and Plotly in Python, as well as ggplot2, Plotly, and Shiny in R, to illustrate and communicate data insights effectively to a diverse audience.

    Took a dedicated data visualization course at Indiana University Bloomington.

  • Supported government transparency initiatives through my work at Secretaría de Transparencia and Congreso Visible.

    Familiar with Python clients for accessing open government datasets, such as sodapy; tools for interacting with API endpoints, such as Insomnia and Postman; and API query languages such as GraphQL.

  • Developed expertise in Legislative Institutions and Electoral Institutions as part of my PhD studies at Rice University.

  • I was a fellow at the 2024 Complexity Global School, organized by Universidad de los Andes in Colombia and the Santa Fe Institute (SFI) in the USA. I received training on the study of complex systems, Agent-Based Models (ABMs), network analysis, scaling methods, and computational social science.


Education

  • M.S., Data Science - Indiana University Bloomington (Dec 2024)

  • Ph.D., Political Science - Rice University (May 2022)

  • M.A., Political Science - Universidad de los Andes (Aug 2014)

  • B.A., Political Science - Universidad de los Andes (Feb 2012)

Work Experience

Lead Researcher @ CIDEACC (September 2024 - Present)

  • Leading AI-driven research projects focusing on predictive diagnostics, personalized medicine, and optimizing clinical workflows through machine learning algorithms and data analytics.

  • Collaborating with clinicians and engineers to develop AI solutions for healthcare, including the application of natural language processing and computer vision to assist in medical diagnostics.

Data Scientist @ 20Moves (June 2023 - August 2024)

  • Developed a knowledge graph with over 22 million nodes, enabling complex analysis of New York's political landscape and supporting strategic decision-making for social movements.

  • Implemented Python pipelines and Cypher queries in Neo4j to integrate structured, unstructured, API, web-scraped, and LLM-retrieved data, improving the organization's data analytics and the usability of its application.

  • Used network analysis techniques to identify key influencers and pathways in political data, making the application more useful for navigating New York's political network.

Adjunct Professor (Catedrática) @ Universidad del Norte (July 2023 - Present)

  • Currently teaching "Electoral Systems and Political Participation" (POL4523) to undergraduates, incorporating practical skills in spreadsheet usage and quantitative data analysis.

Assistant Professor @ Universidad del Norte (January 2022 - July 2023)

  • Taught undergraduate and graduate Political Science courses on political institutions, electoral systems, and research design, the last of which included lectures on programming in R.

  • Created a software tool for the University's Office of Research to analyze and visualize faculty publication output, enhancing academic productivity analysis.

  • Mentored undergraduate and graduate students in independent research projects, fostering research skills in the field of political science.

Data Scientist @ Secretaría de Transparencia - Presidencia de la República de Colombia (November 2020 - December 2021)

  • Designed and implemented ETL pipelines and reporting tools for portal.paco.gov.co, analyzing procurement activities of Colombian government entities, supporting government transparency. The production-grade code continues to function effectively.

  • Developed strategies to integrate machine learning and graph databases to detect potential corruption in government procurement processes.

  • The tool has become a critical resource for journalists, citizens, and government officials, used in high-profile cases to expose irregularities in procurement activities, demonstrating its significant impact on promoting transparency and accountability.

Research Assistant @ Congreso Visible - Universidad de los Andes (July 2010 - July 2014)

  • Collected, cleansed, and structured extensive datasets on legislative activities for the transparency initiative of Congreso Visible, aiding legislators, lobbyists, and citizens in accessing current information on the activity and members of the Colombian national legislature.

  • Utilized statistical analysis to produce detailed reports on legislative activity in the Colombian Congress.