[Chaoss-software] [Meeting item] Collaboration of projects within the Software TC

Fri Oct 20 22:31:00 UTC 2017

On Thu, 2017-10-19 at 10:24 -0500, Sean Goggins wrote:
> I have a few comments inline. 

Thanks a lot, Sean. Your views are very useful. See my own comments
inline too.

> > On Oct 19, 2017, at 2:21 AM, Daniel Izquierdo
> > <dizquierdo at bitergia.com> wrote:
> > 
> > Hi Jesús,
> > 
> > Thanks for this set of emails. Some comments in line.
> > 
> > On 18/10/17 00:49, Jesus M. Gonzalez-Barahona wrote:
> > > As I commented in the thread proposing our kick-off meeting as the
> > > CHAOSS TC, I'm going to start email threads with the proposed
> > > topics. Let's see if this works. If it doesn't work, I'll announce
> > > a time slot for a synchronous meeting next week.
> > > 
> > > This is the first item that I proposed to discuss:
> > > 
> > > * Item:
> > > 
> > > According to our charter [1], we should "produce integrated, open
> > > source software for analyzing software development". So, we should
> > > discuss how to start working in this direction.
> > > 
> > > [1] https://chaoss.community/about/governance/
> > > 
> > > * Discussion:
> > > 
> > > We now have three projects in the CHAOSS Software TC: Prospector,
> > > GrimoireLab, cregit. During the conversations that led to the
> > > launch of CHAOSS, we decided that, at least for a start, the idea
> > > was to have GrimoireLab as the "glue" for all the projects, so that
> > > they would interoperate, at least to some extent, via GrimoireLab.
> 
> I think this conversation took place in a subgroup that was not on
> the list. It's also possible I missed it, but I think it will be
> helpful for the onboarding of new contributors if we provide some
> kind of clear road map.

This was prior to the creation of the list. That's why I wanted to
make it explicit here. Thanks for the heads up, Sean.

> > > In this regard, Prospector is already integrated, since it was
> > > ported to use GrimoireLab/Perceval for data retrieval when it was
> > > updated to newer versions of its dependencies.
> 
> OK, so from a deployment perspective, do we have two projects or one
> at this time? 

We proposed that Prospector be integrated with GrimoireLab at the
development level. In the previous conversations that I mentioned, the
original authors of Prospector preferred that, for the moment, it be
considered a separate project. There are ongoing conversations to
reconsider the integration, so far with no results (but also no
opposition), as far as I know. That's why for now, at least formally,
they are two different projects, and Prospector is not under the
GrimoireLab GitHub organization.

> > > WRT cregit, I've talked to Daniel German about using a new
> > > Perceval backend to extract the information it produces, and then
> > > showing it in GrimoireLab dashboards. In fact, I have written a
> > > Perceval backend that, once improved, could do the trick. But I
> > > need to find some time to update and improve it.
> 
> It seems like working on Perceval is a priority then? Should we have
> different “contributing” document sections in the repository for
> different layers?  For example, if we agree that perceval is our
> “back end”, presumably, then, there are also possibly “Web service
> /REST API” contributions and “front end” contributions needed.  

For producing a GrimoireLab dashboard showing information produced by
cregit in a meaningful way, my impression is that the way to go is
producing this Perceval backend. Please see a message I just sent about
the reasons.

WRT the structure of GrimoireLab, Perceval is the component retrieving
information from data sources (in this case, likely, the output from
"git blame", or similar), and producing JSON documents (technically,
Python dictionaries) with the retrieved data, each document
corresponding to a retrieved item.
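
To make "one document per retrieved item" a bit more concrete, here is a
minimal sketch that turns text in the `git blame --line-porcelain`
format into Python dictionaries, one per blamed line. It is not the
Perceval backend API: the function name `blame_items`, the choice of
kept fields and the sample data are all hypothetical, just to show the
shape of the idea.

```python
import json

# Header keys we keep from each porcelain entry (an arbitrary choice for this sketch).
KEPT_HEADERS = ("author", "author-mail", "author-time", "filename")


def blame_items(porcelain_text):
    """Yield one dict per blamed line from `git blame --line-porcelain` text."""
    item = None
    for line in porcelain_text.splitlines():
        if line.startswith("\t"):
            # A TAB-prefixed line carries the source line and closes the entry.
            if item is not None:
                item["content"] = line[1:]
                yield item
                item = None
        elif not line:
            continue
        elif item is None:
            # Entry header: <sha> <original line> <final line> [<group size>]
            fields = line.split()
            item = {"commit": fields[0], "line": int(fields[2])}
        else:
            key, _, value = line.partition(" ")
            if key in KEPT_HEADERS:
                item[key.replace("-", "_")] = value


# Made-up sample input, shaped like the porcelain format.
SAMPLE = (
    "1111111111111111111111111111111111111111 1 1 1\n"
    "author Alice\n"
    "author-mail <alice@example.com>\n"
    "author-time 1508500000\n"
    "summary Initial commit\n"
    "filename hello.py\n"
    "\tprint('hello')\n"
    "2222222222222222222222222222222222222222 2 2 1\n"
    "author Bob\n"
    "author-mail <bob@example.com>\n"
    "author-time 1508510000\n"
    "summary Say goodbye\n"
    "filename hello.py\n"
    "\tprint('bye')\n"
)

items = list(blame_items(SAMPLE))
for it in items:
    print(json.dumps(it, sort_keys=True))
```

A real backend would also have to deal with incremental retrieval,
identifiers and metadata, which this sketch ignores.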

From that point on, you can do different things with the documents
corresponding to the items in the data sources:

* Consume them from a Python script, to compute metrics or whatever.

* Store them in a database for later consumption. In GrimoireLab, we
have a component that stores them in ElasticSearch as "raw indexes".

* If you store them in the database, then you can produce all kinds of
data from them using scripts in any language that can access the
database. We have Python components for producing PDF reports, CSV
tables and Kibana dashboards based on that.
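
As a rough illustration of the first two options, the sketch below
consumes a few items from a Python script to compute a simple metric,
and serializes them one JSON document per line, as one might do before
storing them. The items are hand-written stand-ins: their layout (a
"data" dictionary with "commit" and "Author" keys) is an assumption
about what git items look like, and the helper names are hypothetical.
In a real setup the items would come from a Perceval backend, as shown
in the training pages linked below.

```python
from collections import Counter
import json

# Hand-written stand-ins for retrieved items (layout assumed, not authoritative).
ITEMS = [
    {"backend_name": "Git", "data": {"commit": "a1", "Author": "Alice <alice@example.com>"}},
    {"backend_name": "Git", "data": {"commit": "b2", "Author": "Bob <bob@example.com>"}},
    {"backend_name": "Git", "data": {"commit": "c3", "Author": "Alice <alice@example.com>"}},
]


def commits_per_author(items):
    """Count how many items (commits) each author has."""
    return Counter(item["data"]["Author"] for item in items)


def to_json_lines(items):
    """Serialize items as one JSON document per line, ready to be stored."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in items)


counts = commits_per_author(ITEMS)
for author, number in counts.most_common():
    print(number, author)

stored = to_json_lines(ITEMS)
```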

For more info on how to consume data produced by Perceval:

https://grimoirelab.gitbooks.io/training/content/perceval/perceval.html

https://grimoirelab.gitbooks.io/training/content/perceval/git.html

https://grimoirelab.gitbooks.io/training/content/perceval/github.html

(in the last two links, go to the bottom of the page, where you see
Python scripts).

For more info on how to develop Perceval backends:

https://grimoirelab.gitbooks.io/training/content/internals/perceval-backends.html

Maybe we should organize this information in some kind of CONTRIBUTING
document?

> Across all contributions I think we need to be clear about how
> specific “activity level metrics” and integrated views/calculations
> across activity level metrics reflect (or in some cases perhaps
> deviate from) the Metrics committee definitions.

The only place where you have computation of metrics is at the very
end, when you exploit the data in enriched indexes. All of this about
retrieving / storing data is more about how to efficiently have a
continuous pipeline of data coming from data sources than anything
else.

But yes, I completely agree that we have to work a lot on the
relationship between the metrics produced by GrimoireLab dashboards and
those defined by the Metrics TC.

> > > Then, I would like to find ways of including other projects, which
> > > could cover areas not already covered. Since GrimoireLab produces
> > > comprehensive databases with a lot of data from the original
> > > repositories, this should be easy. Any idea in this respect is
> > > welcome.
> 
> The ghdata project is keenly interested in sharing data providers and
> ultimately contributing code directly to this stack. Since the
> project is intended as an exploration ground and not a durable
> product, code may from time to time migrate into this project if it
> shows utility.  Jesus and I have started working out how that might
> happen.

Yes, indeed.

> > I'd say that we should produce some kind of onboarding guidelines.
> > This typically helps people to understand where to start from
> > several points of view.
> > 
> > For instance,
> > 
> > * What do I need to do if I want to integrate a non-supported data
> > source?
> 
> Write the code for perceval to support it? 

This could be one way of doing it.

Usually I like to see it the other way around: what can GrimoireLab do
to help you. That's why I wrote

https://grimoirelab.gitbooks.io/training/content/grimoirelab/intro/scenarios.html

Every place where you see a yellow box saying "script" is a "connecting
point" for any software willing to benefit from the GrimoireLab
architecture.

But I understand that in many cases, there is an already working piece
of software, and it would be more a matter of seeing how to make it
"connect" to the architecture. In that case, I agree with you, Sean:
one of the clearest cases would be connecting with a Perceval
component. Another one would be connecting with one of the databases
(either producing or consuming data).

> >  + First, this developer needs to check if that data source is not
> > currently supported
> 
> Presumably we could create a list and reference it in the
> “contributing” or “read me” files?

Yes, that's a good idea. That list is currently kept in Perceval, but
maybe we should make it more explicit in a CONTRIBUTING document. Right
now, the list is here:

https://github.com/grimoirelab/perceval/blob/master/README.md

> >  + Then, the developer should start in some place: a new Perceval
> > backend? directly creating a new ElasticSearch index?
> > 
> >  + How should I define a new ElasticSearch index? are there
> > guidelines? recommendations?
> 
> Can we treat both like service providers wrapped in something we call
> “perceval”, which I acknowledge may not actually be what *is*
> perceval today?

I like the idea of considering both scenarios as a kind of "service
providers". In the first case it would be Perceval itself, providing
the service of access to the data source API. In the second one it
would be the ElasticSearch database, providing the service of accessing
the data retrieved from a data source.
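
A minimal sketch of that "service provider" idea, just to illustrate:
both scenarios are put behind one small interface, so that code
consuming items does not care whether they come straight from the data
source or from a store. None of these class or function names exist in
GrimoireLab; they are hypothetical, and a plain Python list stands in
for the ElasticSearch index.

```python
from abc import ABC, abstractmethod


class ItemProvider(ABC):
    """Anything that can hand out items (dicts) retrieved from a data source."""

    @abstractmethod
    def items(self):
        """Return an iterable of items."""


class BackendProvider(ItemProvider):
    """First scenario: wraps a callable that fetches from the data source API."""

    def __init__(self, fetch):
        self._fetch = fetch

    def items(self):
        yield from self._fetch()


class IndexProvider(ItemProvider):
    """Second scenario: serves items that were already retrieved and stored."""

    def __init__(self, stored_docs):
        self._docs = list(stored_docs)

    def items(self):
        yield from self._docs


def count_items(provider):
    """Example consumer: works with any provider."""
    return sum(1 for _ in provider.items())


def fake_fetch():
    # Stand-in for a real retrieval call against a data source.
    yield {"id": 1}
    yield {"id": 2}


live = BackendProvider(fake_fetch)
stored = IndexProvider(live.items())  # "store" what was retrieved
print(count_items(live), count_items(stored))
```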

> > > There is also a specific case that maybe we could consider, which
> > > is ghData [2]. Since it is being actively used by the Metrics TC,
> > > it would be especially interesting to find ways of integrating it
> > > with GrimoireLab. Sean and I talked briefly about this in LA, and
> > > maybe we can try to follow up on that discussion.
> > > 
> > > [2] https://github.com/OSSHealth/ghdata
> 
> See this project context document for ghdata … we’re trying to make
> it clear “what the project is”, as it emerged as part of sorting out
> how to work together to define metrics, and in the process of forming
> CHAOSS … The focus is on human centered design, and the work is, we
> think, highly transferable into CHAOSS …
> https://github.com/OSSHealth/ghdata/blob/dev/ghdataContext.md

I very much like the way the information produced by ghData is
provided, and that's one of the reasons why I would like to help to
port it somehow to GrimoireLab (or the other way around, if you
prefer). I would like to be able to offer ghData, or something
evolved from it, to exploit the data gathered by GrimoireLab.

> A number of links related to GrimoireLab have also been shared
> previously, and are pasted here for convenience: 
> 
> https://grimoirelab.gitbooks.io/training/content/perceval/git.html
> 
> https://grimoirelab.github.io/
> 
> https://grimoirelab.gitbooks.io/training/content/cases-chaoss.html
> 
> https://grimoirelab.gitbooks.io/training/content/cases-chaoss/activity.html
> 

> > This would be a great example of how to integrate things and may
> > help to start that onboarding guideline.
> > 
> > Regards,
> > Daniel.
> > 
> > > As I understand it, currently ghData gets data from GHTorrent and
> > > GitHub. Maybe one step to take would be to explore to what extent
> > > we could have a Perceval backend to query git, GitHub or other
> > > data sources not currently supported. Or interfacing directly to
> > > the GrimoireELK database. (For a brief explanation of the role of
> > > Perceval and GrimoireELK in GrimoireLab, please have a look at [3]
> > > [4] [5].)
> > > 
> > > [3] https://grimoirelab.gitbooks.io/training/grimoirelab/intro.html
> > > [4] https://grimoirelab.gitbooks.io/training/grimoirelab/intro/components.html
> > > [5] https://grimoirelab.gitbooks.io/training/grimoirelab/intro/scenarios.html
> > > 
> > > Any comments on any of this?
> > > 
> > > Saludos,
> > > 
> > > 	Jesus.
> > > 
> > 
> > -- 
> > Daniel Izquierdo Cortazar, PhD
> > Chief Data Officer
> > ---------
> > "Software Analytics for your peace of mind"
> > www.bitergia.com
> > @bitergia
> > 
> > _______________________________________________
> > Chaoss-software mailing list
> > Chaoss-software at lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/chaoss-software
> 
> Sean P. Goggins
> Associate Professor, Computer Science
> Director, Data Science and Analytics Masters Program
> University of Missouri
> http://www.seangoggins.net
-- 
Bitergia: http://bitergia.com
/me at Twitter: https://twitter.com/jgbarah